Get reliable human feedback on your LLM traces.

Integrate Langfuse with Testable Minds to gather high-quality evaluations from our pool of 100k+ verified participants. Improve your models with feedback you can trust.

Powering Cutting-Edge Research At

Harvard University · University of Oxford · UCL · MIT · Royal Holloway

Human evaluation in 3 steps

Turn Langfuse traces into actionable, high-quality human scores and insights without building your own ops.

Connect with Langfuse

Link your Langfuse project (cloud or self-hosted) with API keys.
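As a sketch, this connection needs little more than a base URL and a key pair: Langfuse's public API uses HTTP Basic auth, with the public key as username and the secret key as password. The keys and host below are placeholders:

```python
import base64

# Placeholder credentials -- substitute your own Langfuse API keys.
LANGFUSE_HOST = "https://cloud.langfuse.com"  # or your self-hosted base URL
PUBLIC_KEY = "pk-lf-placeholder"
SECRET_KEY = "sk-lf-placeholder"

def basic_auth_header(public_key: str, secret_key: str) -> dict:
    # Langfuse's public API uses HTTP Basic auth:
    # public key as username, secret key as password.
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

headers = basic_auth_header(PUBLIC_KEY, SECRET_KEY)
# e.g. GET f"{LANGFUSE_HOST}/api/public/traces" with these headers
```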

Define what to score

Pick score configs and select which traces to send (tags or environments).

Get results back in Langfuse

We route work to verified participants with quality checks, then sync scores to your traces.
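Conceptually, each synced score is a small payload attached to a trace by id. The sketch below approximates the shape accepted by Langfuse's public scores endpoint (`POST {host}/api/public/scores`); consult the Langfuse API reference for the authoritative schema:

```python
import json

def build_score_payload(trace_id: str, name: str, value) -> str:
    # A score is attached to a trace by id, under a named metric.
    # Field names approximate the Langfuse public scores endpoint.
    return json.dumps({"traceId": trace_id, "name": name, "value": value})

payload = build_score_payload("trace-123", "helpfulness", 0.8)
```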

Our Minds contribute to cutting-edge research at leading institutions.

I've now used Testable Minds for studies across several years and I am impressed by the data quality.

Rob McIntosh

Senior Lecturer

University of Edinburgh

The process for resolving disputes is excellent and fair for both researchers and participants.

Jennifer Murphy

Lecturer

Royal Holloway

Simple and versatile; there is no need to spend hours in the lab to collect large samples.

Alejandro J. Estudillo

Senior Lecturer

Bournemouth University

Getting real-world feedback on our RAG pipeline was a game-changer. The integration made it incredibly simple.

David Chen

ML Scientist

Stanford University

Estimate cost

Costs are calculated from the estimated time to read and evaluate each trace.

Start Evaluation

Frequently Asked Questions

Everything you need to know about the integration.

How does the integration work?
It is an automated "Human-in-the-Loop" evaluation system. Testable Minds periodically fetches your specific Langfuse traces (filtered by tags or environments), distributes them to human evaluators, and automatically pushes their feedback back into Langfuse as scores.

Does it work with self-hosted Langfuse?
Yes. The integration supports Langfuse Cloud (US and EU/HIPAA) as well as self-hosted deployments. You simply need to provide the correct base URL and API keys during setup.

Who evaluates my traces?
Evaluators are drawn from the Testable Minds participant pool. These are real humans with verified identities and approval rates above 90%. You can further refine who evaluates your traces by setting specific demographic criteria and gender balance requirements.

How do you ensure evaluation quality?

We use a two-step approach:

  • Vetted Participants: We only use pre-screened participants with a history of high-quality work.
  • Attention Checks: Every evaluation session includes built-in attention checks. Responses that fail these checks are automatically excluded to ensure your Langfuse scores remain accurate.
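As a toy illustration of the second step (the data shape and field names are hypothetical, not Testable Minds' internal format), a session that fails any attention check is simply dropped before its scores are aggregated:

```python
def filter_valid_sessions(sessions):
    # Drop any evaluation session that failed at least one attention
    # check, so only trusted responses feed the final Langfuse scores.
    # (The "attention_checks" field name is hypothetical.)
    return [s for s in sessions if all(s.get("attention_checks", []))]

sessions = [
    {"score": 4, "attention_checks": [True, True]},   # kept
    {"score": 1, "attention_checks": [True, False]},  # failed -> excluded
]
valid = filter_valid_sessions(sessions)
```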

What types of data are supported?
Currently, the integration supports text-based input and output data.

Can I evaluate only a subset of my traces?
Yes. You don't have to evaluate everything. You can configure the integration to only pull traces that match specific Langfuse tags (e.g., external_eval) or environment labels (e.g., staging vs. production).
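A trace filter of this kind boils down to a small predicate. The sketch below assumes Langfuse-style `tags` and `environment` fields on each trace and reuses the example values from the answer above; the filtering logic itself is illustrative:

```python
def should_evaluate(trace, required_tags=frozenset({"external_eval"}),
                    environment="production"):
    # Select only traces carrying every required tag and the target
    # environment label. `tags` and `environment` mirror Langfuse
    # trace fields; the predicate itself is a sketch.
    return (required_tags.issubset(trace.get("tags", []))
            and trace.get("environment") == environment)
```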

Can I use multiple score configs in one study?
Yes. You can select multiple Langfuse score configurations (numeric, categorical, or boolean) for a single study. Participants will evaluate the trace against all selected criteria, and individual scores will be returned for each config.
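To make the three config types concrete, here is a hedged sketch of validating a returned value against a score config; the field names only approximate the Langfuse score-config schema, and the validation logic is illustrative:

```python
def validate_score(value, config):
    # Check a value against a Langfuse-style score config.
    # NUMERIC: optional min/max bounds; CATEGORICAL: allowed
    # categories; BOOLEAN: 0 or 1.
    dtype = config["dataType"]
    if dtype == "NUMERIC":
        low = config.get("minValue", float("-inf"))
        high = config.get("maxValue", float("inf"))
        return isinstance(value, (int, float)) and low <= value <= high
    if dtype == "CATEGORICAL":
        return value in config.get("categories", [])
    if dtype == "BOOLEAN":
        return value in (0, 1)
    return False
```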

How is cost calculated?
This is a budget-based system. Costs are dynamically calculated based on the length of the trace, the number of questions (score configs) you ask, and the number of human respondents you require per trace. You can top up your budget directly in your Testable account.
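As a purely illustrative model (all the default rates below are invented, not Testable Minds' actual pricing), the cost drivers listed above combine like this:

```python
def estimate_cost(words_per_trace, n_traces, n_questions, respondents_per_trace,
                  reading_wpm=200, seconds_per_question=15, rate_per_hour=9.0):
    # Total time = reading time + time to answer each question,
    # per trace, per respondent; cost = total hours * hourly rate.
    # All default rates here are made up for illustration.
    seconds_per_trace = (words_per_trace / reading_wpm) * 60 \
        + n_questions * seconds_per_question
    total_hours = seconds_per_trace * n_traces * respondents_per_trace / 3600
    return round(total_hours * rate_per_hour, 2)
```

Longer traces, more score configs, and more respondents per trace each scale the total linearly under this toy model.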

Is my data kept private?
Yes. You confirm PII (Personally Identifiable Information) compliance before launching a study. We only import the specific traces you authorize via your filters, and data is used solely for the evaluation session.

How quickly are new traces evaluated?
The system polls for new traces hourly. Once enough unassigned traces (default 20) are available, a batch is automatically created for human review. As soon as the reviews are complete, scores are pushed back to Langfuse.
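The hourly batching can be sketched as follows; the 20-trace default comes from the answer above, while the helper itself is hypothetical:

```python
def make_batches(unassigned, batch_size=20):
    # Each hourly poll, full batches of `batch_size` unassigned traces
    # go out for human review; any remainder waits for the next poll.
    full = len(unassigned) // batch_size
    batches = [unassigned[i * batch_size:(i + 1) * batch_size]
               for i in range(full)]
    leftover = unassigned[full * batch_size:]
    return batches, leftover
```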

Where can I find setup instructions?
For a detailed step-by-step walkthrough of the setup process, including how to generate API keys and configure your study settings, please refer to the official Langfuse → Testable Minds Integration Guide.