
Langfuse Experiments Rebuild: What LLM Devs Need to Know (2026)
Langfuse rebuilt experiments as a first-class concept on April 13, 2026. Here is what changes for LLM evaluation workflows and how to migrate this week.
On April 13, 2026, the Langfuse team shipped an experiments rebuild that promotes experiments to a first-class concept inside the platform. If you ship LLM apps that need versioned evaluations, this is the most consequential Langfuse change of 2026 so far. The Langfuse experiments rebuild moves runs out from under datasets and into their own top-level surface, and it changes how LLM developers structure regression tests, eval pipelines, and prompt comparisons.
What changed in the Langfuse experiments rebuild
Before April 13, an experiment in Langfuse only existed in the context of a dataset. You opened a dataset, picked the experiments tab, and viewed runs against that fixed input set. That coupling worked for small, hand-curated datasets, but it broke down once teams started running evals against sampled production traces or pushing custom data straight from CI scripts.
The rebuild lifts experiments to the same level as Datasets in the Langfuse navigation. You now get a project-wide experiments view, comparison across runs that used different input sources, and progress tracking over time. Each run is still an immutable snapshot, but the snapshot can come from a curated dataset, a sliced production trace window, or any data your evaluation script pushed via the Langfuse SDK v4.
The other quiet but important shift: experiments now have stable identities you can reference from the Python and TypeScript SDKs without traversing a dataset object first. That makes experiments easier to wire into GitHub Actions, Dagger pipelines, or any pre-deploy gate where you want to fail the build if the eval score regresses against a baseline run.
Why LLM builders should care
Most teams shipping production LLM features still treat evaluations as an afterthought: a one-off notebook, a Slack screenshot of GPT-4 vs Claude side-by-side, maybe a CSV stored in someone's Drive. That works until a prompt change quietly breaks 8% of edge cases and you find out from a customer ticket. The Langfuse experiments rebuild makes the disciplined version of evals cheap enough that there's no excuse not to do it.
The rebuild has three concrete payoffs for builders.
1. Regression catching: every prompt edit, model swap, or RAG retriever tweak can be measured against the prior best run on the same data.
2. Multi-source eval: you can finally compare a curated golden-set run against a sampled-from-prod run in one view, which is how you catch the gap between staging accuracy and the messy real world.
3. History as a graph: experiments now plot over time, so you see whether your "improvements" actually trended a metric up or just moved the needle on cherry-picked examples.
If you've been on the fence about adding observability to your stack, this rebuild is a good forcing function. For a broader landscape of options, see our 2026 LLM observability platforms guide. Langfuse is open source under MIT and now has 26K GitHub stars, so the lock-in cost is genuinely low — you can self-host on Docker Compose or run on the cloud free tier while you wire up the first eval.
What changes in your evaluation workflow
If you already use Langfuse experiments, the migration is mostly mental. The old dataset-tab path still works, but the new top-level Experiments view is where you'll spend time. Three workflow shifts matter most.
1. You can run experiments without a dataset. Push any list of inputs and outputs to langfuse.experiment() via the SDK and it lands in the experiments view as a standalone run. This unblocks the common pattern of "I want to eval the last 200 prod traces tagged support-chat" — no dataset object needed. (A minimal sketch follows after point 3.)
2. Compare-across-runs now works across data sources. The compare view used to demand both runs share a dataset. The rebuild drops that constraint. You can put a golden-set run next to a prod-sample run and diff scores at the row level, which is the closest the OSS world has come to a proper eval IDE.
3. SDK v4 experiments live alongside categorical and boolean LLM-as-a-judge scores. Langfuse shipped boolean judges on April 8 and categorical judges on March 20. Combined with the rebuild, you can run an experiment that emits a boolean pass/fail per row plus a categorical bucket, then compare runs on either dimension. That's a pattern you used to need Promptfoo or a custom harness to do cleanly.
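To make patterns 1 and 3 concrete, here is a minimal Python sketch of a standalone experiment with a boolean judge. It leans on the langfuse.experiment() entry point described above, but the parameter names (`name`, `data`, `task`, `evaluators`) and the evaluator return shape are assumptions about the post-rebuild SDK, not confirmed API. Treat it as the shape of the workflow, not copy-paste code.

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads host and keys from LANGFUSE_* env vars

# Any list of rows works -- no dataset object required post-rebuild.
cases = [
    {"input": "How do I reset my password?"},
    {"input": "Cancel my subscription today."},
]

def run_support_chat(question: str) -> str:
    # Stand-in for your real app code (RAG chain, agent, etc.).
    return f"Per docs.example.com, here is how to proceed with: {question}"

def cites_source(*, input, output, **kwargs):
    # Boolean LLM-as-a-judge stub: one pass/fail score per row.
    # A real judge would call a model; the return shape is an assumption.
    return {"name": "cites_source", "value": "docs.example.com" in output}

# ASSUMED signature per the description above -- verify against your SDK.
run = langfuse.experiment(
    name="support-chat-standalone",
    data=cases,
    task=lambda item: run_support_chat(item["input"]),
    evaluators=[cites_source],
)
```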
5 action items for this week
If you ship LLM features and don't yet have a versioned eval workflow, here is a one-week plan to get the new Langfuse experiments paying for themselves.
- Spin up Langfuse on Docker Compose (15 minutes). Clone `langfuse/langfuse`, copy `.env.dev.example` to `.env`, and run `docker compose up`. Point your app's OpenAI or Anthropic client at the Langfuse SDK wrapper and traces will start flowing on port 3000.
- Tag the 5 highest-stakes prompts in your codebase. Pick the prompts where a regression would hurt customers, not the ones that change daily. Wrap their LLM calls with Langfuse's `@observe()` decorator (Python) or `observe()` wrapper (TypeScript) so traces include input/output pairs you can later sample; a decorator sketch follows this list.
- Build a golden set of 30-50 cases. 50 is enough for noise to wash out. Pull representative inputs from prod traces, hand-write expected outputs or pass criteria, and push them as a Langfuse dataset. Don't aim for 500 — that's a procrastination trap.
- Wire one boolean LLM-as-a-judge eval. Use the new judge feature to score each output on a single yes/no axis (e.g. "does the response cite the source document?"). Run it against the golden set and save the experiment as your baseline.
- Add a CI check that compares the latest run to the baseline. Use the SDK to fetch experiment scores and fail the build if the pass rate drops more than 3 percentage points; a sketch of the gate follows below. This is the step that turns evals from a checklist item into an actual safety net. If you're also exploring AI-driven test generation, our AI test generation tools roundup covers complements like Qodo and Codium.
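For the tagging step, the decorator pattern is only a few lines. This is a minimal Python SDK 3.x sketch; the model call inside is a placeholder for your own prompt, and `answer_support_question` is an invented name.

```python
from langfuse import observe  # Python SDK 3.x top-level export
from openai import OpenAI

client = OpenAI()

@observe()  # records the function's input/output pair as a Langfuse trace
def answer_support_question(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```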
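And for the CI gate, the comparison logic is the durable part. The score-fetching call below is deliberately left as a stub, since this post doesn't pin down the exact read API for the new experiment paths; wire `fetch_pass_rate()` to whatever your SDK version exposes.

```python
import sys

BASELINE = "support-chat-golden-baseline"   # your saved baseline run
CANDIDATE = "support-chat-golden-latest"    # run produced by this CI job
MAX_DROP = 0.03                             # fail on a >3-point drop

def fetch_pass_rate(run_name: str) -> float:
    # STUB: replace with the real score-fetch call for your SDK version;
    # the exact method for reading experiment scores isn't specified here.
    raise NotImplementedError(f"wire this to your SDK for {run_name}")

baseline, candidate = fetch_pass_rate(BASELINE), fetch_pass_rate(CANDIDATE)

if candidate < baseline - MAX_DROP:
    print(f"Eval regression: {candidate:.1%} vs baseline {baseline:.1%}")
    sys.exit(1)  # fail the build
print(f"Eval gate passed: {candidate:.1%} (baseline {baseline:.1%})")
```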
That's roughly an afternoon for the setup and a couple of focused sessions for the eval design. The compounding starts on day one: every subsequent prompt change runs against a measurable baseline.
What to watch next
Two threads are worth tracking after the experiments rebuild. First, ClickHouse-owned Langfuse (the acquisition closed January 2026) is leaning harder into analytics depth — expect richer aggregations on experiment scores in the next two release cycles, and likely SQL-style ad-hoc queries against trace data. Second, the Langfuse CLI, shipped February 17, is becoming the path of least resistance for syncing prompts and datasets between branches; pair it with the experiments rebuild and you get a Git-native eval workflow without bolting on a separate tool.
If you're building agent-heavy systems where every step needs traceability, the rebuild also pairs cleanly with reproducible config patterns — see our writeup on reproducible AI agent configs. And if you're benchmarking the underlying model layer, the Claude Opus 4.7 deep dive covers what changed at the model layer this month.
FAQ
Is the Langfuse experiments rebuild a breaking change? No. Existing dataset-scoped experiments still appear in their original location and continue to work. The rebuild adds a new top-level view and new SDK paths; it doesn't deprecate the old ones.
Do I need Langfuse Cloud or can I self-host? Both work. The OSS edition under MIT license includes the experiments rebuild. Self-host on Docker Compose, Kubernetes via Helm, or one of the Terraform modules for AWS/GCP/Azure. Cloud is faster to start; self-host is faster to prove out compliance.
How does Langfuse experiments compare to Promptfoo or LangSmith? Promptfoo is CLI-first and excellent for one-shot regression suites; Langfuse experiments are platform-native, persistent, and tied to your trace data. LangSmith is the closest commercial peer but is closed source and tied to LangChain. The experiments rebuild narrows Langfuse's gap with LangSmith on the eval surface specifically.
What SDK version do I need? Langfuse Python SDK 3.x and TypeScript SDK v4 GA both support the new experiments paths. Older v2 TypeScript SDKs work for tracing but lack the experiment helpers; upgrade if you want CI-native flows.