
Best LLM Observability Platforms for Anthropic and OpenAI Stacks (2026)
I instrumented a Claude Sonnet 4.6 + pgvector RAG app across 7 LLM observability platforms — Langfuse, Helicone, Phoenix, Braintrust, Opik, Traceloop, Lunary.
Picking the best LLM observability tools used to mean choosing between a Datadog dashboard hacked into your prompt logs or a homegrown SQLite trace viewer. In 2026 you have seven serious platforms, four of them open source, and pricing that starts at zero. This post benchmarks the tools I run in production for a Claude API + RAG stack: tracing depth, eval support, self-hosting, OpenAI-compatible proxying, and what a real bill looks like at 5M spans/month.
TL;DR: The 2026 winners
If you only read the table, here is which tool I would integrate this week by use case. All seven support Claude, OpenAI, Gemini, and any OpenAI-compatible endpoint via SDK or proxy.
| Use case | Winner | Why | Free tier |
|---|---|---|---|
| Open-source self-host | Langfuse | Full eval + tracing + prompt mgmt, MIT license on core | 50k events/mo cloud, unlimited self-host |
| Drop-in proxy with caching | Helicone | One-line base URL change, async logging, prompt caching dashboard | 10k requests/mo |
| RAG and agent debugging | Arize Phoenix | OTel-native, span-level retrieval visualization, runs locally | Free OSS, paid SaaS optional |
| Eval-first workflows | Braintrust | Best-in-class eval UX, online + offline scoring | 1M trace events/mo |
| Prompt iteration + datasets | Comet Opik | Apache 2.0, strong dataset versioning | Free OSS + 10k traces/mo cloud |
Below I cover Traceloop OpenLLMetry and Lunary in the per-tool breakdowns, plus when to skip a dedicated platform and just use OpenTelemetry to your existing APM.
How I selected these tools
I spent two weeks instrumenting the same production app — a Claude Sonnet 4.6 chat backend with a pgvector RAG pipeline doing roughly 800k LLM calls per week — across every platform on this list. Tools that did not survive past day one are not in this article.
Selection criteria, weighted in this order:
- Tracing depth. Can I see token counts, cost in USD, latency, and the full input/output for every call? Can I follow a conversation across multi-step agent runs?
- Eval support. LLM-as-judge, regex/JSON checks, human review, and a way to run the same eval over a saved dataset.
- Self-host story. Docker Compose to a working stack in under 10 minutes, plus a path to Kubernetes when traffic grows.
- SDK ergonomics. Native Anthropic and OpenAI SDK wrappers in Python and TypeScript. OpenTelemetry export as a fallback.
- Pricing predictability. Per-event pricing that does not surprise you at 10M spans, plus a generous free tier for prototyping.
- Active maintenance. Commits in the last 14 days, public roadmap, response time on GitHub issues.
Tools I tested and dropped: LangSmith (tied to LangChain workflows that I do not run), Weights & Biases Weave (slower ingest, geared to ML training), HoneyHive (closed source, weak free tier), and a couple of YC startups that vanished mid-test. If your stack is LangChain-only, LangSmith is still the obvious pick — but the rest of the field has caught up.
Top 7 LLM observability tools, ranked
Ranking is opinionated. I weighted self-host viability, eval depth, and Anthropic SDK support more than logo count on the marketing page. Each tool gets a "best for / skip if" cut, current pricing, and the specific integration pattern I shipped.
1. Langfuse
Best for: teams that want one open-source platform for tracing, evals, prompt management, and dataset versioning, with a credible self-host story. Skip if: you only need request logging and want zero infrastructure.
Pricing: Cloud free tier covers 50k events per month with 30-day retention. Pro is $59/month for 100k events plus $10 per additional 100k. Self-hosted is free and now ships v3 with ClickHouse-backed traces — handles 10M+ events on a $40/month VPS in my tests.
Integration: One decorator from langfuse.decorators in Python, a Langfuse() client constructor in TypeScript, or the OpenAI/Anthropic SDK wrappers. With Anthropic Python SDK 0.42:
```python
from langfuse.anthropic import Anthropic

client = Anthropic()
client.messages.create(model="claude-sonnet-4-6", messages=[...])
```
Token counts, USD cost, latency, and the full prompt and completion show up in the Langfuse UI within 2 seconds. Evals support LLM-as-judge with custom Claude 4.5 Haiku judges, plus Ragas for retrieval scoring. Prompt management lets you version prompts in the UI and pull them by name and version at runtime, which removes hardcoded prompts from your repo. Langfuse is currently trending among GitHub TypeScript repos — no coincidence: the v3 ClickHouse migration shipped two weeks ago.
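The name-plus-version pull pattern is easy to picture with a toy registry. `PromptStore` below is an illustrative stand-in, not the Langfuse API, but the shape — push a versioned template, fetch and fill it at runtime — is the same:

```python
# Minimal sketch of the versioned-prompt pattern: templates live in a
# store keyed by (name, version) and are compiled with variables at
# runtime instead of being hardcoded in the repo.
# PromptStore and its methods are illustrative, not the Langfuse SDK.
from string import Template

class PromptStore:
    def __init__(self):
        self._prompts = {}  # (name, version) -> template string

    def push(self, name, version, template):
        self._prompts[(name, version)] = template

    def get(self, name, version):
        return Template(self._prompts[(name, version)])

store = PromptStore()
store.push("rag-answer", 1, "Answer using only: $context\n\nQuestion: $question")
prompt = store.get("rag-answer", 1).substitute(
    context="pgvector docs", question="What is an HNSW index?"
)
```

Rolling a prompt forward is then a `push` with version 2 plus a one-line change at the call site, and the old version stays addressable for rollbacks.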
2. Helicone
Best for: teams that want observability with one base URL change and no SDK rewrite. The proxy pattern means every existing OpenAI or Anthropic client logs automatically. Skip if: you cannot route through a proxy (regulated environments) or you need deep eval workflows out of the box.
Pricing: Free for 10k requests per month. Pro is $20/seat with 100k requests included; usage above is $1 per 10k requests. Self-host is fully open source under Apache 2.0 — Docker Compose stack with ClickHouse and Postgres ready in under 5 minutes.
Integration: Change the base URL on your existing client. For Anthropic Python:
```python
from anthropic import Anthropic

client = Anthropic(
    base_url="https://anthropic.helicone.ai/",
    default_headers={"Helicone-Auth": f"Bearer {HELICONE_KEY}"},
)
```
That is the entire integration. Helicone shines on prompt caching analytics: it surfaces which prompt prefixes hit the Anthropic cache and which miss, with exact USD savings calculated per request. The async logging mode adds zero latency — calls go straight to Anthropic and Helicone batches the metadata after. Evals are supported but feel like a v0 surface compared to Langfuse or Braintrust. Use Helicone for cost and latency, layer something else for evals if you need them.
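To see why the caching dashboard matters, here is the per-request savings arithmetic reduced to a sketch. The per-token rates are placeholder assumptions, not current Anthropic pricing:

```python
# Back-of-envelope version of the savings number Helicone surfaces:
# compare what cached prefix tokens cost at the cache-read rate versus
# the full input rate. Rates are illustrative placeholders.
INPUT_RATE = 3.00 / 1_000_000        # USD per input token (assumed)
CACHE_READ_RATE = 0.30 / 1_000_000   # USD per cached-read token (assumed)

def cache_savings(cached_tokens: int, uncached_tokens: int) -> float:
    """USD saved on one request by reading the prefix from cache."""
    full_price = (cached_tokens + uncached_tokens) * INPUT_RATE
    actual = cached_tokens * CACHE_READ_RATE + uncached_tokens * INPUT_RATE
    return full_price - actual

# A 20k-token system prompt served from cache on every call adds up fast.
print(round(cache_savings(cached_tokens=20_000, uncached_tokens=500), 4))
```

Multiply that per-request figure by call volume and the dashboard's value is obvious: a prefix that silently stops hitting the cache shows up as a cost spike within the hour.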
3. Arize Phoenix
Best for: RAG and agent debugging where you need to see retrieval quality and tool-call traces side-by-side. OpenTelemetry-native, so it slots into existing observability stacks. Skip if: you want a hosted SaaS with multi-user RBAC out of the box — Phoenix is open source and self-host first.
Pricing: Phoenix the OSS project is free under Elastic License v2. Arize AX (the paid SaaS) starts at $50/month per user for production-scale deployments with audit logs and SSO. For solo and small-team builds, the OSS version covers everything.
Integration: Phoenix instruments via OpenInference, an OTel-compatible spec for LLM workloads. Four lines:
```python
from phoenix.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor

register(project_name="my-rag-app")
AnthropicInstrumentor().instrument()
```
Phoenix's killer feature is the embedding visualization view: it runs UMAP on your retrieved chunks and the input query, then highlights when retrieval is failing because the query embedding lives in a different cluster than the relevant docs. For pgvector or any RAG pipeline, that single view has caught more bugs than my eval scripts did. Arize also ships `phoenix.evals` with prebuilt judges for hallucination, relevance, and toxicity — useful, though Braintrust still wins on eval ergonomics.
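The cluster-distance signal can be approximated without UMAP at all. This self-contained sketch (hypothetical threshold and vectors) flags a query whose embedding sits far from the centroid of its retrieved chunks:

```python
# Crude, hand-rolled version of the signal Phoenix's UMAP view gives
# you visually: if the query embedding is far from the centroid of its
# retrieved chunks, retrieval is probably pulling from the wrong
# cluster. Threshold and vectors are illustrative.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieval_drift(query_vec, chunk_vecs, threshold=0.5):
    # Average the retrieved-chunk embeddings dimension by dimension.
    centroid = [sum(dims) / len(chunk_vecs) for dims in zip(*chunk_vecs)]
    return cosine(query_vec, centroid) < threshold  # True == suspicious

# Query points one way, every retrieved chunk points the other.
print(retrieval_drift([1.0, 0.0], [[0.0, 1.0], [0.1, 0.9]]))
```

Phoenix does this across thousands of traces at once and lets you click into the outliers, which is the part a scalar metric cannot replace.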
4. Braintrust
Best for: teams whose primary pain is "is the new prompt better?" rather than "where did production break?" Braintrust treats evals as a first-class object: datasets, scorers, and experiments all version-controlled. Skip if: you need self-hosting (Braintrust is closed source SaaS) or your bottleneck is request logging, not eval iteration.
Pricing: Free tier covers 1M trace events per month with 14-day retention. Pro is $249/month for unlimited seats and 5M events; enterprise is custom and includes private cloud. Braintrust is the only tool here without a self-host option, which is the trade-off for the polish.
Integration: The braintrust Python and TypeScript SDKs wrap Anthropic and OpenAI calls. The interesting flow is the eval CLI:
```bash
npx braintrust eval my_eval.ts
```
Point it at a TypeScript file that exports a dataset, a task function, and a list of scorers. Braintrust runs the task across the dataset, applies scorers (LLM judges, regex, custom code), and shows you a diff against your last experiment. CI integration via GitHub Action gates merges on regression — I block PRs when the new prompt drops F1 below 0.85 on a 200-row golden set. It's the only tool where I keep the eval suite in version control next to my source code without it feeling bolted on.
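The merge gate boils down to a few lines once you strip away the CLI. This Python sketch uses a hypothetical five-row stand-in for the 200-row golden set:

```python
# Core of the CI gate described above: score the new prompt's
# predictions against a golden set and fail the build if F1 drops
# below 0.85. Labels are illustrative stand-ins for the real dataset.
def f1(preds, golds, positive="yes"):
    tp = sum(p == positive == g for p, g in zip(preds, golds))
    fp = sum(p == positive != g for p, g in zip(preds, golds))
    fn = sum(g == positive != p for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

golds = ["yes", "yes", "no", "yes", "no"]
preds = ["yes", "yes", "no", "yes", "yes"]  # one false positive
score = f1(preds, golds)
assert score >= 0.85, f"prompt regression: F1 {score:.2f} < 0.85"
```

Braintrust's value-add over this sketch is the diff view: it shows you exactly which rows flipped between experiments, not just the aggregate number.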
5. Comet Opik
Best for: teams that want Langfuse's feature set with a slightly different bias toward dataset and experiment tracking, plus a familiar name behind it (Comet has been doing ML observability since 2017). Skip if: you do not want yet another vendor account; Langfuse covers similar ground with arguably more momentum.
Pricing: OSS is Apache 2.0, fully self-host with Docker Compose. Cloud free is 10k traces per month; paid plans start at $39/month for 100k traces. Pricing is competitive but a hair behind Helicone on the proxy use case and Langfuse on the eval depth.
Integration: Decorator-based, similar to Langfuse:
```python
from opik import track

@track
def call_claude(prompt):
    return anthropic_client.messages.create(...)
```
Where Opik genuinely differs is dataset versioning and the integration with Comet's broader ML platform, which matters if you train fine-tunes alongside your inference stack. The "online evaluation" feature lets you run scorers against a sampled percentage of production traffic, which is something Langfuse only added late 2025. If you already use Comet for model training, Opik is the obvious unified pick. Otherwise, the choice between Opik and Langfuse comes down to UI taste — try both for an afternoon.
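The online-evaluation idea — score a sampled slice of traffic rather than all of it, so judge costs stay bounded — is worth seeing in miniature. `maybe_score`, the rate, and the scorer are illustrative, not Opik's API:

```python
# Sketch of sampled online evaluation: every trace gets logged, but
# only a configurable fraction is handed to an (expensive) scorer.
# SAMPLE_RATE and maybe_score are hypothetical, not the Opik SDK.
import random

SAMPLE_RATE = 0.05  # score roughly 5% of production calls

def maybe_score(trace, scorer, rng=random.random):
    """Run the scorer on ~SAMPLE_RATE of traces; skip the rest."""
    if rng() < SAMPLE_RATE:
        return scorer(trace)
    return None  # trace is still logged, just not scored

# Injecting rng makes the sampling decision deterministic for tests.
score = maybe_score({"output": "fine"}, scorer=lambda t: 1.0, rng=lambda: 0.01)
print(score)
```

At 800k calls a week, a 5% sample with a Haiku-class judge keeps eval spend in the tens of dollars while still catching distribution-level regressions.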
6. Traceloop OpenLLMetry
Best for: teams already running Datadog, New Relic, Honeycomb, or any OTel-compatible APM that want LLM spans flowing into the same backend without a second dashboard. Skip if: you want a purpose-built UI for prompt iteration or evals — OpenLLMetry is plumbing, not a product.
Pricing: The OpenLLMetry SDK is fully open source under Apache 2.0. Traceloop's hosted backend is optional; pricing starts at $50/month for 1M spans. Most teams I know use the SDK to ship to their existing APM and never touch the Traceloop SaaS.
Integration: One line at app boot:
```python
from traceloop.sdk import Traceloop

Traceloop.init(app_name="prod-rag", api_endpoint="https://otel.your-apm.com")
```
The SDK auto-instruments Anthropic, OpenAI, Cohere, vector stores (Pinecone, Weaviate, pgvector via SQLAlchemy), and frameworks (LangChain, LlamaIndex, CrewAI). Spans follow the OpenTelemetry semantic conventions for GenAI, which were ratified in late 2025. If your platform team already standardized on OTel, this is the path of least resistance — and it pairs well with one of the dashboards above for the LLM-specific views OTel APMs do not yet handle. See the openllmetry GitHub repo for the current instrumentation matrix.
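For a feel of what those conventions buy you, here is one span's attributes written out by hand. The keys follow the published `gen_ai.*` semconv names as I understand them; verify against the current spec release before hard-coding any of them:

```python
# What a GenAI semantic-convention LLM span looks like as raw
# attributes. Values are illustrative; key names should be checked
# against the current OpenTelemetry GenAI semconv release.
span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-sonnet-4-6",
    "gen_ai.usage.input_tokens": 1834,
    "gen_ai.usage.output_tokens": 212,
}

# Because every instrumented vendor emits the same keys, any OTel
# backend can aggregate cost and token usage without vendor-specific
# parsing.
assert all(k.startswith("gen_ai.") for k in span_attributes)
```

This is the whole pitch: your existing Datadog or Honeycomb queries group by `gen_ai.request.model` the same way they group by `http.route` today.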
7. Lunary
Best for: solo builders and small teams who want a clean tracing UI without the enterprise feature bloat. Lunary is the lightest of the seven and the easiest to spin up. Skip if: you need advanced eval orchestration or your team is past 5 engineers and needs SSO/RBAC out of the box.
Pricing: Apache 2.0 licensed, free self-host. Cloud free covers 1k events per day (about 30k/month); paid starts at $20/month for 50k events. Cheapest paid tier in the lineup, which makes it a good fit for indie projects.
Integration: A wrapper around your LLM client:
```python
import lunary

lunary.monitor(anthropic_client)
```
That single call patches every messages.create request to log to Lunary. The UI emphasizes user-level analytics — which user is generating the most cost, which prompts have the worst latency p95 — which matters more for B2C apps than for internal tools. Lunary added a prompt management console in late 2025 that competes with Langfuse, though the eval surface is still thin. For a side project on Anthropic's free $5 credits, Lunary plus the Lunary cloud free tier gets you full observability for $0/month, which is the use case it nails.
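The user-level rollup is simple enough to reproduce from raw logs, which is a good way to sanity-check the dashboard. A sketch over four hypothetical events:

```python
# Hand-rolled version of the user-level analytics Lunary's UI
# emphasizes: cost share per user and a latency p95 from logged
# events. The events are illustrative.
from collections import defaultdict

events = [
    {"user": "alice", "cost": 0.04, "latency_ms": 820},
    {"user": "alice", "cost": 0.31, "latency_ms": 2900},
    {"user": "bob",   "cost": 0.02, "latency_ms": 640},
    {"user": "bob",   "cost": 0.03, "latency_ms": 710},
]

cost_by_user = defaultdict(float)
for e in events:
    cost_by_user[e["user"]] += e["cost"]
total = sum(cost_by_user.values())
shares = {u: c / total for u, c in cost_by_user.items()}

# Nearest-rank p95 over the latency samples.
latencies = sorted(e["latency_ms"] for e in events)
p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]

print(shares["alice"])  # alice drives most of the spend
print(p95)
```

When one user accounts for most of the bill, that is usually the first thing you want a B2C dashboard to scream about, and it falls out of two group-bys.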
Honorable mentions
LangSmith. Still the strongest pick if your stack is LangChain or LangGraph end-to-end. The native integration shows agent state across nodes in a way no other tool matches. Outside the LangChain ecosystem the value drops sharply, which is why it did not make the main list. Pricing is $39/month per developer with 5k traces.
PromptLayer. Long-running player focused on prompt versioning and A/B testing. Lighter on tracing than the tools above. Worth considering if your team mostly iterates on prompts in the UI rather than in code. $50/month starting tier.
Tools I would no longer recommend in 2026: Vellum (pivoted toward enterprise workflows), HoneyHive (slow shipping), and any LLM observability tool whose last commit predates the GenAI OTel spec ratification — without that, you are buying into a non-standard schema.
How to choose
Pick by your top constraint, in this order:
- Cannot send data to a third party. Self-host Langfuse, Phoenix, Helicone, or Opik. All four ship Docker Compose files that boot in under 10 minutes.
- Already on OpenTelemetry / Datadog / Honeycomb. Start with Traceloop OpenLLMetry SDK pushing to your existing backend. Layer Phoenix on top if you need RAG-specific views.
- Eval iteration is the bottleneck. Braintrust, then Langfuse. Skip the proxy-only tools.
- Cost optimization is the bottleneck. Helicone for the prompt caching dashboard. Pair with Langfuse for everything else.
- Solo builder, ship today. Lunary cloud free tier or Langfuse cloud free tier — both work in under 5 minutes.
And one anti-pattern: do not run two LLM observability tools simultaneously in production. The duplicate logging doubles your latency overhead and gives you two dashboards to reconcile. Pick one, learn it deeply, switch only if you hit a hard wall.
FAQ
Do I really need a dedicated LLM observability tool, or is my existing APM enough? If you are doing more than 100k LLM calls per month or running RAG with retrieval evals, you need LLM-specific tooling. Standard APMs treat LLM calls as opaque HTTP requests — no token counts, no cost tracking, no prompt diffing. The OpenTelemetry GenAI spec is closing the gap, but the LLM-specific dashboards still live in Langfuse, Phoenix, and friends.
Which LLM observability tool has the best Anthropic Claude support? Langfuse and Helicone are tied. Both ship native Anthropic Python and TypeScript wrappers, log token-level cost using Anthropic's published per-model pricing, and surface prompt cache hit rates. If you are building on the Anthropic API, either works. For more on Claude-specific patterns, see our Claude Opus 4.7 deep dive.
Can I self-host these tools on a $5/month VPS? Lunary and Helicone, yes. Langfuse v3 needs ClickHouse, so plan for a $20-40/month box at minimum. Phoenix runs locally for development but production traffic wants real infrastructure.
How do these tools handle PII and prompt redaction? Langfuse, Helicone, and Phoenix all support config-driven redaction patterns at the SDK layer before data leaves your app. Braintrust supports field-level encryption. For HIPAA workloads, only the self-hosted variants are defensible — verify with your compliance team before pointing prompts at any SaaS.
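SDK-layer redaction usually amounts to a masking pass over the payload before it leaves your process. A minimal sketch — two illustrative regex patterns, nowhere near a complete PII list:

```python
# Sketch of config-driven redaction before logging: each pattern is
# replaced with a placeholder token so the observability backend
# never sees the raw value. Patterns are illustrative only.
import re

REDACTION_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
]

def redact(text: str) -> str:
    """Apply every redaction pattern in order and return the masked text."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(redact("Contact jane.doe@example.com, SSN 123-45-6789."))
```

The key property is that masking runs in your process: by the time a trace hits Langfuse, Helicone, or Phoenix, the raw values are already gone, which is why this pattern is defensible where post-hoc deletion is not.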
Try it this week
Pick one tool and one app. Spend 30 minutes wiring observability into your highest-traffic LLM endpoint, then leave it running for a week before deciding. The patterns that show up in real production traffic — cache miss spikes at 3 a.m., a single user generating 40% of cost, a prompt template silently drifting on Claude 4.7 — only surface when you instrument early. If you are still picking a model under those traces, our 2026 AI coding agents recap and multi-modal RAG walkthrough pair well with whatever observability tool you ship today. The best LLM observability tools are the ones already running before the next outage.