AI Agent Observability: How to See What Coding Agents Are Actually Doing

What it is: A practical guide to seeing what AI coding agents actually do during a run — the traces, token counts, tool calls, and eval signals that turn the agent from a black box into something you can audit, debug, and improve. Covers the four observability pillars, the OpenTelemetry GenAI standard, and 10 tools (Langfuse, Helicone, Phoenix, Logfire, Weave, Honeycomb, Datadog, LangSmith, Opik, Galileo) with May 2026 pricing.
Who it is for: Developers using Claude Code, Cursor, or Codex who’ve felt their agent disappear into a 20-minute black box and come back with output they can’t audit.
Best if: You want named tools, real prices, and a 30-minute MVP setup — not a theory tour.
Skip if: You haven’t written a script that calls an LLM API yet. Start with Why AI Coding Agents Fail first.

Heads up — this is a more intermediate AI topic. Brand-new to AI? Start with How to Use AI and Best AI Tools for Beginners first. This post assumes you’ve used a coding agent at least once and are comfortable adding an SDK to a Python or Node project. It builds on Why AI Coding Agents Fail, What is AGENTS.md?, Long-Running Claude Code Tasks, and AI Agent Verification.

What does “agent observability” actually mean?

Agent observability is the practice of recording what an autonomous agent did, why, in what order, how long it took, and at what cost: enough signal that you can audit a 20-minute run you didn’t watch in real time. It differs from app-level observability (HTTP latency, database queries, error rates) for one important reason: agents are non-deterministic. The same prompt can produce different traces. So the basic question isn’t just “did this break?” but “what did it actually do, and was that what I wanted?”

Most developers feel this as the “black box” problem. The coding agent goes off for 20 minutes, returns a diff, and the user has no way to verify which files it read, which shell commands it ran, what it searched for, or why it picked one approach over another. Without observability you cannot audit, debug, replicate, or improve the run.

Running an agent without observability works fine for one or two sessions. Past that, it breaks down. Failure modes (the agent looping, calling the wrong tool, hallucinating a file path) stay invisible until a cost spike or a user complaint exposes them. The harness-engineering canon (Anthropic, Martin Fowler, HumanLayer) converges on the same point: observability is part of the harness, not an afterthought.

What are the four pillars of agent observability?

Traces. Every model call, tool call, and sub-agent invocation in the order it happened, with parent/child spans. This is the timeline view — what happened first, what called what, where the run spent its time.
Tokens and cost. Input tokens and output tokens per call, dollar cost per run, per user, per task. Overage alerts when a run exceeds a budget. Cost is the single most actionable observability signal for most teams — it tells you immediately when an agent went off the rails.
Tool calls. Which MCP servers, functions, or shell commands were invoked. With arguments and results. This is where you discover the agent called rm -rf by accident, or hit the production API instead of staging, or wasted 14 minutes calling the wrong search function.
Quality and eval signals. Did the tests pass after the agent’s changes? Did the user accept the diff? Did an LLM-judge score the output as correct? These are the only signals that tell you the run was actually useful. The first three pillars tell you what happened; this one tells you whether it mattered.

Every serious observability tool covers the first three. The fourth (quality signals) is where the tools differentiate — some bake in evals, some leave it to you.

What is OpenTelemetry GenAI and why does it matter?

OpenTelemetry is the cross-vendor standard for observability. The OpenTelemetry GenAI semantic conventions, currently at version 1.40.0 (April 2026), specify how to instrument AI workloads in a vendor-neutral way. They cover four areas: LLM client spans, agent spans, events (prompt and completion content), and metrics. The LLM-client spans exited “experimental” in early 2026 and are now stable; the agent and framework spans are still labelled Development but they’re stable enough to use in production.

Why this matters for you: if you instrument your agent with OTel once, you can swap from Langfuse to Honeycomb to Datadog without re-instrumenting. Vendor portability is the point. The highest-signal items to instrument: create_agent and invoke_agent lifecycle spans, execute_tool for tool calls, the metric gen_ai.client.operation.duration (required), and gen_ai.client.token.usage (recommended).
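
To make the span names above concrete, here is a minimal sketch that uses only the Python standard library rather than the OpenTelemetry SDK. The span helper, the FINISHED_SPANS list, the agent name demo-agent, the model name example-model, the tool name read_file, and the token counts are all illustrative assumptions. The attribute keys follow the GenAI semantic conventions as published at the time of writing; check the current spec before relying on them.

```python
import time
import uuid
from contextlib import contextmanager

# In-memory stand-in for an exporter. A real setup would hand spans to an
# OpenTelemetry SDK and an OTLP exporter instead.
FINISHED_SPANS = []
_open_spans = []  # spans currently in progress, innermost last


@contextmanager
def span(name, attributes=None):
    """Record one span with a parent link, a duration, and attributes."""
    record = {
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": _open_spans[-1]["span_id"] if _open_spans else None,
        "name": name,
        "attributes": dict(attributes or {}),
    }
    _open_spans.append(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - start
        _open_spans.pop()
        FINISHED_SPANS.append(record)


def run_agent():
    """One hypothetical agent turn: a model call followed by a tool call."""
    with span("invoke_agent demo-agent", {
        "gen_ai.operation.name": "invoke_agent",
        "gen_ai.agent.name": "demo-agent",
    }):
        with span("chat example-model", {
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": "example-model",
        }) as llm:
            # A real call would copy these from the provider's usage payload.
            llm["attributes"]["gen_ai.usage.input_tokens"] = 1200
            llm["attributes"]["gen_ai.usage.output_tokens"] = 300
        with span("execute_tool read_file", {
            "gen_ai.operation.name": "execute_tool",
            "gen_ai.tool.name": "read_file",
        }):
            pass  # the tool itself would run here
    return FINISHED_SPANS
```

Each finished record carries a parent_id, so the model call and the tool call nest under the invoke_agent span. The duration_s field is the per-span analogue of what the gen_ai.client.operation.duration metric aggregates.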

A meaningful 2026 update: Claude Code ships native OpenTelemetry support. Set CLAUDE_CODE_ENABLE_TELEMETRY=1, point the OTLP exporter at Honeycomb, Datadog, Langfuse, or Logfire, and you get model name, tool name, token counts, and duration recorded automatically. Prompt and completion content are off by default for privacy.
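
As a rough illustration of that setup, the sketch below builds the environment for a Claude Code process. Only CLAUDE_CODE_ENABLE_TELEMETRY comes from the text above. The OTEL_* names are the standard OpenTelemetry exporter variables, and the assumption that Claude Code reads each of them should be checked against its documentation. The helper name and the localhost endpoint are hypothetical.

```python
import os


def claude_code_telemetry_env(otlp_endpoint, base_env=None):
    """Return a copy of the environment with telemetry export switched on."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "CLAUDE_CODE_ENABLE_TELEMETRY": "1",           # named in the article
        "OTEL_METRICS_EXPORTER": "otlp",               # standard OTel variable
        "OTEL_LOGS_EXPORTER": "otlp",                  # standard OTel variable
        "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",         # standard OTel variable
        "OTEL_EXPORTER_OTLP_ENDPOINT": otlp_endpoint,  # your collector or vendor
    })
    return env


# Hypothetical usage, assuming a collector listening on localhost:
#   import subprocess
#   subprocess.run(["claude"], env=claude_code_telemetry_env("http://localhost:4317"))
```

Building the environment in code rather than exporting variables globally keeps telemetry scoped to the sessions you launch this way.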

Which observability tools should you actually consider?

The space has consolidated fast. As of May 2026, ten tools are worth knowing about. Prices below are the vendors’ published list prices.

Langfuse — MIT-licensed, open-source, self-hostable. Hobby free (50K units/month), Core $29/mo, Pro $199/mo, Enterprise $2,499/mo. Acquired by ClickHouse in January 2026 alongside a $400M Series D; MIT licence confirmed to stay. The default open-source pick.
Helicone — open-source AI gateway (proxy). You change the base URL of your API client, observability is automatic. Fastest install in the category. Free tier; Pro $79/mo.
Arize Phoenix — open-source under Elastic License v2. Notebook-first. The deepest eval primitives in the open-source category. Strong integrations with the OpenAI Agents SDK and LlamaIndex.
Logfire (Pydantic) — OpenTelemetry-native, full-stack (LLM call, database query, frontend event in one trace), SQL-queryable, generous free tier. Best if you’re already in the Python ecosystem.
Weights & Biases Weave — eval-first. Natural fit for teams already on W&B for model training. Multi-turn agent flows and real-time alerting still maturing.
Honeycomb — high-cardinality general observability with an Agent Timeline view that shows user input → LLM call → tool call in one strip. Best if you want one platform for both your app and your agents.
Datadog LLM Observability — pays back if your stack is already on Datadog. Correlates agent steps with backend service traces and infrastructure metrics in one platform.
LangSmith — the default if you build with LangChain or LangGraph (auto-instrumentation). Free 5K traces/month; Plus $39/seat/mo.
Comet Opik — Apache 2.0 open source. Fastest trace logging in published benchmarks (about 23 seconds vs. Phoenix at 170s and Langfuse at 327s for the same workload).
Galileo — leads on RAG groundedness evaluation. Their Luna-2 judge models offer real-time guardrails.

For a solo developer or small team, the practical shortlist is Langfuse (if you want self-hosting as an option later) or Logfire (if you want OTel-native, full-stack tracing). Both have free tiers that cover real work. Skip the enterprise tools until you have enterprise-level volume to put through them.

What should you instrument in your own agent loop?

If you’re building or wrapping an agent yourself, seven fields are worth capturing on every turn:

Start and end timestamps. Lets you measure latency per call and end-to-end run time.
Model name and version. Claude Sonnet 4.6 is not Claude Sonnet 4.7. Mixing the two on the same eval suite gives you garbage data.
Input and output tokens. The headline cost driver.
Computed cost. Tokens times the published rate for that model, in dollars. Roll up per run, per user, per task.
Tool calls. Which tool was invoked, the arguments passed, the result returned, the time taken.
Errors and retries. If the agent retried three times before succeeding, you want to know. If retries are never logged, you can’t tell what’s normal.
User-perceived outcome. Did the user accept the diff, reject it, or edit it before committing? This is the only signal that tells you the run was actually useful. Without it, you have telemetry but no ground truth.

The user-outcome field is the one most teams forget. The first three pillars (traces, tokens, tools) are mechanical. The fourth (quality) requires you to actually wire up the feedback signal. Without it, you’re optimising in the dark.
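
One way to hold the seven fields is a small record type. The sketch below is an assumption about shape, not any vendor's schema: the class names, the model name example-model-v1, and the prices in RATES_PER_MTOK are hypothetical, so substitute your provider's published rates.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical prices in dollars per million tokens.
RATES_PER_MTOK = {"example-model-v1": {"input": 3.00, "output": 15.00}}


@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: str
    duration_s: float


@dataclass
class TurnRecord:
    started_at: float                  # 1. start timestamp (epoch seconds)
    ended_at: float                    #    end timestamp
    model: str                         # 2. exact model name and version
    input_tokens: int                  # 3. tokens in
    output_tokens: int                 #    tokens out
    tool_calls: list = field(default_factory=list)  # 5. ToolCall entries
    errors: list = field(default_factory=list)      # 6. error messages
    retries: int = 0                                #    retry count
    user_verdict: Optional[str] = None  # 7. "accepted", "rejected", or "edited"

    @property
    def latency_s(self):
        return self.ended_at - self.started_at

    @property
    def cost_usd(self):
        # 4. computed cost: tokens times the per-token rate for this model
        rate = RATES_PER_MTOK[self.model]
        return (self.input_tokens * rate["input"]
                + self.output_tokens * rate["output"]) / 1_000_000
```

With the hypothetical rates above, a turn with 12,000 input tokens and 800 output tokens costs (12,000 × 3 + 800 × 15) / 1,000,000 = $0.048. Leaving user_verdict as None until feedback arrives makes the missing ground truth visible instead of silently defaulting it.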

What’s the simplest MVP setup for a beginner?

Six steps. About 30 minutes start to finish.

Step 1. Sign up for Logfire or Langfuse free tier. Either is fine; pick by personal taste.
Step 2. Install the SDK (one pip install or npm install).
Step 3. Add three lines of init code at the top of your agent script with your API key.
Step 4. Wrap your agent loop. Both tools have a one-liner decorator or context manager for this.
Step 5. Run your agent on one real task. Open the dashboard. Confirm you see the trace.
Step 6. Set a cost alert. A single per-day budget alert is enough to start.

That’s it. You now have traces, token counts, and cost recorded automatically. The next 30 minutes of optional polish: wire the user verdict (accepted / rejected / edited) into a tag on each trace, and run a weekly review of the runs that cost the most.
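
Hosted tools ship their own alerting, so you would normally configure Step 6 in the dashboard. The sketch below only illustrates the logic behind a per-day budget alert; the DailyBudget class and the $5 limit in the usage note are hypothetical.

```python
from collections import defaultdict
from datetime import date


class DailyBudget:
    """Track spend per calendar day and signal once when a limit is crossed."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent = defaultdict(float)  # day -> dollars spent so far
        self.alerted = set()             # days that have already fired

    def record(self, cost_usd, day=None):
        """Add one run's cost. Return True the first time the day goes over."""
        day = day or date.today()
        self.spent[day] += cost_usd
        if self.spent[day] > self.limit_usd and day not in self.alerted:
            self.alerted.add(day)
            return True
        return False
```

Call record with each run's cost and send a notification whenever it returns True. For example, with DailyBudget(5.0), runs costing $3.00 and then $2.50 on the same day trigger the alert on the second run and stay quiet for the rest of that day.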

What does production-grade agent observability look like?

Sampling strategy. At high volume, you don’t want to trace every successful run — the storage and cost compound. Standard pattern: trace 100% of failures and a fixed percentage (often 10%) of successes. The failures are where the signal lives.
PII redaction. Prompt and completion content often include sensitive data (customer emails, account numbers, code snippets that contain secrets). Redact before exporting to your observability platform. Every major tool has SDK hooks for this.
Cost alerts per project. If a single run normally costs $0.30, you want to be paged when one costs $30. Cost alerts are the cheapest production safeguard against runaway loops.
Nightly eval pipeline. Run an LLM-judge or your test suite over yesterday’s traces. Score each one. Flag regressions. This is how you catch quality degradation before users do.
Claude Code OTel hookup. Set CLAUDE_CODE_ENABLE_TELEMETRY=1 and point the OTLP exporter at your existing platform. You get model, tool, token counts, and duration for every Claude Code session your team runs. Prompt content is off by default for privacy.
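
The sampling and redaction items above can be sketched as a small export filter. Everything here is an assumption for illustration: the trace dictionary keys (failed, prompt, completion), the function names, and especially the SECRET pattern, which only matches one made-up key shape. Real redaction needs patterns tuned to your own data, and most tools expose their own hooks for it.

```python
import random
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical secret shape: a key-like prefix, a separator, then 16+ characters.
SECRET = re.compile(r"\b(?:sk|pk|api|key)[-_][A-Za-z0-9_-]{16,}\b")


def redact(text):
    """Mask email addresses and key-like strings before text leaves the process."""
    text = EMAIL.sub("[EMAIL]", text)
    return SECRET.sub("[SECRET]", text)


def should_export(failed, success_rate=0.10, rng=random.random):
    """Keep every failed run and a fixed fraction of successful ones."""
    return failed or rng() < success_rate


def prepare_for_export(trace, rng=random.random):
    """Return a redacted copy of the trace, or None if it is sampled out."""
    if not should_export(trace["failed"], rng=rng):
        return None
    return {
        **trace,
        "prompt": redact(trace["prompt"]),
        "completion": redact(trace["completion"]),
    }
```

Passing rng as a parameter keeps the sampling decision testable. Redaction runs after the sampling check so that dropped traces cost nothing to process.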

What observability anti-patterns should you avoid?

Logging every prompt and completion in full. Cost balloons. PII risk balloons. Signal drowns in noise. Sample instead.
Not capturing the prompt at all. You can’t reproduce a bug if you didn’t record what the user (or upstream system) asked. Capture the prompt; redact the sensitive parts.
Not capturing the final user verdict. The whole stack of telemetry is decoration without the one signal that proves the run worked.
Trusting the agent’s self-reported “Task complete.” The whole reason you’re reading this is that agents claim success on broken output. The observability fix is to record the eval signal, not the agent’s opinion. See AI Agent Verification for the full pattern.
Building your own dashboard before reaching for a tool. Every team has done this. It always takes longer than expected. Use a vendor tool until you genuinely outgrow it.
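
For the self-reported “Task complete” anti-pattern, one simple fix is to store the agent's claim next to an independent check. The sketch below assumes your project has a test command that exits non-zero on failure; the function name, the field names, and the pytest command in the usage note are hypothetical.

```python
import subprocess
import sys


def record_outcome(agent_claim, test_command, timeout_s=600):
    """Run the test command and record its verdict alongside the agent's claim."""
    result = subprocess.run(
        test_command,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return {
        "agent_claim": agent_claim,                 # what the agent said
        "tests_passed": result.returncode == 0,     # what actually happened
        "exit_code": result.returncode,
        # Keep only the tail of the output so the record stays small.
        "output_tail": (result.stdout + result.stderr)[-2000:],
    }
```

A call such as record_outcome("Task complete.", [sys.executable, "-m", "pytest", "-q"]) gives you a record where the claim and the evidence can disagree, which is exactly the case you want surfaced in the dashboard.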

Frequently asked questions

Do solo developers actually need observability?

If you run agents for more than an hour a week, yes: cost alone justifies it. The free tier of Langfuse or Logfire is genuinely free, the setup is 30 minutes, and the first time the dashboard tells you an agent looped for $15 worth of tokens you’ll be glad you wired it up. If all you run is one short scripted session a week, skip it.

Langfuse or Logfire — which one?

Langfuse if open-source-and-self-hostable matters to you (it’s MIT-licensed and you can run it on your own server forever for free). Logfire if you want full-stack tracing in a Python-first project (the Pydantic team’s tool, OTel-native, traces your LLM call alongside your database query and frontend event in one timeline). Either is a strong starting point.

Does Claude Code have its own observability?

Yes. Claude Code ships native OpenTelemetry support — set the environment variable CLAUDE_CODE_ENABLE_TELEMETRY=1 and point an OTLP exporter at any compatible platform (Honeycomb, Datadog, Langfuse, Logfire). You get model, tool, token counts, and duration automatically. Prompt content is off by default; you can opt it in.

What about Cursor and GitHub Copilot?

Cursor and GitHub Copilot have less mature external-observability stories than Claude Code does. Cursor exposes some usage data per workspace; Copilot’s team plan dashboard tracks acceptance rates. For deeper traces, you’d typically observe the model layer (the underlying LLM API) using your own proxy or gateway like Helicone.

How expensive is observability at production volume?

Roughly 1–3% of your LLM bill once you sample correctly. Self-hosted Langfuse on a small VM is effectively free past the one-time setup; cloud Langfuse at Pro ($199/mo) covers most small teams; Logfire’s free tier handles meaningful workloads; Datadog and Honeycomb scale with your existing observability spend. The runaway cost everyone fears comes from logging every prompt and completion in full at high volume. Sample 10% on the happy path and you stay well within budget.

Sources

Last reviewed: May 2026. Observability vendor pricing and OTel spec versions change quarterly — verify on the vendor pages above before committing to a tool.
