What debugging AI agents taught me about observability

The first time I got a “your agent did something weird” message from a user, I did the usual thing:

Pull logs
Search for an error
Find nothing useful

And then I hit the wall: agents don’t fail like normal code.

When an LLM is making decisions, “INFO: tool called” is basically the same as saying “something happened.”

This is what debugging agents taught me about observability.

The Problem with Traditional Logging

Traditional apps are mostly deterministic. You log:


INFO: user requested action X
INFO: fetched data from API
INFO: returned result Y

That gives you a narrative.

With agents, the naive version looks like:


INFO: agent received task
INFO: agent called tool
INFO: agent returned response

…and it tells you nothing that helps you answer the question you actually care about:

Why did it do that?

[INTERNAL LINK: relevant post on LangGraph agent architecture or production structure]

The First Mistake I Made

I treated “agent logging” like normal application logging. I tried to keep it minimal.

I was worried about:

log volume
cost
privacy

Those are real concerns.

But the tradeoff is harsh: when you don’t log enough, you pay for it later in debugging time and customer trust.

The Breakthrough

Here’s what actually happened when I stopped treating observability as a nice-to-have and started treating it as a first-class feature of the system: debugging went from “we have no idea what happened” to “we can replay the exact moment things diverged.” Three things made that shift possible.

What You Actually Need (My Current Baseline)

After enough incidents, I now log three things religiously.

1. Full state at decision points

Every time the agent makes a decision, I want enough data to reconstruct the context. Here’s the tradeoff I keep coming back to: verbose logging feels expensive until the first time it saves you from a two-day debugging session.

Here’s a pattern that’s worked well for me:

type AgentState = {
  traceId: string;
  context: Record<string, unknown>;
  messages: Array<{ role: string; content: string }>;
};

export function logDecisionPoint(params: {
  logger: { info: (obj: unknown) => void };
  state: AgentState;
  decision: string;
  alternatives: string[];
  tokenBudgetRemaining: number;
}) {
  const { logger, state, decision, alternatives, tokenBudgetRemaining } =
    params;

  logger.info({
    event: "decision",
    traceId: state.traceId,
    decision,
    alternatives,
    contextKeys: Object.keys(state.context),
    messageCount: state.messages.length,
    lastMessagePreview: state.messages.at(-1)?.content.slice(0, 200) ?? "",
    tokenBudgetRemaining,
  });
}

This is expensive. It’s also the difference between being able to replay the exact moment things diverged and having nothing to work with at all.

2. Tool inputs and outputs (verbatim)

I don’t summarize tool calls anymore.

If a tool call matters, I log:

the input payload
the output payload
timing
errors

type Logger = {
  info: (obj: unknown) => void;
  error: (obj: unknown) => void;
};

export async function tracedToolCall<
  TInput extends Record<string, unknown>,
  TOutput,
>(params: {
  logger: Logger;
  toolName: string;
  toolInput: TInput;
  execute: (toolName: string, toolInput: TInput) => Promise<TOutput>;
}): Promise<TOutput> {
  const { logger, toolName, toolInput, execute } = params;
  const start = performance.now();

  logger.info({ event: "tool_call_start", tool: toolName, input: toolInput });

  try {
    const output = await execute(toolName, toolInput);
    logger.info({
      event: "tool_call_success",
      tool: toolName,
      output,
      durationMs: performance.now() - start,
    });
    return output;
  } catch (e) {
    logger.error({
      event: "tool_call_error",
      tool: toolName,
      error: e instanceof Error ? e.message : String(e),
      errorType: e instanceof Error ? e.name : "Unknown",
    });
    throw e;
  }
}

3. The prompt (every time)

This is the one I avoided the longest.

Nobody tells you this, but the prompt is code. If you can’t reproduce the prompt, you can’t reproduce the behavior.

export async function logLLMCall(params: {
  logger: Logger;
  prompt: string;
  response: string;
  model: string;
  usage: { inputTokens: number; outputTokens: number };
  costUsd: number;
  store: (data: {
    prompt: string;
    response: string;
    metadata: Record<string, unknown>;
  }) => Promise<void>;
}) {
  const { logger, prompt, response, model, usage, costUsd, store } = params;

  logger.info({
    event: "llm_call",
    model,
    promptLength: prompt.length,
    responseLength: response.length,
    inputTokens: usage.inputTokens,
    outputTokens: usage.outputTokens,
    costUsd,
  });

  await store({
    prompt,
    response,
    metadata: { model, usage, costUsd },
  });
}

[INTERNAL LINK: relevant post on token cost management or LLM cost optimization]

The Debugging Workflow

When something breaks, here’s the workflow that stops me from flailing:

Find the trace ID
Reconstruct the state timeline from decision logs
Identify the divergence point
Replay using the exact prompt + model

Step 4 is the real unlock. Inputs alone are not enough — the prompt at that specific moment, with that specific context, is what you need to reproduce the behavior. Without logging it, step 4 is impossible.

The Privacy Tradeoff (And How I Think About It)

Logging prompts and tool payloads can easily become “logging user data.”

My current approach:

log everything in dev and staging
in production, be deliberate about:
- retention windows
- PII redaction
- access controls
- sampling strategies

It’s not perfect. It’s a moving target. But the alternative — not logging enough — is a worse tradeoff.

[INTERNAL LINK: relevant post on privacy considerations in AI agent systems]

Your Turn

If you’re running agents in production: what’s the one thing you wish you had logged from day one — and did it change how you designed your state or your tracing setup?

I’m discussing this on LinkedIn and X — come share your thoughts there.