Chapter 18 · 11 min read

Debugging & Observability

Your agent did something weird at 3 AM. Here's how to figure out why.

Your agent ran a cron job at 3 AM. Something went wrong. The output was weird. How do you figure out what happened? This chapter gives you the debugging and observability toolkit.

🍕 Real-life analogy
Imagine you hire a night-shift worker. They work while you sleep. In the morning, the work is done — but something's off. Without security cameras (logs), a task checklist (traces), and a supervisor's report (summaries), you'd have no idea what went wrong. Observability is your security camera system for AI agents.

The 3 Pillars of Agent Observability

1. 📋 Logs — What Happened

Raw record of every action. Input → thinking → output → tool calls → results. The foundation of debugging.

2. 🔗 Traces — The Full Journey

Connected chain of events: trigger → model call → tool use → response → delivery. Shows cause and effect.

3. 📊 Metrics — The Trends

Token usage over time, error rates, response latency, cost per task. Tells you if things are getting better or worse.
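The three pillars can share one substrate. A lightweight pattern, sketched here as an assumption rather than a prescribed format: stamp every log entry with a shared run ID, so that grouping entries by that ID recovers the trace, and aggregating across entries yields the metrics.

```python
import json
import uuid
from datetime import datetime, timezone

def new_run_id():
    """One ID per triggered run; every event in that run shares it."""
    return uuid.uuid4().hex

def log_event(run_id, event_type, detail, path="logs/agent.jsonl"):
    """Append one structured event. Grouping by run_id recovers the trace."""
    entry = {
        "run_id": run_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event_type,   # e.g. "trigger", "model_call", "tool_call"
        "detail": detail,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

The file path and field names here are illustrative; what matters is that every event carries the same `run_id` from trigger to delivery.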

Common Agent Failures & How to Debug Them

🐛 "The output was wrong/hallucinated"

Debug steps:

  1. Check the input prompt — was context missing?
  2. Check which model was used — cheaper models hallucinate more
  3. Check if knowledge base files were accessible
  4. Check context window — was it full/truncated?
  5. Fix: Add missing context to knowledge base, or upgrade model for that task
🐛 "The cron job didn't run"
  1. Check cron expression — is the timezone correct?
  2. Check if the gateway was running at scheduled time
  3. Check API key validity — expired keys fail silently
  4. Check rate limits — were you throttled?
  5. Fix: Add a heartbeat check that monitors cron execution
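The heartbeat fix above can be sketched in a few lines: every successful cron run touches a timestamp file, and a separate check flags the job as stale when that timestamp is older than the expected interval. The file path is a hypothetical example.

```python
import os
import time

HEARTBEAT_FILE = "logs/last_cron_run"   # hypothetical path

def record_heartbeat(path=HEARTBEAT_FILE):
    """Call at the end of every successful cron run."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(str(time.time()))

def heartbeat_is_stale(max_age_seconds, path=HEARTBEAT_FILE):
    """True if the job hasn't checked in within its expected interval."""
    if not os.path.exists(path):
        return True   # never ran: treat as stale
    last = float(open(path).read())
    return (time.time() - last) > max_age_seconds
```

Run the staleness check from anything that is scheduled independently of the job it watches — otherwise the watcher fails with the watched.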
🐛 "The agent went off-script"
  1. Check for prompt injection in input data
  2. Check if system prompt was too vague or contradictory
  3. Check conversation history — did it drift over many messages?
  4. Check if it hit a tool error and improvised badly
  5. Fix: Tighten system prompt, add guardrails, use isolated sessions for risky tasks
🐛 "Costs spiked unexpectedly"
  1. Check for infinite loops (agent retrying failed tool calls)
  2. Check context window size — bloated history = expensive
  3. Check if a cron job ran more often than expected
  4. Check if model was accidentally set to Opus/o3 for routine tasks
  5. Fix: Set max iterations, compact context, fix model routing
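The "set max iterations" fix above is one guard function. A sketch, with hypothetical names — the point is that every retry loop in an agent gets a hard cap:

```python
def call_tool_with_cap(tool_fn, args, max_attempts=3):
    """Retry a flaky tool call, but never loop forever.

    Returns (result, attempts_used); raises after max_attempts failures.
    An uncapped retry loop is the classic source of a surprise API bill.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return tool_fn(**args), attempt
        except Exception as e:   # in real code, catch the tool's specific errors
            last_error = e
    raise RuntimeError(f"tool failed after {max_attempts} attempts") from last_error
```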

🔌 Observability by Platform

🐾 OpenClaw
Built-in Debugging
# View session history
openclaw sessions list --active 60

# Check cron job runs
openclaw cron runs --job "Trading Plan" --limit 5

# View session logs
openclaw sessions history --session <key> --include-tools

# Check gateway status
openclaw status

# Monitor in real-time
# Add a daily self-diagnostic cron:
openclaw cron add --name "Daily Health Check" \
  --cron "0 22 * * *" --session isolated \
  --message "Run a self-diagnostic:
  1. Check all cron jobs ran today (list runs)
  2. Check for any errors in recent sessions
  3. Report: tasks completed, tasks failed, total cost
  Post summary to Discord." \
  --model "haiku" --announce
🤖 Claude API — Built-in Logging
Python Logging Setup
import anthropic
import json
import os
from datetime import datetime

client = anthropic.Anthropic()

def logged_completion(prompt, model="claude-sonnet-4-20250514"):
    """Wrapper that logs every API call"""
    start = datetime.now()

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    log_entry = {
        "timestamp": start.isoformat(),
        "model": model,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "latency_ms": (datetime.now() - start).total_seconds() * 1000,
        "prompt_preview": prompt[:100],
        "stop_reason": response.stop_reason,
    }

    os.makedirs("logs", exist_ok=True)  # create the log directory on first run
    with open("logs/agent.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")

    return response
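The JSONL log doubles as the data source for the metrics pillar. A sketch that rolls it up into a daily summary — the field names match the log entry above, but the per-million-token prices are placeholders; substitute your provider's current rates:

```python
import json

# Placeholder prices per million tokens -- check your provider's real rates
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def summarize_log(path="logs/agent.jsonl"):
    """Aggregate a JSONL call log: calls, tokens, avg latency, est. cost."""
    calls, in_tok, out_tok, latency = 0, 0, 0, 0.0
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            calls += 1
            in_tok += entry["input_tokens"]
            out_tok += entry["output_tokens"]
            latency += entry["latency_ms"]
    cost = (in_tok * PRICE_PER_MTOK["input"] + out_tok * PRICE_PER_MTOK["output"]) / 1e6
    return {
        "calls": calls,
        "input_tokens": in_tok,
        "output_tokens": out_tok,
        "avg_latency_ms": latency / calls if calls else 0.0,
        "est_cost_usd": round(cost, 4),
    }
```

Run it from a daily cron and you have the "is this trending worse?" signal without any external tooling.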
🚀 CrewAI / LangChain — LangSmith & Arize
  • LangSmith: Automatic tracing for LangChain. See every chain step, token count, latency
  • Arize Phoenix: Open-source observability. Local dashboard for traces + evals
  • CrewAI verbose=True: Prints every agent step to console — basic but effective
  • OpenTelemetry: Industry standard. Export traces to any observability platform
⚡ n8n / Make / Zapier
  • n8n: Built-in execution history. Click any run to see input/output for every node
  • Make: Scenario history with full data flow visualization
  • Zapier: Task history with per-step data. Set up error notifications
  • Pro tip: Add a "log to spreadsheet" node at the end of every workflow for your own analytics
💻 Cursor / Windsurf / Cline
  • Cursor: Check Settings → Usage to monitor token consumption
  • Git diff: Review what the agent changed with git diff before committing
  • Undo: Use git to revert bad changes: git checkout -- .
  • Cline: Shows full conversation log in the sidebar — review reasoning

The "Morning After" Checklist

Daily Review Prompt (5 minutes)
Quick daily review checklist:

1. Did all scheduled cron jobs run? ✅/❌
2. Any error messages in logs? ✅/❌
3. Token usage within budget? ✅/❌
4. Output quality acceptable? ✅/❌
5. Any unexpected behaviors? ✅/❌

If all ✅ → Great, move on.
If any ❌ → Debug using the failure patterns above.

The "Time Travel" Debug Technique

The most powerful debugging technique for agents: reproduce the exact conditions. When something goes wrong at 3 AM, you need to see what the agent saw.

Time Travel Debugging
# Step 1: Find the failing run
openclaw cron runs --job "Trading Plan" --limit 1

# Step 2: Check what files existed at that time
git log --oneline --until="2026-02-22T06:00:00" -5

# Step 3: Check what the agent's memory looked like
git show HEAD~2:memory/2026-02-21.md

# Step 4: Re-run with the same context
openclaw cron run --job "Trading Plan" --force
# This re-runs the exact same prompt in a fresh session

# Step 5: Compare outputs
# Old output (from logs) vs new output (from re-run)
# If they match → the input was the problem
# If they differ → the model was non-deterministic (use temperature 0)

Building Your Dashboard

After a month of running agents, you'll want a dashboard. Here's the minimal viable monitoring setup:

Daily Health Cron
openclaw cron add \
  --name "Agent Health Dashboard" \
  --cron "0 22 * * *" \
  --session isolated \
  --message "End-of-day health check:

1. List all cron jobs and their last run status
2. Count total sessions today
3. Estimate today's API cost
4. Check for any errors in logs
5. Compare today's output quality to baseline

Format:
📊 **Agent Health — [Date]**
- Cron jobs: [X/Y ran successfully]
- Sessions: [N total]
- Est. cost: $[amount]
- Errors: [count] [brief description if any]
- Quality: [✅ Good / ⚠️ Check needed / ❌ Issues found]

If all green, keep it to 5 lines max." \
  --model "haiku" --announce \
  --channel discord --to "channel:YOUR_ID"
🔑 The #1 Debugging Rule
Always check the input before blaming the model. 90% of "the AI is broken" problems are actually "I gave it bad/missing context" problems. Check what went IN before analyzing what came OUT.

The Debugging Flowchart

When your agent does something weird, follow this exact sequence:

  1. 🔍 Check the input — What context/files did the agent actually receive? Was anything missing or malformed?
  2. 📋 Check the prompt — Did your instructions clearly describe what you wanted? Any ambiguity?
  3. 🧠 Check the context window — Did you exceed the token limit? Was important info truncated?
  4. 🔧 Check the tools — Did the agent have access to the right tools? Did any tool calls fail?
  5. 🎯 Check the output parsing — Did the agent output correctly but your parser misread it?
  6. 🤖 THEN blame the model — Only after checking everything else. Try a different model or temperature.

Observable Agent Architecture

The best debugging setup is one where you can see everything without adding debug code. Build observability in from day one:

  • 📝 Input logging — save every prompt sent to the model (with timestamps)
  • 📝 Output logging — save every response received
  • 📝 Tool call logging — which tools were called, with what args, and what they returned
  • 📝 Decision logging — why the agent chose action A over action B
  • 📝 Cost logging — tokens used per request (catches runaway costs early)
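The tool-call line item above is one decorator: wrap each tool function once, and every call records its name, arguments, and result or error. The in-memory list here is for illustration — in practice you would append to the same JSONL file as everything else:

```python
import functools

TOOL_LOG = []   # illustration only; persist to JSONL in a real agent

def logged_tool(fn):
    """Record every call to a tool: name, args, and result or error."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {"tool": fn.__name__, "args": args, "kwargs": kwargs}
        try:
            record["result"] = fn(*args, **kwargs)
            return record["result"]
        except Exception as e:
            record["error"] = repr(e)
            raise
        finally:
            TOOL_LOG.append(record)   # logged on success AND failure
    return wrapper

@logged_tool
def fetch_price(ticker):
    return {"ticker": ticker, "price": 101.5}   # stubbed tool body
```

Because the `finally` block always runs, failed tool calls show up in the log too — which is exactly the evidence you need for the "hit a tool error and improvised badly" failure mode.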

The "Replay" Technique

The most powerful debugging technique: replay the exact same input and see if you get the same output. If you do, the bug is deterministic (probably a prompt issue). If you don't, the bug is stochastic (probably a temperature or context window issue).

This is why input logging matters so much. Without the exact input, you can't replay. And without replay, you're guessing.
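The replay classification can be automated. A sketch, where `run_fn` stands in for whatever invokes your agent (a hypothetical hook, not a real API):

```python
def classify_failure(run_fn, logged_input, runs=3):
    """Replay the same logged input several times and classify the bug.

    Identical outputs -> deterministic: suspect the prompt/input.
    Differing outputs -> stochastic: suspect temperature or context.
    """
    outputs = {run_fn(logged_input) for _ in range(runs)}
    return "deterministic" if len(outputs) == 1 else "stochastic"
```

Three runs is a heuristic, not a proof — a stochastic bug can produce identical outputs by chance — but it usually tells you which half of the flowchart to start in.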

🧠 Quick Check
Your agent suddenly starts giving wrong answers about your project. It worked fine yesterday. What do you check first?
