Cost Optimization & Model Selection
Stop burning $50/day on GPT-4 when $3/day gets you 90% there
Most people run GPT-4o for everything and wonder why their bill is $200/month. The truth? 90% of your agent's tasks don't need the smartest model. This chapter shows you how to cut costs by 80% without sacrificing quality.
Why Most People Overspend (and How to Stop)
The #1 cost mistake isn't using the wrong model — it's using the right model for the wrong tasks. In a typical unoptimized setup, everything (heartbeat checks, social monitoring, content drafts) runs on the flagship model.
Route each task to the cheapest tier that can handle it, and you get the same output quality for 93% less cost. The heartbeat doesn't need Opus to check "any new emails?" The social monitor doesn't need Opus to count likes. Match the model to the task complexity, and your bill plummets.
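Routing doesn't have to be elaborate. Here's a minimal sketch of the idea in Python; the task categories and model names are illustrative placeholders, not a prescribed scheme — swap in whatever your stack actually uses:

```python
# Minimal model router: map each task type to the cheapest capable tier.
# Task categories and model names below are illustrative assumptions.
ROUTES = {
    "heartbeat":    "claude-3-5-haiku",  # "any new emails?" checks
    "formatting":   "claude-3-5-haiku",
    "writing":      "claude-sonnet",     # posts, newsletters
    "research":     "claude-sonnet",
    "architecture": "claude-opus",       # rare, expensive calls
}

def pick_model(task_type: str) -> str:
    """Route to the mapped tier; unknown tasks default to the mid tier, never the top one."""
    return ROUTES.get(task_type, "claude-sonnet")

print(pick_model("heartbeat"))  # -> claude-3-5-haiku
```

The key design choice is the default: when the router doesn't recognize a task, it falls back to the balanced tier, so an unclassified task can never silently burn Opus-level money.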
The Context Window Tax
There's a hidden cost most people miss: context window bloat. Every message in a long conversation gets re-sent as context. A 50-message chat with a 10K-token system prompt means you're paying for that system prompt 50 times.
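The arithmetic is worth doing once: because the full history is re-sent each turn, per-message input grows linearly and the conversation's total input grows quadratically. A rough back-of-envelope, with average message size and pricing as assumptions:

```python
# Rough cost of a 50-message conversation where the full history
# (system prompt + all prior messages) is re-sent on every turn.
SYSTEM_PROMPT = 10_000    # tokens, re-sent with every message
MSG_TOKENS = 500          # assumed average tokens per message
PRICE_PER_M_INPUT = 3.00  # assumed $/1M input tokens (Sonnet-class)

total_input = 0
for turn in range(1, 51):
    # each turn pays for the system prompt plus all prior messages
    total_input += SYSTEM_PROMPT + turn * MSG_TOKENS

print(f"total input tokens: {total_input:,}")            # 1,137,500
print(f"system prompt share: {50 * SYSTEM_PROMPT:,}")    # 500,000 — 10K paid 50 times
print(f"approx cost: ${total_input / 1e6 * PRICE_PER_M_INPUT:.2f}")  # $3.41
```

Nearly half that spend is the same 10K-token system prompt billed 50 times over, which is exactly the waste the tactics below attack.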
Three ways to fight it:
- Isolate your cron jobs. Each cron job starts fresh — no accumulated history. This alone can cut cron costs by 60-80% compared to running everything in the main session.
- Compact long sessions. When your main session hits 100K tokens, run /compact. This summarizes old messages and frees up context, reducing the per-message cost of future interactions.
- Slim your AGENTS.md. A 15K-token AGENTS.md means 15K tokens charged on every single message. Move detailed procedures to knowledge base files that are read on-demand, not loaded every turn.
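To put a number on the AGENTS.md tax, compare a 15K-token system prompt against a trimmed 2K-token one over a month. The message volume and pricing here are assumptions — plug in your own:

```python
# Monthly system-prompt tax: the AGENTS.md is charged on every message.
MESSAGES_PER_MONTH = 3_000  # assumed agent message volume
PRICE_PER_M_INPUT = 3.00    # assumed $/1M input tokens (Sonnet-class)

def prompt_tax(prompt_tokens: int) -> float:
    """Dollars per month spent just re-sending the system prompt."""
    return MESSAGES_PER_MONTH * prompt_tokens / 1e6 * PRICE_PER_M_INPUT

bloated = prompt_tax(15_000)  # everything inlined in AGENTS.md
lean = prompt_tax(2_000)      # details moved to on-demand knowledge files
print(f"bloated: ${bloated:.2f}/mo, lean: ${lean:.2f}/mo")
# bloated: $135.00/mo, lean: $18.00/mo
```

Same agent, same capabilities — the detailed procedures still exist, they're just read when needed instead of billed on every turn.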
The Model Tier Strategy
Cheap tier
Use for: simple replies, formatting, classification, routine tasks
- GPT-4o-mini — $0.15/1M input, $0.60/1M output
- Claude 3.5 Haiku — $0.25/1M input, $1.25/1M output
- Gemini Flash — $0.075/1M input, $0.30/1M output
Balanced tier
Use for: content writing, analysis, code generation, research synthesis
- Claude Sonnet — $3/1M input, $15/1M output
- GPT-4o — $2.50/1M input, $10/1M output
- Gemini Pro — $1.25/1M input, $5/1M output
Expert tier
Use for: complex reasoning, architecture decisions, strategy, debugging hard problems
- Claude Opus — $15/1M input, $75/1M output
- GPT-4.5 — $75/1M input, $150/1M output
- o1 / o3 — variable, reasoning-heavy pricing
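The list prices above make per-task comparisons easy to script. A sketch using the figures quoted in this chapter — verify them against current provider pricing pages before relying on the numbers:

```python
# $/1M tokens (input, output), taken from the tier lists above.
# List prices change; check your provider's current pricing page.
PRICES = {
    "gemini-flash": (0.075, 0.30),
    "gpt-4o-mini":  (0.15, 0.60),
    "haiku":        (0.25, 1.25),
    "gpt-4o":       (2.50, 10.00),
    "sonnet":       (3.00, 15.00),
    "opus":         (15.00, 75.00),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at list prices."""
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# One daily report: 8K tokens in, 1K tokens out.
for m in ("haiku", "sonnet", "opus"):
    print(f"{m}: ${task_cost(m, 8_000, 1_000):.4f}")
```

Run it for your own token profile and the tier gaps become concrete: the same report costs roughly an order of magnitude more at each step up.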
Task-to-Model Mapping
# Model Routing Rules
## Use CHEAP model (Haiku/Flash/4o-mini):
- Formatting text (markdown, JSON conversion)
- Simple classification ("is this urgent?")
- Acknowledging messages ("got it, working on it")
- Heartbeat checks (is anything new?)
- Reading and summarizing short documents
## Use BALANCED model (Sonnet/4o/Gemini Pro):
- Writing content (tweets, posts, newsletters)
- Research synthesis (combining multiple sources)
- Code generation (new features, bug fixes)
- Data analysis (trends, patterns)
- Cron job outputs (daily reports, plans)
## Use EXPERT model (Opus/o3) — sparingly:
- Architecture decisions ("how should I structure this?")
- Debugging complex issues
- Strategy and planning
- Code review for critical systems
- When Sonnet gets it wrong twice

🔌 Platform-Specific Cost Optimization
Agent platforms
- Set the default model to Sonnet in config, and override to Opus only for complex tasks
- Use `--model` per cron job to pick the right tier
- Set `contextTokens: 50000` instead of 200K — most tasks don't need huge context
- Use isolated sessions for cron jobs — they start fresh without dragging history
- Run `/compact` when context exceeds 100K to avoid paying for repeated context
Claude API
- Use prompt caching — repeated system prompts cost 90% less after the first call
- Use Haiku for preprocessing, Sonnet for the main work, and Opus only for review
- Set `max_tokens` to limit output length (don't pay for rambling)
- Use the Batch API — 50% discount for non-time-sensitive tasks (reports, analysis)
OpenAI API
- Use GPT-4o-mini for 80% of tasks — it's shockingly good for the price
- Use structured outputs (JSON mode) to reduce output tokens
- Use the Batch API — 50% off for async processing
- Avoid GPT-4.5 unless genuinely needed — it's 30x more expensive than GPT-4o
Multi-agent frameworks
- Assign cheaper models to simple agents (research → Haiku, writing → Sonnet)
- Set `max_iterations` per agent to prevent runaway loops
- Cache tool results — don't re-search the same query twice
- Use LangSmith or Arize to identify which agents burn the most tokens
Automation platforms (Zapier, Make)
- Use AI nodes sparingly — each one is an API call
- Combine multiple prompts into one node where possible
- Cache results in a database instead of re-querying
- Set usage alerts — Zapier/Make costs can spiral if workflows run too often
- • Use "fast" model for autocomplete, "smart" model only for complex edits
- • Be specific in prompts — vague prompts = more back-and-forth = more tokens
- • Use @file references instead of pasting entire files into chat
- • Cursor Pro ($20/mo) vs API credits — calculate which is cheaper for your usage
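Several of these discounts stack. Both Anthropic and OpenAI offer batch processing at roughly 50% off, and cached prompt tokens are billed at a steep discount (Anthropic quotes cache reads at about a tenth of the base input price). A sketch of the combined effect — the discount rates here are assumptions to verify against current pricing:

```python
def monthly_input_cost(tokens_per_month: float,
                       base_price_per_m: float,
                       cached_fraction: float = 0.0,
                       batch: bool = False) -> float:
    """Estimate monthly input spend with prompt-caching and batch discounts.

    cached_fraction: share of input tokens served from the prompt cache,
    billed here at 10% of base price (assumed rate; check your provider).
    batch: apply an assumed 50% batch-processing discount on top.
    """
    cached = tokens_per_month * cached_fraction * base_price_per_m * 0.10
    fresh = tokens_per_month * (1 - cached_fraction) * base_price_per_m
    cost = (cached + fresh) / 1e6
    return cost * 0.5 if batch else cost

full = monthly_input_cost(50e6, 3.00)                               # no discounts
opt = monthly_input_cost(50e6, 3.00, cached_fraction=0.8, batch=True)
print(f"${full:.0f}/mo -> ${opt:.0f}/mo")  # $150/mo -> $21/mo
```

For a workload that is mostly repeated system prompt (a high cached fraction) and not time-sensitive, the two discounts together cut the same input volume by roughly 85%.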
The Monthly Budget Framework
Calculate Your Actual Costs
Stop guessing. Plug your own numbers into an estimate of what your agent setup actually costs per month — then optimize from there.
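A minimal estimator you can run with your own numbers; every default below is a placeholder assumption, not a measurement:

```python
# Monthly agent budget estimator. All defaults are placeholder assumptions;
# replace them with your own usage and current list prices.
def monthly_budget(msgs_per_day: int = 100,
                   input_tokens_per_msg: int = 12_000,  # context incl. system prompt
                   output_tokens_per_msg: int = 600,
                   input_price: float = 3.00,           # $/1M input (Sonnet-class)
                   output_price: float = 15.00) -> float:
    """Project a 30-day spend from daily message volume and token sizes."""
    daily = (msgs_per_day * input_tokens_per_msg / 1e6 * input_price
             + msgs_per_day * output_tokens_per_msg / 1e6 * output_price)
    return daily * 30

print(f"${monthly_budget():.2f}/month")  # -> $135.00/month
```

Note how dominated the total is by input tokens — which is why the context-window tactics earlier in this chapter move the bill more than trimming outputs ever will.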