The Slack Message That Started Everything
It was a Thursday morning when our operations lead dropped a message in the #engineering channel: "Our OpenAI bill just crossed $12,000 this month. Last month it was $8,400. What is happening?"
What was happening was success. Our AI-powered features were gaining traction. More users meant more API calls. More API calls meant more tokens. More tokens meant a bill growing more than 40% month over month with no ceiling in sight.
We were building an intelligent assistant for a SaaS platform (if you're curious about what it costs to build a SaaS like this in 2026, I broke it all down) — think smart search, automated document analysis, and conversational support. The features were driving real engagement and retention. Killing them was not an option. But at $12K per month and climbing, the economics did not work either.
Over the next six weeks, we cut that bill to $3,200 per month — a 73% reduction — while maintaining the same output quality. Here is exactly how we did it.
Step Zero: Auditing Where the Money Goes
Before optimizing anything, we needed to understand where the tokens were being spent. This sounds obvious, but most teams skip this step. They start optimizing prompts randomly without knowing which calls cost the most.
We added simple logging to every API call: endpoint, model used, input tokens, output tokens, total cost, and response time. After one week of data, the picture was clear:
| Feature | % of Total Cost | Avg Tokens/Call | Calls/Day |
|---|---|---|---|
| Document Analysis | 42% | 4,200 | 850 |
| Conversational Support | 31% | 2,800 | 1,200 |
| Smart Search | 18% | 1,100 | 2,400 |
| Content Generation | 9% | 3,500 | 180 |
Two features — document analysis and conversational support — were eating 73% of our budget. That is where we focused first.
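For reference, here is a minimal sketch of the kind of per-call logging we added, assuming the OpenAI Python SDK; the `PRICE_PER_1M` table and the `logged_chat` helper are illustrative, and you would substitute your provider's current rates and your own metrics sink.

```python
import time
from openai import OpenAI

client = OpenAI()

# Hypothetical per-million-token prices; replace with your provider's real rates.
PRICE_PER_1M = {"gpt-4o": {"input": 2.50, "output": 10.00},
                "gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def logged_chat(feature: str, model: str, messages: list[dict]) -> str:
    """Call the chat API and record per-call token usage, cost, and latency."""
    start = time.time()
    response = client.chat.completions.create(model=model, messages=messages)
    elapsed = time.time() - start

    usage = response.usage
    prices = PRICE_PER_1M[model]
    cost = (usage.prompt_tokens * prices["input"]
            + usage.completion_tokens * prices["output"]) / 1_000_000

    # Ship this wherever you aggregate metrics; we used a plain database table.
    print({"feature": feature, "model": model,
           "input_tokens": usage.prompt_tokens,
           "output_tokens": usage.completion_tokens,
           "cost_usd": round(cost, 6), "latency_s": round(elapsed, 2)})
    return response.choices[0].message.content
```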
Technique 1: Prompt Caching (40% Savings Alone)
This was our single biggest win, and it is shockingly underused.
Prompt caching works on a simple principle: if you send the same system prompt and context to the API repeatedly, the provider can cache the processed version and skip redundant computation. Anthropic and OpenAI both support this now, but many teams do not take advantage of it.
In our document analysis feature, every call included a 2,000-token system prompt that described the analysis framework, output format, and domain-specific rules. This prompt was identical across every single call. We were paying to process those 2,000 tokens 850 times per day — that is 1.7 million tokens daily just for the same instructions, over and over.
What We Changed
We restructured our API calls to separate the static system prompt from the dynamic user content. By using prompt caching headers, the API provider cached our system prompt after the first call. Subsequent calls only processed the new, unique content.
- Before: 4,200 full-price input tokens per call (2,000 system + 2,200 content)
- After: 2,200 full-price input tokens per call, with the 2,000-token system prompt read from cache at roughly a 90% discount
For document analysis alone, this saved us roughly $2,100 per month.
We applied the same approach to conversational support, where an 1,800-token system prompt was being sent with every message in every conversation. That saved another $1,400 per month.
Implementation Tip
Structure your prompts with a clear separation between static instructions and dynamic content. Put everything that does not change between calls into the system prompt, and keep user-specific content in the user messages. This maximizes cache hit rates.
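As a concrete sketch, here is roughly what that separation looks like with Anthropic's prompt caching; the model name, `STATIC_SYSTEM_PROMPT`, and the `analyze` helper are placeholders. (OpenAI caches long, stable prompt prefixes automatically, so no explicit marker is needed there.)

```python
import anthropic

client = anthropic.Anthropic()

STATIC_SYSTEM_PROMPT = "..."  # the 2,000-token framework, identical on every call

def analyze(document_chunk: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use your model
        max_tokens=1500,
        # Marking the static block as cacheable lets the provider reuse the
        # processed prompt; cached reads are billed at a steep discount.
        system=[{
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }],
        # Only the dynamic, per-request content goes in the user message.
        messages=[{"role": "user", "content": document_chunk}],
    )
    return response.content[0].text
```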
Technique 2: Smart Model Routing (22% Additional Savings)
Not every request needs GPT-4 (or Claude Opus). This is the second most common waste pattern I see: teams default to the most powerful model for everything, regardless of task complexity.
We analyzed our request patterns and found that 60% of our conversational support queries were simple — FAQ-type questions, status checks, basic information retrieval. These did not need a frontier model. A smaller, faster, cheaper model handled them just as well.
Our Routing Logic
We built a lightweight classifier (rule-based, not ML — keep it simple) that categorized incoming requests:
- Simple queries (FAQ, status, basic info) → Smaller model (GPT-4o-mini / Claude Haiku) at roughly 10% of the cost
- Complex queries (analysis, multi-step reasoning, nuanced responses) → Frontier model (GPT-4o / Claude Sonnet)
- Critical tasks (document analysis, content generation requiring high accuracy) → Best available model
The classifier used keyword matching, query length, and conversation context to route. It was not perfect — about 8% of queries got misrouted — but the cost savings vastly outweighed the occasional suboptimal response.
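Here is a minimal sketch of that router; the keyword list, length threshold, and tier names are illustrative rather than our production values.

```python
SIMPLE_KEYWORDS = {"status", "price", "hours", "reset", "cancel", "how do i"}

def route(query: str, is_critical_task: bool = False) -> str:
    """Pick a model tier from cheap heuristics: task type, keywords, length."""
    if is_critical_task:
        return "best"      # document analysis, high-accuracy generation
    q = query.lower()
    if len(q.split()) < 20 and any(kw in q for kw in SIMPLE_KEYWORDS):
        return "small"     # FAQ-style: GPT-4o-mini / Claude Haiku tier
    return "frontier"      # everything else: GPT-4o / Claude Sonnet tier
```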
Monthly savings: ~$1,800
Important Nuance
We added a feedback mechanism: if a user asked a follow-up question after getting a response from the smaller model, we automatically escalated subsequent messages to the frontier model. This caught most cases where the cheaper model was not sufficient.
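The escalation rule itself was only a few lines; this sketch assumes a `session` dict for conversation state and the `route` helper from the previous sketch.

```python
def pick_model(session: dict, query: str) -> str:
    """Escalate to the frontier model once a cheap answer draws a follow-up."""
    if session.get("escalated"):
        return "frontier"
    if session.get("last_tier") == "small":
        # Any follow-up right after a small-model answer is treated as a
        # signal that the cheap response fell short; escalate from here on.
        session["escalated"] = True
        return "frontier"
    tier = route(query)  # the rule-based router sketched above
    session["last_tier"] = tier
    return tier
```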
Technique 3: Response Streaming and Early Termination
This is a subtle one. With streaming responses, you pay only for tokens generated before you stop the stream. If you can detect that a response is going off-track or has already delivered the needed information, you can terminate the stream early and save the remaining output tokens.
In our smart search feature, the model sometimes generated long explanatory responses when the user just needed a list of results. We added logic to detect when the core answer had been delivered (usually within the first 200-300 tokens) and truncated the rest.
We also set strict max_tokens limits per feature instead of relying on generous defaults (a combined sketch follows the list):
- Smart search: max 500 tokens (was 2,000)
- Conversational support: max 800 tokens (was 2,000)
- Document analysis: max 1,500 tokens (was 4,000)
- Content generation: kept at 4,000 (needs full output)
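Here is the combined sketch, assuming the OpenAI streaming API; the `MAX_TOKENS` table mirrors the limits above, but the cutoff heuristic for smart search is illustrative.

```python
from openai import OpenAI

client = OpenAI()

# Per-feature output caps, matching the limits listed above.
MAX_TOKENS = {"smart_search": 500, "support": 800,
              "doc_analysis": 1500, "content_gen": 4000}

def stream_answer(feature: str, messages: list[dict]) -> str:
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # whichever model the router picked
        messages=messages,
        max_tokens=MAX_TOKENS[feature],
        stream=True,
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue
        parts.append(chunk.choices[0].delta.content or "")
        text = "".join(parts)
        # Illustrative cutoff: once a search response has delivered its core
        # answer (~200 tokens), stop paying for an unwanted explanation.
        if feature == "smart_search" and len(text) > 800 and "\n\n" in text:
            stream.close()  # stop the stream; no further output is generated
            break
    return "".join(parts)
```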
Monthly savings: ~$600
Technique 4: Embedding-Based Pre-Filtering
Our document analysis feature was sending entire documents to the LLM for analysis. A 10-page document might be 8,000 tokens. But often, the user's question only related to one or two sections.
We added a pre-filtering step using text embeddings, which are far cheaper than LLM calls (a sketch of the retrieval step follows the list):
- Split documents into chunks during upload
- Generate embeddings for each chunk (one-time cost)
- When a user asks a question, embed the question and find the most relevant chunks using vector similarity
- Send only the relevant chunks to the LLM, not the entire document
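Condensed into code, the retrieval step looks something like this, using OpenAI's embeddings endpoint and numpy for the similarity math; chunking and storage are elided.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def top_chunks(question: str, chunks: list[str],
               chunk_vectors: np.ndarray, k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question (cosine similarity)."""
    q = embed([question])[0]
    scores = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# At upload time: chunks = split(document); vectors = embed(chunks)  (one-time)
# At question time: send only top_chunks(question, chunks, vectors) to the LLM.
```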
This reduced the average input tokens for document analysis from 4,200 to 1,800. The embedding calls cost roughly $15/month total — trivial compared to the LLM savings.
Monthly savings: ~$900
Technique 5: Batch Processing for Non-Real-Time Tasks
Our content generation feature ran on-demand — a user clicked a button and waited for the AI to generate content. But we discovered that many of these requests were not actually time-sensitive. Users would generate content and come back to review it hours later.
We moved content generation to a batch processing queue. Requests were queued and processed during off-peak hours through the Batch API; most providers discount batch workloads by 50%.
For the 180 daily content generation calls, this was straightforward: queue the request, notify the user when it is ready (usually within 30 minutes), and let them review at their convenience.
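Submitting a batch looks roughly like this with OpenAI's Batch API; the `custom_id` scheme and request bodies are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI()

def submit_batch(requests: list[dict]) -> str:
    """Queue content-generation requests for discounted batch processing."""
    # Each line is one chat-completions request in the Batch API's JSONL format.
    with open("batch_input.jsonl", "w") as f:
        for i, req in enumerate(requests):
            f.write(json.dumps({
                "custom_id": f"contentgen-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {"model": "gpt-4o", "messages": req["messages"],
                         "max_tokens": 4000},
            }) + "\n")

    with open("batch_input.jsonl", "rb") as f:
        batch_file = client.files.create(file=f, purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",  # the standard window
    )
    return batch.id  # poll client.batches.retrieve(batch.id), then notify the user
```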
Monthly savings: ~$400
The Before and After
Here is the complete picture after implementing all five techniques:
| Metric | Before | After | Change |
|---|---|---|---|
| Monthly API Cost | $12,000 | $3,200 | -73% |
| Avg Tokens per Request | 2,850 | 1,100 | -61% |
| Avg Response Time | 2.4s | 1.8s | -25% |
| User Satisfaction Score | 4.2/5 | 4.3/5 | +2% |
| Daily API Calls | 4,630 | 4,580 | -1% |
Notice that user satisfaction actually went up. Faster response times (from caching and streaming) improved the experience even as we cut costs.
The Quality Check: How We Verified Nothing Degraded
Cost cutting means nothing if your product gets worse. Here is how we measured quality throughout the process:
1. A/B Testing Each Change
Every technique was rolled out behind a feature flag. We ran the optimized version alongside the original for one week per change, comparing:
- User satisfaction ratings (thumbs up/down on responses)
- Follow-up question rate (high rate = first response was inadequate)
- Task completion rate (did users achieve their goal?)
2. Automated Quality Scoring
We built a simple evaluation pipeline that sampled 100 responses daily and scored them against reference answers using a separate LLM call. Any drop of more than 5% against the baseline triggered an alert.
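A stripped-down version of that scoring loop; the judge prompt, sample source, and model choice here are illustrative.

```python
import random
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = ("Score how well CANDIDATE matches REFERENCE in accuracy and "
                "completeness, from 0 to 10. Reply with only the number.\n\n"
                "REFERENCE:\n{ref}\n\nCANDIDATE:\n{cand}")

def daily_quality_score(logged_pairs: list[tuple[str, str]],
                        sample_size: int = 100) -> float:
    """Sample (reference, candidate) pairs and score them with an LLM judge."""
    sample = random.sample(logged_pairs, min(sample_size, len(logged_pairs)))
    scores = []
    for ref, cand in sample:
        judged = client.chat.completions.create(
            model="gpt-4o-mini",  # a cheap judge is fine for trend detection
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(ref=ref, cand=cand)}],
            max_tokens=3,
        )
        # A sketch: production code should handle non-numeric judge replies.
        scores.append(float(judged.choices[0].message.content.strip()))
    return sum(scores) / len(scores)

# Alert if today's average drops more than 5% below the rolling baseline.
```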
3. User Feedback Monitoring
We tracked support tickets mentioning AI features. Any spike in complaints after a change meant we rolled it back immediately.
Across all five techniques, quality stayed within 2% of baseline. The model routing change had the highest variance (8% misroute rate), but the escalation mechanism caught most issues before users noticed.
The Key Takeaway
AI cost optimization is an architecture problem, not a prompt problem.
Most articles about reducing AI costs focus on prompt engineering — making prompts shorter, more efficient, better structured. That matters, but it is optimizing at the wrong level.
The real savings come from architectural decisions:
- What model handles what request? (routing)
- What data reaches the model? (pre-filtering)
- What computation can be reused? (caching)
- What output is actually needed? (streaming and truncation)
- What work can happen asynchronously? (batching)
These are engineering decisions, not prompt decisions. And they require someone who understands both the AI capabilities and the system architecture — exactly why your CTO should still write code.
If your AI costs are climbing and you are only tweaking prompts, you are leaving 60-70% of potential savings on the table.
Struggling with AI costs in your product? I help companies architect AI features that scale without breaking the budget. Let us talk about your setup.
