AI
What is inference cost and how do you estimate it?
Inference cost is the expense of running a trained model to generate predictions or responses in production. For LLM APIs, it is calculated per request as (input tokens × input price) + (output tokens × output price), with prices typically quoted per thousand or per million tokens. A typical GPT-4-class API call costs $0.01–$0.10 depending on prompt length and response size.
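The per-request formula above can be sketched as a small helper. The prices here are illustrative placeholders, not any provider's actual rates:

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_1k=0.0025, output_price_per_1k=0.01):
    """Dollar cost of one API request: input and output tokens
    are priced separately (prices assumed quoted per 1K tokens)."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# Example: a 1,500-token prompt with a 500-token response.
print(f"${request_cost(1500, 500):.4f}")
```

Swap in your provider's published rates; the structure — separate input and output terms — is the same across vendors.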
Key Considerations
- Track input vs. output tokens separately — output tokens typically cost 2–4× more than input tokens on most APIs
- Prompt caching can cut costs 50–90% for applications with repeated system prompts or context
- Smaller models (8B–70B parameters) running on your own GPUs break even versus API pricing at roughly 10M+ tokens/day
- Batch API endpoints (available from OpenAI, Anthropic) offer 50% discounts for non-real-time workloads
- Always estimate monthly cost before launch: multiply average tokens per request by expected request volume and the per-token price
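The pre-launch estimate in the last bullet can be sketched as below. All traffic numbers and prices are illustrative assumptions:

```python
def monthly_cost(requests_per_day, avg_input_tokens, avg_output_tokens,
                 input_price_per_1k, output_price_per_1k, days=30):
    """Projected monthly spend: per-request cost × daily volume × days."""
    per_request = (avg_input_tokens / 1000) * input_price_per_1k \
                + (avg_output_tokens / 1000) * output_price_per_1k
    return per_request * requests_per_day * days

# Example: 50,000 requests/day, averaging 1,200 input + 400 output tokens.
estimate = monthly_cost(50_000, 1200, 400, 0.0025, 0.01)
print(f"${estimate:,.2f}/month")
```

Running this kind of projection against a few traffic scenarios (launch, 10×, 100×) is also how the self-hosting break-even point in the bullets above is usually found.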