What is inference cost and how do you estimate it?

Inference cost is the expense of running a trained model to generate predictions or responses in production. For LLM APIs, it's calculated as (input tokens × input price) + (output tokens × output price) per request. A typical GPT-4-class API call costs $0.01–$0.10 depending on prompt length and response size.
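The per-request formula can be sketched as a small helper. The function name and the per-million-token rates below are illustrative assumptions, not real API pricing:

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_1m, output_price_per_1m):
    """Per-request cost: tokens in each direction times that direction's rate
    (rates quoted per 1M tokens, as most LLM APIs do)."""
    return (input_tokens / 1_000_000) * input_price_per_1m \
         + (output_tokens / 1_000_000) * output_price_per_1m

# Assumed rates for illustration: $2.50/M input, $10.00/M output
cost = request_cost(1500, 500, 2.50, 10.00)
print(f"${cost:.5f}")  # prints $0.00875
```

Note that the two rates are kept separate rather than averaged, since output tokens are priced higher than input tokens on most APIs.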

Key Considerations

  • Track input vs. output tokens separately — output tokens are 2–4× more expensive on most APIs
  • Prompt caching can cut costs 50–90% for applications with repeated system prompts or context
  • Smaller models (8B–70B parameters) running on your own GPUs typically break even with API pricing at roughly 10M+ tokens/day
  • Batch API endpoints (available from OpenAI, Anthropic) offer 50% discounts for non-real-time workloads
  • Always estimate monthly cost before launch: multiply average input and output tokens per request by their respective per-token prices, then by expected monthly request volume
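The pre-launch estimate in the last bullet can be sketched as follows. The workload figures and per-million-token rates are assumptions for illustration only:

```python
def monthly_cost_estimate(avg_input_tokens, avg_output_tokens,
                          requests_per_day,
                          input_price_per_1m, output_price_per_1m,
                          days=30):
    """Rough monthly spend: per-request cost (input and output priced
    separately, per 1M tokens) scaled by daily volume and days per month."""
    per_request = (avg_input_tokens / 1_000_000) * input_price_per_1m \
                + (avg_output_tokens / 1_000_000) * output_price_per_1m
    return per_request * requests_per_day * days

# Assumed workload: 2,000 input + 600 output tokens per request, 50k requests/day
# Assumed rates: $2.50/M input, $10.00/M output
estimate = monthly_cost_estimate(2000, 600, 50_000, 2.50, 10.00)
print(f"${estimate:,.2f}/month")  # prints $16,500.00/month
```

Running this kind of estimate against a realistic request mix before launch makes it easy to see whether levers like prompt caching or a batch endpoint would change the budget materially.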