What your LLM cost calculator is getting wrong

The gap between "we should self-host our inference" and the first invoice from your GPU cloud provider is where budgets die. The model is free. The weights are downloadable. The inference cost is a function of five variables most teams never compute correctly: total parameters (not active), KV cache dtype (not weight quantization), concurrent requests, throughput bottlenecks, and replica count.

Most LLM cost calculators get these wrong in predictable ways. They use active parameters for MoE memory. They calculate KV cache in weight bytes instead of KV dtype. They price GPUs by VRAM alone, ignoring whether the GPU can serve the traffic. They treat one replica as sufficient for production.

I built the LLM Deploy Cost Calculator to compute these correctly. It is a Vite + TypeScript + React app. Every number it produces is reproducible via the "Show calculation formulas" panel.

Here are the four mistakes most calculators make, what they cost you, and how this tool handles each one.

Mistake 1: Active parameters for MoE memory

MoE models load every expert at inference time. DeepSeek V4 Pro has 49B active parameters per token and 284B total parameters in weights. A calculator that uses active_params will tell you the model needs ~24 GB of VRAM at Q4. The correct answer is 142 GB. That is a 118 GB gap, the difference between fitting on a single A100 80GB and needing two.

The calculator uses params (total) for VRAM and active_params for throughput. Both numbers come from the model's published technical report, not estimates.
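A minimal sketch of the arithmetic, with an illustrative bytes-per-parameter map (not the calculator's internals):

```typescript
// Bytes per parameter for common quantization levels (illustrative mapping).
const BYTES_PER_PARAM: Record<string, number> = { fp16: 2, q8: 1, q4: 0.5 };

// Weight VRAM in GB. For MoE models, pass the total parameter count
// (every expert is resident), not the active-per-token count.
function weightVramGb(paramsBillion: number, quant: string): number {
  return paramsBillion * BYTES_PER_PARAM[quant]; // billions of params * bytes/param = GB
}

// DeepSeek V4 Pro at Q4:
weightVramGb(49, "q4");  // ~24.5 GB -- the wrong answer (active params)
weightVramGb(284, "q4"); // 142 GB   -- the right answer (total params)
```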

Mistake 2: Weight quantization bytes for KV cache

Most calculators apply the same quantization factor to the KV cache as to the weights. This is wrong. Q4 weights are stored as 4-bit integers. KV cache tensors run in FP16, FP8, or INT4 independently.

On Qwen3-32B at 32K context with 4 concurrent requests, the KV cache in FP16 is 34 GB. The Q4 model weights are only 16 GB. The cache is twice the model. Using 0.5 bytes per KV parameter (because you quantized weights to Q4) would report an 8.5 GB cache. The real answer is 34 GB. You would buy one GPU when you need three.

The calculator has separate controls for weight quantization and KV cache dtype. FP16, FP8, and INT4 produce different numbers because they are different precisions.
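A sketch of the sizing, using the standard 2 × layers × kv_heads × head_dim accounting. The Qwen3-32B dimensions below (64 layers, 8 KV heads, head dim 128) come from its published config and reproduce the 34 GB figure:

```typescript
// Bytes per KV element for each cache dtype -- independent of weight quantization.
const KV_BYTES: Record<string, number> = { fp16: 2, fp8: 1, int4: 0.5 };

// Total KV cache in GB across all concurrent sequences.
function kvCacheGb(
  layers: number, kvHeads: number, headDim: number,
  contextLen: number, concurrency: number, dtype: string,
): number {
  const bytesPerToken = 2 * layers * kvHeads * headDim * KV_BYTES[dtype]; // K and V
  return (bytesPerToken * contextLen * concurrency) / 1e9;
}

// Qwen3-32B, 32K context, 4 concurrent requests:
kvCacheGb(64, 8, 128, 32768, 4, "fp16"); // ~34 GB -- twice the 16 GB of Q4 weights
kvCacheGb(64, 8, 128, 32768, 4, "int4"); // ~8.5 GB -- only if you actually quantize the cache
```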

Mistake 3: No throughput model

Fitting in VRAM is necessary but not sufficient. A 32B model fits on an H100. Can it serve 5,000 calls per day at peak?

Prefill is compute-bound. The formula is MFU x GPU_FP16_TFLOPS / (2 x active_params). Quantization does not help here: kernels dequantize and run FP16 math, so a Q4 model's prefill throughput is identical to FP16.

Decode is memory-bandwidth-bound. The formula is batch_size x HBM_GBW / (weights_bytes + batch_size x KV_per_seq). Quantization does help here. Fewer bytes to read from HBM means higher decode throughput.

The calculator computes both, takes the maximum GPU count, and compares it to the VRAM requirement. The final GPU count is max(gpus_for_prefill, gpus_for_decode, gpus_for_vram). If the bottleneck is throughput, the recommended GPU changes from the cheapest VRAM-fit to the cheapest throughput-fit.
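A sketch of that throughput model; the MFU and H100-class hardware numbers (989 FP16 TFLOPS, 3,350 GB/s HBM) are illustrative, not the calculator's spec tables:

```typescript
// Prefill is compute-bound: tokens/sec per GPU = MFU * FLOPS / (2 * active params).
function prefillTps(mfu: number, fp16Tflops: number, activeParamsBillion: number): number {
  return (mfu * fp16Tflops * 1e12) / (2 * activeParamsBillion * 1e9);
}

// Decode is bandwidth-bound: tokens/sec = batch * HBM GB/s / bytes read per step.
function decodeTps(batch: number, hbmGBs: number, weightsGb: number, kvPerSeqGb: number): number {
  return (batch * hbmGBs) / (weightsGb + batch * kvPerSeqGb);
}

// The final GPU count is the max of the three independent requirements.
function gpusNeeded(forVram: number, forPrefill: number, forDecode: number): number {
  return Math.max(forVram, forPrefill, forDecode);
}

prefillTps(0.4, 989, 32);    // ~6,200 prefill tokens/sec for a 32B dense model at 40% MFU
decodeTps(4, 3350, 16, 8.5); // ~268 decode tokens/sec: batch 4, 16 GB Q4 weights, 8.5 GB FP16 KV per sequence
```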

Mistake 4: No replica multiplier

One GPU has no redundancy. If it crashes, the service goes down. Production deployments need at least two replicas behind a load balancer.

Most calculators show the cost of one GPU. The calculator defaults to 2 replicas for its "Customer Support Bot" preset and shows the total GPU count and cost. Two RTX 3090s at reserved-1y pricing cost $547/month, not the $274/month a single-GPU calculator would quote.

This is the cost that surprises teams. The GPU bill doubles the moment you take HA seriously.
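The replica math itself is simple multiplication; a sketch, with the ~$0.375/hr reserved rate as an assumption rather than the calculator's actual pricing table:

```typescript
// Monthly cost = GPUs per replica * replicas * hourly rate * hours per month.
function monthlyGpuCost(gpusPerReplica: number, replicas: number, hourlyUsd: number): number {
  const HOURS_PER_MONTH = 730;
  return gpusPerReplica * replicas * hourlyUsd * HOURS_PER_MONTH;
}

monthlyGpuCost(1, 1, 0.375); // ~$274 -- what a single-GPU calculator quotes
monthlyGpuCost(1, 2, 0.375); // ~$547 -- what you actually pay once HA is taken seriously
```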

Key insight: The four mistakes compound. A naive calculator says DeepSeek V4 Pro Q4 needs 24 GB of VRAM. The correct answer is 142 GB plus KV cache plus overhead. The naive calculator says one A100. The correct answer is eight A100s, times two replicas, times reserved pricing. The difference between $522/month and $9,825/month is not a rounding error. It is four independent mistakes stacked on top of each other. The naive figure excludes utilization adjustment; the correct figure includes 85% effective utilization.

The break-even you care about is the next GPU

The break-even chart is not a smooth diagonal. It is a step function.

Self-hosted cost is flat until you need another GPU. Then it jumps. API cost rises linearly with volume. The intersection is the break-even point. The calculator draws it as an interactive log-scale chart. Change any input and the chart updates. Try it.

Take the Customer Support Bot preset: Qwen3-14B Q4, 8K context, 8 concurrent, 2 replicas, reserved-1y pricing. The calculator recommends 2x RTX 3090 at $547/month. API with GPT-4o runs $107/month at 500 calls/day. API wins by $440. The break-even is ~2,600 calls/day.

But the interesting question for your SRE is not "when does API become more expensive." It is "how close am I to needing a third GPU, and what does that mean for my on-call rotation?" The jump from 2 GPUs to 3 is the difference between $547/month and $821/month. The chart shows that jump as a vertical step. The gap between steps is where you have headroom. The step itself is where you do not.

Each vertical jump in the chart represents an additional GPU required to meet sustained throughput. The flat regions show capacity headroom. A workload sitting near a step boundary should treat the next-tier GPU as the realistic plan, since traffic spikes will push you over. The "Show calculation formulas" panel in the tool renders the exact prefill TPS, decode TPS, and VRAM formulas with your current inputs substituted in.
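The chart's shape can be sketched in a few lines; the ~$0.007-per-call API price is backed out of the preset's own numbers ($107/month at 500 calls/day) rather than read from the tool:

```typescript
// Self-hosted cost is a step function: flat until the next GPU, then a vertical jump.
function selfHostedMonthly(gpuCount: number, perGpuMonthlyUsd: number): number {
  return gpuCount * perGpuMonthlyUsd;
}

// API cost rises linearly with volume.
function apiMonthly(callsPerDay: number, usdPerCall: number): number {
  return callsPerDay * 30.4 * usdPerCall;
}

// Customer Support Bot preset, using the ~$0.007/call implied by $107/month at 500 calls/day:
selfHostedMonthly(2, 274); // ~$547/month, flat until a third GPU is needed (~$821)
apiMonthly(500, 0.007);    // ~$107/month -- API wins at low volume
apiMonthly(2600, 0.007);   // ~$553/month -- crossing the $547 step: the ~2,600 calls/day break-even
```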

The presets

Five real-world scenarios with production-realistic parameters. No toys.

Customer Support Bot. Qwen3-14B Q4, 8K context, 8 concurrent, 2 replicas, reserved-1y pricing, peak factor 3x. Self-hosted: $547/month (2x RTX 3090). API (GPT-4o): $107/month. API wins by 80% at 500 calls/day. Break-even at ~2,600 calls/day.

Code Assistant. Qwen3-32B Q4, 32K context, 4 concurrent, on-demand pricing, peak 2x. Self-hosted: $1,889/month (2x A100 40GB). API (GPT-4o): $168/month. API wins by 91%. VRAM is 58 GB with 59% consumed by KV cache. Multi-GPU inference required.

Enterprise RAG. DeepSeek V4 Pro Q4, 64K context, 16 concurrent, 2 replicas, reserved-1y pricing. 284B total parameters loaded at Q4 is 142 GB of weights alone. Self-hosted: $9,825/month (16x A100 40GB). API (GPT-4o): $360/month. API wins by 96%. Only makes sense when data privacy rules out API.

Startup MVP. Gemma 3-27B Q4, 16K context, spot pricing, 1 replica. Self-hosted: $331/month (1x A100 40GB). API (GPT-4o): $17/month. API wins by 95%. Break-even at ~893 calls/day. The cheapest real self-hosted deployment still costs 19x more than API at 100 calls/day.

High-Volume API Replacement. DeepSeek V4 Flash Q4, 8K context, 32 concurrent, 2 replicas, reserved-1y pricing. Self-hosted: $7,369/month (12x A100 40GB). API (GPT-4o): $750/month. API wins by 90%. This preset exists to show where self-hosting starts approaching viability: at 5,000+ calls/day with a production MoE model, the per-request cost differential narrows to 10x.

What this tool does not model

Every calculator makes tradeoffs. Here are the ones this tool makes.

Continuous batching efficiency. The throughput model assumes ideal batching. Real vLLM or TensorRT-LLM deployments have scheduling overhead that reduces effective throughput by 10-30%.

Prefix caching. Prompt reuse across calls reduces KV cache pressure. The calculator assumes every call starts from scratch.

Speculative decoding. Some models support draft-then-verify decoding that increases throughput. The calculator does not model this.

Model-parallel vs tensor-parallel split. When multi-GPU is needed, the calculator assumes tensor parallelism (NVLink). Model-parallel splits have different latency characteristics.

Network and egress costs. All GPU pricing assumes same-region. Cross-region or cross-cloud data transfer adds cost not modeled here.

Kubernetes, Triton, or serving-engine overhead. The 15% VRAM overhead covers CUDA context and allocator fragmentation. It does not cover pod overhead, Triton server memory, or Kubernetes daemon sets.

ML engineer salary. The calculator shows GPU rental cost. It does not show the 0.3 FTE of engineering time for monitoring, updates, and incident response. At scale, that is $6,000-12,000/month on top of the GPU bill.

What the dollar figures do not tell you

Read five presets where API wins 80-96% of the time and you might conclude self-hosting is irrational. The math is correct. The conclusion is incomplete.

Sometimes you self-host because you have to. When data residency, regulatory compliance, or contractual obligation forbids sending data to an API, every dollar figure above becomes irrelevant. The calculator stops being a "should we self-host" tool and becomes a "what is the cheapest way to comply" tool. For HIPAA workloads or defense contracts, the comparison is not $9,825 vs $360. It is $9,825 vs "we cannot take this contract."

RTX GPUs are sized for workstations, not production. Consumer GPUs lack NVLink, ECC memory, and datacenter cooling profiles. NVIDIA's EULA forbids datacenter deployment of GeForce and RTX cards. The calculator filters them out of multi-GPU configs for this reason. The Customer Support Bot's recommendation of 2x RTX 3090 works because each replica is a single GPU and tensor parallelism is not needed. Enterprise RAG cannot legally or practically run on 16 consumer cards.

API pricing is not a permanent number. Self-hosted costs scale with hardware deflation and electricity, both predictable. API costs scale with whatever providers decide next quarter. A three-year TCO built on today's GPT-4o rate assumes prices hold or fall. That may be right. It may not. The calculator computes a snapshot. Treat it as one.
