Local vs hosted: when does buying a GPU pay off?
At low usage, hosted APIs win on $/Mtok. At high sustained usage, a 4090 or M3 Ultra wins. Here's the break-even math, run against live numbers.
The reference local rig on the leaderboard is an M3 Ultra (60-core GPU) at 168.3 tok/s. At ~$0.40 per million output tokens (a typical hosted price for a 70B-class model), a $2k GPU breaks even on output cost alone after roughly 5 billion generated output tokens; power draw and duty cycle move that number in either direction. The per-rig table below lets you redo the arithmetic for your own usage.
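That headline figure is just division; here is a minimal sketch of the same arithmetic (the function name is illustrative, and it deliberately ignores power, depreciation, and input tokens):

```python
def break_even_tokens(hardware_cost_usd: float, hosted_price_per_mtok: float) -> float:
    """Output tokens you must generate locally before the hardware cost
    equals what the same output would have cost on a hosted API.
    Output cost only: power, depreciation, and input tokens are ignored."""
    return hardware_cost_usd / hosted_price_per_mtok * 1_000_000

# $2,000 GPU vs ~$0.40 per million output tokens
print(break_even_tokens(2000, 0.40))  # → 5000000000.0 (5 billion tokens)
```

Plug in your own hardware price and the current hosted rate; the break-even point scales linearly with both.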
We don't sell hardware and we don't take affiliate commissions on hosted APIs, so the framing is just arithmetic. A consumer GPU's break-even point against a hosted endpoint depends on three things: your sustained decode tok/s, the hosted price per million output tokens, and your duty cycle. Below is a comparison table with each row anchored to a real submitted benchmark.
Submitted benchmarks
| Hardware | Model | decode tok/s | Run |
|---|---|---|---|
| M3 Ultra (60-core GPU) | mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit | 168.3 tok/s | r_l_v1-zq_qaz |
| RTX 5090 (32GB) | Qwen3.6-27B-Q4_K_M.gguf | 69.89 tok/s | r_bqsunbd6xa8 |
| M3 Pro (18-core GPU) | mlx-community/Qwen2.5-7B-Instruct-4bit | 30.52 tok/s | r_llzv_g-ymaf |
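To turn a token count into calendar time, divide by your sustained decode rate and duty cycle. A minimal sketch, run against the M3 Ultra row above (the function name and the 100% duty-cycle assumption are illustrative):

```python
def years_to_break_even(tokens: float, decode_tok_s: float, duty_cycle: float) -> float:
    """Wall-clock years to generate `tokens` output tokens at a sustained
    `decode_tok_s` rate, decoding `duty_cycle` fraction of the time."""
    seconds = tokens / (decode_tok_s * duty_cycle)
    return seconds / (365 * 24 * 3600)

# M3 Ultra at 168.3 tok/s, decoding 24/7, vs the 5-billion-token break-even
print(round(years_to_break_even(5e9, 168.3, 1.0), 1))  # → 0.9
```

At a more realistic 10% duty cycle the same rig takes roughly ten times longer to break even, which is why duty cycle dominates the decision for intermittent workloads.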
See also: All hardware · All models · Methodology