M3 Ultra (60-core GPU) — LLM benchmarks
24 workload results across 21 models.
Fastest known config on M3 Ultra (60-core GPU)
192.5 decode tok/s with stable-code-instruct-3b-4bit via mlx (full run: r_y2_5y8oo97d)
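For context on how a number like this can be reproduced, here is a minimal single-request sketch using mlx-lm's streaming API. The model repo name, prompt, and the one-response-per-token streaming behavior are assumptions; this is not the harness that produced the signed runs below.

```python
# Minimal sketch of a chat-short-style measurement with mlx-lm.
# Assumptions (not the harness behind the tables below): mlx-lm is
# installed, the mlx-community repo name exists, and stream_generate
# yields one response per generated token.
import time

from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/stable-code-instruct-3b-4bit")  # assumed repo name

prompt = "Write a Python function that checks whether a string is a palindrome."
start = time.perf_counter()
ttft = None
n_tokens = 0
for _ in stream_generate(model, tokenizer, prompt, max_tokens=256):
    n_tokens += 1
    if ttft is None:
        ttft = time.perf_counter() - start  # first token: prefill + one decode step
total = time.perf_counter() - start

decode_tps = (n_tokens - 1) / (total - ttft)  # exclude the first token from the decode window
print(f"TTFT: {ttft * 1000:.0f} ms, decode: {decode_tps:.1f} tok/s over {n_tokens} tokens")
```

(`mlx_lm.generate` with `verbose=True` prints its own prompt/generation tok/s accounting, which is usually the easier cross-check.)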
Qwen3-Next-80B-A3B-Instruct-MLX-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 80.34 | 24.49 | 4,493 | r_1pl79r50ofy |
| chat-long | mlx@0.31.3 | 4bit | 77.63 | 1,608.6 | 1,956 | r_1pl79r50ofy |
| concurrent-decode | mlx@0.31.3 | 4bit | 78.60 | — | — | r_1pl79r50ofy |
| agent-trace | mlx@0.31.3 | 4bit | 78.41 | 1,586.3 | 1,318 | r_1pl79r50ofy |
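A quick sanity check on these columns: TTFT is dominated by prefill, so prompt length ≈ TTFT × prefill rate. The sketch below backs out the implied prompt size for the chat-long row; the "TTFT ≈ prefill time" assumption ignores tokenizer, scheduling, and sampling overhead.

```python
# Back-of-envelope check on the chat-long row above.
# Assumption: TTFT is almost entirely prefill, so
#   prompt_tokens ≈ TTFT_seconds * prefill_tok_per_s.
ttft_s = 1.956        # TTFT: 1,956 ms
prefill_tps = 1608.6  # prefill tok/s
print(f"implied prompt length: {ttft_s * prefill_tps:.0f} tokens")  # ~3,146
```

By the same logic, the chat-short prefill figure (24.49 tok/s against a 4,493 ms TTFT) likely reflects fixed startup cost amortized over a short prompt rather than sustained prefill throughput.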
qwen2.5-72b-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 16.31 | 23.25 | 5,635 | r_5c80gthqlh6 |
llama-3.3-70b-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 16.78 | 25.09 | 5,420 | r_sx3a4y9n-m4 |
stable-code-instruct-3b-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 192.5 | 560.7 | 226 | r_y2_5y8oo97d |
Yi-Coder-9B-Chat-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 103.5 | 390.9 | 307 | r_3hvui9a1yuc |
starcoder2-15b-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 64.16 | 220.5 | 490 | r_wsxml_39dh_ |
granite-8b-code-instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 112.3 | 525.7 | 221 | r_bue3bee0gw7 |
Codestral-22B-v0.1-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 47.49 | 205.7 | 559 | r_79dvtag5fd_ |
DeepSeek-Coder-V2-Lite-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 168.3 | 291.5 | 449 | r_l_v1-zq_qaz |
Qwen2.5-Coder-32B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 34.48 | 144.1 | 909 | r_721b4bls_oq |
Qwen2.5-Coder-14B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 70.51 | 306.1 | 428 | r_l36cijqxq4t |
Qwen2.5-Coder-7B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 138.6 | 539.6 | 243 | r_uoehjq0nvc0 |
gpt-oss-20b-MXFP4-Q4
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 152.7 | 239.9 | 692 | r_3ijun8ltjnb |
Qwen3-Coder-30B-A3B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 112.2 | 204.0 | 539 | r_fpsca03u2o_ |
Qwen3-32B-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 34.41 | 95.12 | 1,156 | r_anmmc80-aoq |
Qwen2.5-32B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 34.60 | 141.9 | 923 | r_njgxtgyym1e |
gemma-2-9b-it-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 89.45 | 170.7 | 691 | r_iz137eqvuzy |
Qwen2.5-14B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 70.85 | 301.7 | 434 | r_v4bq1sviz4o |
phi-4-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 74.37 | 255.0 | 443 | r_sqzp-0rdez- |
Llama-3.1-8B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 130.2 | 400.3 | 340 | r_v2pbc0rq2l4 |
Qwen2.5-7B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 139.6 | 190.0 | 689 | r_5r6rhiynenc |
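The "fastest known config" headline at the top is simply the maximum decode rate over these measured rows. A minimal sketch, using a hand-copied subset of the chat-short numbers above as data:

```python
# Pick the fastest measured config by decode tok/s.
# Data: hand-copied subset of the chat-short rows above.
results = [
    ("stable-code-instruct-3b-4bit", 192.5),
    ("DeepSeek-Coder-V2-Lite-Instruct-4bit", 168.3),
    ("gpt-oss-20b-MXFP4-Q4", 152.7),
    ("Qwen2.5-7B-Instruct-4bit", 139.6),
]
model, tps = max(results, key=lambda row: row[1])
print(f"fastest: {model} at {tps} decode tok/s")  # stable-code-instruct-3b-4bit at 192.5
```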
Community folklore on M3 Ultra (60-core GPU)
52 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.
- 18.50 tok/s: Qwen3-Coder on M3 Ultra via mlx (community confidence: 70%)
  “m3 ultra with lmstudio-community/Qwen3-Coder-480B-A35B-Instruct-MLX-6bit 256k context len 18.50 token/s 1000 tokens first token 12.44 s”
- 38.00 tok/s: MiniMax-M2.5-4bit (229B MoE) on M3 Ultra via vllm (community confidence: 70%; the source quote attributes this rate to MiniMax-M2.5, not Qwen3-Coder-Next)
  “: - Decode: 65 tok/s - Prefill: 1090-1440 tok/s - TTFT (cache hit, 33K context): 0.3s MiniMax-M2.5-4bit (229B MoE): - Decode: 33-38 tok/s - Deep reasoning with tool calling I built this to run OpenClaw locally on my Mac instead of paying for cloud APIs. Qwen3-Code…”
- 65.00 tok/s: Qwen3-Coder-Next on M3 Ultra via vllm (community confidence: 70%)
  “upstream had minimal test coverage. Benchmarks (Mac Studio M3 Ultra, 256GB): Qwen3-Coder-Next-6bit (80B MoE, 3B active): - Decode: 65 tok/s - Prefill: 1090-1440 tok/s - TTFT (cache hit, 33K context): 0.3s MiniMax-M2.5-4bit (229B MoE): - Decode: 33-38 tok/s - De…”
Models measured on M3 Ultra (60-core GPU)
- Qwen3-Next-80B-A3B-Instruct-MLX-4bit benchmarks
- qwen2.5-72b-Instruct-4bit benchmarks
- llama-3.3-70b-Instruct-4bit benchmarks
- stable-code-instruct-3b-4bit benchmarks
- Yi-Coder-9B-Chat-4bit benchmarks
- starcoder2-15b-4bit benchmarks
- granite-8b-code-instruct-4bit benchmarks
- Codestral-22B-v0.1-4bit benchmarks
- DeepSeek-Coder-V2-Lite-Instruct-4bit benchmarks
- Qwen2.5-Coder-32B-Instruct-4bit benchmarks
- Qwen2.5-Coder-14B-Instruct-4bit benchmarks
- Qwen2.5-Coder-7B-Instruct-4bit benchmarks
- gpt-oss-20b-MXFP4-Q4 benchmarks
- Qwen3-Coder-30B-A3B-Instruct-4bit benchmarks
- Qwen3-32B-4bit benchmarks
- Qwen2.5-32B-Instruct-4bit benchmarks
- gemma-2-9b-it-4bit benchmarks
- Qwen2.5-14B-Instruct-4bit benchmarks
- phi-4-4bit benchmarks
- Llama-3.1-8B-Instruct-4bit benchmarks
- Qwen2.5-7B-Instruct-4bit benchmarks
Common questions about M3 Ultra (60-core GPU)
Direct Q&A drawn from the runs above: fastest LLM, supported model classes, backend rankings, quantization guidance.