M3 Ultra (60-core GPU) — LLM benchmarks
24 workload results across 21 models.
Fastest known config on M3 Ultra (60-core GPU)
192.5 decode tok/s with stable-code-instruct-3b-4bit via mlx (full run: r_y2_5y8oo97d)
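For context on how a number like this can be reproduced, here is a minimal single-request sketch using mlx-lm's streaming API. The model repo name, prompt, and the one-response-per-token streaming behavior are assumptions; this is not the harness that produced the signed runs below.

```python
# Minimal sketch of a chat-short-style measurement with mlx-lm.
# Assumptions (not the harness behind the tables below): mlx-lm is
# installed, the mlx-community repo name exists, and stream_generate
# yields one response per generated token.
import time

from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/stable-code-instruct-3b-4bit")  # assumed repo name

prompt = "Write a Python function that checks whether a string is a palindrome."
start = time.perf_counter()
ttft = None
n_tokens = 0
for _ in stream_generate(model, tokenizer, prompt, max_tokens=256):
    n_tokens += 1
    if ttft is None:
        ttft = time.perf_counter() - start  # first token: prefill + one decode step
total = time.perf_counter() - start

decode_tps = (n_tokens - 1) / (total - ttft)  # exclude the first token from the decode window
print(f"TTFT: {ttft * 1000:.0f} ms, decode: {decode_tps:.1f} tok/s over {n_tokens} tokens")
```

(`mlx_lm.generate` with `verbose=True` prints its own prompt/generation tok/s accounting, which is usually the easier cross-check.)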
Qwen3-Next-80B-A3B-Instruct-MLX-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 80.34 | 24.49 | 4,493 | r_1pl79r50ofy |
| chat-long | mlx@0.31.3 | 4bit | 77.63 | 1,608.6 | 1,956 | r_1pl79r50ofy |
| concurrent-decode | mlx@0.31.3 | 4bit | 78.60 | — | — | r_1pl79r50ofy |
| agent-trace | mlx@0.31.3 | 4bit | 78.41 | 1,586.3 | 1,318 | r_1pl79r50ofy |
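A quick sanity check on these columns: TTFT is dominated by prefill, so prompt length ≈ TTFT × prefill rate. The sketch below backs out the implied prompt size for the chat-long row; the "TTFT ≈ prefill time" assumption ignores tokenizer, scheduling, and sampling overhead.

```python
# Back-of-envelope check on the chat-long row above.
# Assumption: TTFT is almost entirely prefill, so
#   prompt_tokens ≈ TTFT_seconds * prefill_tok_per_s.
ttft_s = 1.956        # TTFT: 1,956 ms
prefill_tps = 1608.6  # prefill tok/s
print(f"implied prompt length: {ttft_s * prefill_tps:.0f} tokens")  # ~3,146
```

By the same logic, the chat-short prefill figure (24.49 tok/s against a 4,493 ms TTFT) likely reflects fixed startup cost amortized over a short prompt rather than sustained prefill throughput.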
qwen2.5-72b-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 16.31 | 23.25 | 5,635 | r_5c80gthqlh6 |
llama-3.3-70b-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 16.78 | 25.09 | 5,420 | r_sx3a4y9n-m4 |
stable-code-instruct-3b-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 192.5 | 560.7 | 226 | r_y2_5y8oo97d |
Yi-Coder-9B-Chat-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 103.5 | 390.9 | 307 | r_3hvui9a1yuc |
starcoder2-15b-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 64.16 | 220.5 | 490 | r_wsxml_39dh_ |
granite-8b-code-instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 112.3 | 525.7 | 221 | r_bue3bee0gw7 |
Codestral-22B-v0.1-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 47.49 | 205.7 | 559 | r_79dvtag5fd_ |
DeepSeek-Coder-V2-Lite-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 168.3 | 291.5 | 449 | r_l_v1-zq_qaz |
Qwen2.5-Coder-32B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 34.48 | 144.1 | 909 | r_721b4bls_oq |
Qwen2.5-Coder-14B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 70.51 | 306.1 | 428 | r_l36cijqxq4t |
Qwen2.5-Coder-7B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 138.6 | 539.6 | 243 | r_uoehjq0nvc0 |
gpt-oss-20b-MXFP4-Q4
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 152.7 | 239.9 | 692 | r_3ijun8ltjnb |
Qwen3-Coder-30B-A3B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 112.2 | 204.0 | 539 | r_fpsca03u2o_ |
Qwen3-32B-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 34.41 | 95.12 | 1,156 | r_anmmc80-aoq |
Qwen2.5-32B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 34.60 | 141.9 | 923 | r_njgxtgyym1e |
gemma-2-9b-it-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 89.45 | 170.7 | 691 | r_iz137eqvuzy |
Qwen2.5-14B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 70.85 | 301.7 | 434 | r_v4bq1sviz4o |
phi-4-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 74.37 | 255.0 | 443 | r_sqzp-0rdez- |
Llama-3.1-8B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 130.2 | 400.3 | 340 | r_v2pbc0rq2l4 |
Qwen2.5-7B-Instruct-4bit
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 139.6 | 190.0 | 689 | r_5r6rhiynenc |
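The "fastest known config" headline at the top is simply the maximum decode rate over these measured rows. A minimal sketch, using a hand-copied subset of the chat-short numbers above as data:

```python
# Pick the fastest measured config by decode tok/s.
# Data: hand-copied subset of the chat-short rows above.
results = [
    ("stable-code-instruct-3b-4bit", 192.5),
    ("DeepSeek-Coder-V2-Lite-Instruct-4bit", 168.3),
    ("gpt-oss-20b-MXFP4-Q4", 152.7),
    ("Qwen2.5-7B-Instruct-4bit", 139.6),
]
model, tps = max(results, key=lambda row: row[1])
print(f"fastest: {model} at {tps} decode tok/s")  # stable-code-instruct-3b-4bit at 192.5
```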
Community folklore on M3 Ultra (60-core GPU)
52 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.
- 18.50 tok/s: Qwen3-Coder on M3 Ultra via mlx (community confidence: 70%)
  “m3 ultra with lmstudio-community/Qwen3-Coder-480B-A35B-Instruct-MLX-6bit 256k context len 18.50 token/s 1000 tokens first token 12.44 s”
- 38.00 tok/s: MiniMax-M2.5-4bit (229B MoE) on M3 Ultra via vllm (community confidence: 70%; the source quote attributes this rate to MiniMax-M2.5, not Qwen3-Coder-Next)
  “: - Decode: 65 tok/s - Prefill: 1090-1440 tok/s - TTFT (cache hit, 33K context): 0.3s MiniMax-M2.5-4bit (229B MoE): - Decode: 33-38 tok/s - Deep reasoning with tool calling I built this to run OpenClaw locally on my Mac instead of paying for cloud APIs. Qwen3-Code…”
- 65.00 tok/s: Qwen3-Coder-Next on M3 Ultra via vllm (community confidence: 70%)
  “upstream had minimal test coverage. Benchmarks (Mac Studio M3 Ultra, 256GB): Qwen3-Coder-Next-6bit (80B MoE, 3B active): - Decode: 65 tok/s - Prefill: 1090-1440 tok/s - TTFT (cache hit, 33K context): 0.3s MiniMax-M2.5-4bit (229B MoE): - Decode: 33-38 tok/s - De…”
Models measured on M3 Ultra (60-core GPU)
- Qwen3-Next-80B-A3B-Instruct-MLX-4bit benchmarks
- qwen2.5-72b-Instruct-4bit benchmarks
- llama-3.3-70b-Instruct-4bit benchmarks
- stable-code-instruct-3b-4bit benchmarks
- Yi-Coder-9B-Chat-4bit benchmarks
- starcoder2-15b-4bit benchmarks
- granite-8b-code-instruct-4bit benchmarks
- Codestral-22B-v0.1-4bit benchmarks
- DeepSeek-Coder-V2-Lite-Instruct-4bit benchmarks
- Qwen2.5-Coder-32B-Instruct-4bit benchmarks
- Qwen2.5-Coder-14B-Instruct-4bit benchmarks
- Qwen2.5-Coder-7B-Instruct-4bit benchmarks
- gpt-oss-20b-MXFP4-Q4 benchmarks
- Qwen3-Coder-30B-A3B-Instruct-4bit benchmarks
- Qwen3-32B-4bit benchmarks
- Qwen2.5-32B-Instruct-4bit benchmarks
- gemma-2-9b-it-4bit benchmarks
- Qwen2.5-14B-Instruct-4bit benchmarks
- phi-4-4bit benchmarks
- Llama-3.1-8B-Instruct-4bit benchmarks
- Qwen2.5-7B-Instruct-4bit benchmarks
Common questions about M3 Ultra (60-core GPU)
Direct Q&A drawn from the runs above: fastest LLM, supported model classes, backend rankings, quantization guidance.