llm-speed

M3 Ultra (60-core GPU) — LLM benchmarks

24 workload results across 21 models.

Fastest known config on M3 Ultra (60-core GPU)

192.5 decode tok/s

stable-code-instruct-3b-4bit via mlx (see full run)
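The "fastest known config" is simply the run with the highest decode throughput across all workload results. A minimal sketch of that selection, using a few decode figures taken from the tables below (the list structure is illustrative, not the site's actual data model):

```python
# Pick the fastest config by decode throughput (tok/s).
# (model, decode_tok_s) pairs are taken from the tables on this page.
runs = [
    ("stable-code-instruct-3b-4bit", 192.5),
    ("DeepSeek-Coder-V2-Lite-Instruct-4bit", 168.3),
    ("gpt-oss-20b-MXFP4-Q4", 152.7),
    ("Qwen2.5-Coder-7B-Instruct-4bit", 138.6),
]

# max() over the decode column gives the headline number above.
fastest = max(runs, key=lambda r: r[1])
print(fastest)  # ('stable-code-instruct-3b-4bit', 192.5)
```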

Qwen3-Next-80B-A3B-Instruct-MLX-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 80.34 | 24.49 | 4,493 ms | r_1pl79r50ofy |
| chat-long | mlx@0.31.3 | 4bit | 77.63 | 1,608.6 | 1,956 ms | r_1pl79r50ofy |
| concurrent-decode | mlx@0.31.3 | 4bit | 78.60 | no data | no data | r_1pl79r50ofy |
| agent-trace | mlx@0.31.3 | 4bit | 78.41 | 1,586.3 | 1,318 ms | r_1pl79r50ofy |
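Decode rate, prefill rate, and TTFT combine into end-to-end response time: roughly TTFT plus output tokens divided by decode rate. A quick sanity check using the chat-long figures above (77.63 decode tok/s, 1,956 ms TTFT); the function name and the 512-token reply length are illustrative assumptions:

```python
def total_latency_s(ttft_ms: float, decode_tok_s: float, output_tokens: int) -> float:
    """End-to-end response time: time to first token plus decode time."""
    return ttft_ms / 1000.0 + output_tokens / decode_tok_s

# chat-long row: 1,956 ms TTFT, 77.63 decode tok/s.
# A 512-token reply would take roughly:
t = total_latency_s(1956, 77.63, 512)
print(f"{t:.1f} s")  # 8.6 s
```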

qwen2.5-72b-Instruct-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 16.31 | 23.25 | 5,635 ms | r_5c80gthqlh6 |

llama-3.3-70b-Instruct-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 16.78 | 25.09 | 5,420 ms | r_sx3a4y9n-m4 |

stable-code-instruct-3b-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 192.5 | 560.7 | 226 ms | r_y2_5y8oo97d |

Yi-Coder-9B-Chat-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 103.5 | 390.9 | 307 ms | r_3hvui9a1yuc |

starcoder2-15b-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 64.16 | 220.5 | 490 ms | r_wsxml_39dh_ |

granite-8b-code-instruct-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 112.3 | 525.7 | 221 ms | r_bue3bee0gw7 |

Codestral-22B-v0.1-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 47.49 | 205.7 | 559 ms | r_79dvtag5fd_ |

DeepSeek-Coder-V2-Lite-Instruct-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 168.3 | 291.5 | 449 ms | r_l_v1-zq_qaz |

Qwen2.5-Coder-32B-Instruct-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 34.48 | 144.1 | 909 ms | r_721b4bls_oq |

Qwen2.5-Coder-14B-Instruct-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 70.51 | 306.1 | 428 ms | r_l36cijqxq4t |

Qwen2.5-Coder-7B-Instruct-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 138.6 | 539.6 | 243 ms | r_uoehjq0nvc0 |

gpt-oss-20b-MXFP4-Q4

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | MXFP4-Q4 | 152.7 | 239.9 | 692 ms | r_3ijun8ltjnb |

Qwen3-Coder-30B-A3B-Instruct-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 112.2 | 204.0 | 539 ms | r_fpsca03u2o_ |

Qwen3-32B-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 34.41 | 95.12 | 1,156 ms | r_anmmc80-aoq |

Qwen2.5-32B-Instruct-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 34.60 | 141.9 | 923 ms | r_njgxtgyym1e |

gemma-2-9b-it-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 89.45 | 170.7 | 691 ms | r_iz137eqvuzy |

Qwen2.5-14B-Instruct-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 70.85 | 301.7 | 434 ms | r_v4bq1sviz4o |

phi-4-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 74.37 | 255.0 | 443 ms | r_sqzp-0rdez- |

Llama-3.1-8B-Instruct-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 130.2 | 400.3 | 340 ms | r_v2pbc0rq2l4 |

Qwen2.5-7B-Instruct-4bit

| Workload | Backend | Quant | Decode tok/s | Prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | 4bit | 139.6 | 190.0 | 689 ms | r_5r6rhiynenc |

Community folklore on M3 Ultra (60-core GPU)

52 unverified claims extracted from Reddit/HN comments. These carry lower trust than the signed runs above; every row links to its source.
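These claims are pulled from free-form comment text, so throughput figures are matched with a simple pattern rather than parsed from structured data. A hypothetical extractor in that spirit (the regex and function name are illustrative, not the site's actual pipeline):

```python
import re

# Match figures like "18.50 token/s", "65 tok/s", or "33-38 tok/s";
# for a range we keep the upper bound.
CLAIM_RE = re.compile(r"(?:(\d+(?:\.\d+)?)\s*-\s*)?(\d+(?:\.\d+)?)\s*tok(?:en)?/s")

def extract_tok_s(comment: str) -> list[float]:
    """Return every tok/s figure claimed in a comment, ranges collapsed to their max."""
    return [float(hi) for _lo, hi in CLAIM_RE.findall(comment)]

print(extract_tok_s("Decode: 65 tok/s - Prefill: 1090-1440 tok/s"))
# [65.0, 1440.0]
```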

  • community · confidence 70%

    18.50 tok/s · Qwen3-Coder-480B-A35B-Instruct-MLX-6bit on M3 Ultra via mlx

    m3 ultra with lmstudio-community/Qwen3-Coder-480B-A35B-Instruct-MLX-6bit 256k context len 18.50 token/s 1000 tokens first token 12.44 s

    source: Reddit · u/EnvironmentalMath660 · 2025-07-24

  • community · confidence 70%

    38.00 tok/s · MiniMax-M2.5-4bit (229B MoE) on M3 Ultra via vllm

    : - Decode: 65 tok/s - Prefill: 1090-1440 tok/s - TTFT (cache hit, 33K context): 0.3s MiniMax-M2.5-4bit (229B MoE): - Decode: 33-38 tok/s - Deep reasoning with tool calling I built this to run OpenClaw locally on my Mac instead of paying for cloud APIs. Qwen3-Code…

    source: HN · u/raullen · 2026-02-26

  • community · confidence 70%

    65.00 tok/s · Qwen3-Coder-Next on M3 Ultra via vllm

    upstream had minimal test coverage. Benchmarks (Mac Studio M3 Ultra, 256GB): Qwen3-Coder-Next-6bit (80B MoE, 3B active): - Decode: 65 tok/s - Prefill: 1090-1440 tok/s - TTFT (cache hit, 33K context): 0.3s MiniMax-M2.5-4bit (229B MoE): - Decode: 33-38 tok/s - De…

    source: HN · u/raullen · 2026-02-26


See all 52 claims for M3 Ultra (60-core GPU)

Models measured on M3 Ultra (60-core GPU)

Common questions about M3 Ultra (60-core GPU)

Direct Q&A drawn from the runs above: fastest LLM, supported model classes, backend rankings, quantization guidance.

Read the M3 Ultra (60-core GPU) FAQ →