
M3 Pro (18-core GPU) — LLM benchmarks

11 workload results across 9 models. In the tables below, decode tok/s is generation throughput, prefill tok/s is prompt-processing throughput, and TTFT is time to first token.

Fastest known config on M3 Pro (18-core GPU)

286.5 decode tok/s

Qwen2.5-0.5B-Instruct-4bit via mlx (4bit); full run: r_akcbpx5vcqa (table below).
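For context on where figures like TTFT and decode tok/s come from, here is a minimal measurement sketch using mlx-lm's streaming API. The checkpoint path, prompt, and token budget are illustrative assumptions, not details taken from this page, and the site's actual harness may differ.

```python
# Minimal sketch: time one short-chat generation with mlx-lm.
# Assumed (not from this page): checkpoint name, prompt, max_tokens.
import time

from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Qwen2.5-0.5B-Instruct-4bit")

# Build a chat-formatted prompt string for the instruct model.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize what a KV cache does."}],
    tokenize=False,
    add_generation_prompt=True,
)

start = time.perf_counter()
first = None
n_tokens = 0
# Recent mlx-lm versions yield once per generated token.
for _ in stream_generate(model, tokenizer, prompt, max_tokens=256):
    if first is None:
        first = time.perf_counter()  # first token back marks TTFT
    n_tokens += 1
end = time.perf_counter()

print(f"TTFT:   {(first - start) * 1000:.0f} ms")
print(f"decode: {(n_tokens - 1) / (end - first):.1f} tok/s")  # generation throughput
```

Prefill throughput can be estimated the same way: prompt tokens divided by the time to the first token.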

Qwen3-32B-4bit

| Workload | Backend | Quant | decode (tok/s) | prefill (tok/s) | TTFT (ms) | Run |
| --- | --- | --- | --- | --- | --- | --- |
| chat-short | mlx@0.31.3 | 4bit | 7.16 | 25.65 | 4,288 | r_pnrrpcdqfo4 |

Qwen2.5-32B-Instruct-4bit

| Workload | Backend | Quant | decode (tok/s) | prefill (tok/s) | TTFT (ms) | Run |
| --- | --- | --- | --- | --- | --- | --- |
| chat-short | mlx@0.31.3 | 4bit | 7.05 | 33.41 | 3,921 | r_f4x3xaan2ay |

gemma-2-9b-it-4bit

| Workload | Backend | Quant | decode (tok/s) | prefill (tok/s) | TTFT (ms) | Run |
| --- | --- | --- | --- | --- | --- | --- |
| chat-short | mlx@0.31.3 | 4bit | 21.97 | 102.5 | 1,151 | r_q7t7b7dcuz5 |

Qwen2.5-14B-Instruct-4bit

| Workload | Backend | Quant | decode (tok/s) | prefill (tok/s) | TTFT (ms) | Run |
| --- | --- | --- | --- | --- | --- | --- |
| chat-short | mlx@0.31.3 | 4bit | 14.41 | 117.9 | 1,111 | r_ia73dzeue0b |

phi-4-4bit

| Workload | Backend | Quant | decode (tok/s) | prefill (tok/s) | TTFT (ms) | Run |
| --- | --- | --- | --- | --- | --- | --- |
| chat-short | mlx@0.31.3 | 4bit | 15.61 | 120.7 | 936 | r_-w8hnn61va_ |

Llama-3.1-8B-Instruct-4bit

| Workload | Backend | Quant | decode (tok/s) | prefill (tok/s) | TTFT (ms) | Run |
| --- | --- | --- | --- | --- | --- | --- |
| chat-short | mlx@0.31.3 | 4bit | 29.20 | 203.3 | 669 | r_h0-use1ypnb |

Qwen2.5-7B-Instruct-4bit

| Workload | Backend | Quant | decode (tok/s) | prefill (tok/s) | TTFT (ms) | Run |
| --- | --- | --- | --- | --- | --- | --- |
| chat-short | mlx@0.31.3 | 4bit | 30.52 | 161.9 | 809 | r_llzv_g-ymaf |
| chat-short | mlx@0.31.3 | 4bit | 15.67 | 54.34 | 2,411 | r_e3t93rscswq |

stable-code-instruct-3b-4bit

| Workload | Backend | Quant | decode (tok/s) | prefill (tok/s) | TTFT (ms) | Run |
| --- | --- | --- | --- | --- | --- | --- |
| chat-short | mlx@0.31.3 | 4bit | 19.37 | 131.3 | 967 | r_pqjsvd-cub4 |

Qwen2.5-0.5B-Instruct-4bit

| Workload | Backend | Quant | decode (tok/s) | prefill (tok/s) | TTFT (ms) | Run |
| --- | --- | --- | --- | --- | --- | --- |
| chat-short | mlx@0.31.3 | 4bit | 286.5 | 656.2 | 200 | r_akcbpx5vcqa |
| chat-short | mlx@0.31.3 | 4bit | 282.6 | 668.8 | 196 | r_bftqtkilvoe |

Models measured on M3 Pro (18-core GPU)

Qwen3-32B-4bit, Qwen2.5-32B-Instruct-4bit, gemma-2-9b-it-4bit, Qwen2.5-14B-Instruct-4bit, phi-4-4bit, Llama-3.1-8B-Instruct-4bit, Qwen2.5-7B-Instruct-4bit, stable-code-instruct-3b-4bit, Qwen2.5-0.5B-Instruct-4bit

Common questions about M3 Pro (18-core GPU)

Direct Q&A drawn from the runs above, covering the fastest LLM, supported model classes, backend rankings, and quantization guidance.

Read the M3 Pro (18-core GPU) FAQ →