gpt-oss-20b-MXFP4-Q4
1 workload result across 1 hardware configuration.
Fastest local config
152.7 decode tok/s
on M3 Ultra (60-core GPU) + 96GB unified via mlx — see full run
Local runs (1 run)
Runs from contributors' own machines via MLX, llama.cpp, vLLM, exllamav2, or ollama. Signed on the submitter's hardware.
M3 Ultra (60-core GPU) + 96GB unified
| Workload | Backend | Quant | decode tok/s | prefill tok/s | TTFT | Run |
|---|---|---|---|---|---|---|
| chat-short | mlx@0.31.3 | — | 152.7 tok/s | 239.9 tok/s | 692 ms | r_3ijun8ltjnb |
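The prefill and decode figures above are the numbers mlx-lm reports when run with verbose output. A minimal sketch of how a comparable measurement could be taken on Apple Silicon, assuming a recent `mlx-lm` install; the model repo name is a placeholder, not necessarily the exact checkpoint used in the signed run:

```python
# Minimal sketch (assumes `pip install mlx-lm` on Apple Silicon).
from mlx_lm import load, generate

# Placeholder repo name — substitute the MXFP4-Q4 checkpoint you actually want to test.
model, tokenizer = load("mlx-community/gpt-oss-20b-MXFP4-Q4")

prompt = "Summarize the trade-offs of MXFP4 quantization in two sentences."

# verbose=True prints prompt (prefill) and generation (decode) tokens-per-sec,
# which is roughly what the table above reports for a chat-short workload.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)
```

A real chat workload would normally pass the prompt through `tokenizer.apply_chat_template` first; the raw-prompt form above is just the shortest path to a throughput readout.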
Community folklore
52 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.
- community · confidence 60%
221.0 tok/s — GPT-OSS-20B on RTX 5090 via lm-studio
“GPT-OSS-20B on RTX 5090 – 221 tok/s in LM Studio (default settings + FlashAttention) Just tested **GPT-OSS-20B** locally using **LM Studio v0.3.21-b4** on my machine with an”
- community · confidence 60%
72.0 tok/s — GPT OSS 20B on M4 Pro via mlx
“edge frozen as of 2024-06, handled rescaling and converting a recipe into metric that Qwen 3 30B 3AB 2507 completely hosed, churns at about 72 tok/s on a MacBook M4 Pro Max 4 bit MLX. And it ALMOST got a side scrolling shooter working, whereas the 120B didn't, and Qwen 3 didn't …”
- community · confidence 60%
103.0 tok/s — gpt-oss-20b on M4 Max via mlx
“/gpt-oss-20b-GGUF]() (thanks [unsloth]() 🙌) The qualities are roughly the same for our use case, but performance: On an M4 Max, MLX hit \~103 tok/s, about 25% faster than GGUF. # Quick tip You can paste any Hugging Face repo name into the CLI and pull it directly: nexa …”
- community · confidence 60%
180.0 tok/s — gpt oss 20b on Radeon RX 9070 XT via llama.cpp (note: the quoted comment attributes 180 tok/s to LFM and 140 tok/s to gpt-oss)
“hanics) gpt OSs 20v won by a metric mile. It did run faster than gpt OSs though on llama.cpp (9070xt) with gpt OSs at 140 tok/s and lfm at 180 tok/s”
- community · confidence 60%
140.0 tok/s — gpt oss 20b on Radeon RX 9070 XT via llama.cpp
“cs and structural mechanics) gpt OSs 20v won by a metric mile. It did run faster than gpt OSs though on llama.cpp (9070xt) with gpt OSs at 140 tok/s and lfm at 180 tok/s”
- community · confidence 60%
120.0 tok/s — gpt-oss-20b on RTX 3090 via llama.cpp
“i haven't run 120b, but ran gpt-oss-20b q4/q8 on a 3090 nd saw ~70–120 tokens/sec depending on threads, context length and quant. 395 will fit bigger layouts but limited memory bandwidth often cuts sustained tps, so repro”
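For rough comparisons between the signed run and the folklore claims, end-to-end latency is approximately TTFT plus output tokens divided by the decode rate. A back-of-envelope sketch, using values from the table above; the folklore claims do not report TTFT, so reusing the same 692 ms there is an assumption:

```python
# Back-of-envelope: end-to-end latency ≈ TTFT + output_tokens / decode_rate.
def end_to_end_seconds(ttft_ms: float, decode_tok_s: float, output_tokens: int) -> float:
    return ttft_ms / 1000.0 + output_tokens / decode_tok_s

# Signed M3 Ultra run from the table above: 692 ms TTFT, 152.7 decode tok/s.
print(f"{end_to_end_seconds(692, 152.7, 256):.2f} s for 256 tokens")  # ~2.37 s

# A folklore-style 221 tok/s decode rate with the same (assumed) TTFT.
print(f"{end_to_end_seconds(692, 221.0, 256):.2f} s for 256 tokens")  # ~1.85 s
```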