llm-speed

gpt-oss-20b-MXFP4-Q4

1 workload result across 1 hardware configuration.

Fastest local config

152.7 decode tok/s

on M3 Ultra (60-core GPU) + 96GB unified via mlx (see full run)

Local runs (1 run)

Runs from contributors' own machines via MLX, llama.cpp, vLLM, exllamav2, or ollama. Signed on the submitter's hardware.
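For context on how the speed columns below are typically derived from a single streaming generation call (TTFT, prefill tok/s, decode tok/s), here is a minimal, backend-agnostic sketch. The fake_stream_generate stand-in and its sleep timings are placeholders, not the leaderboard's actual harness; any of the backends above could be swapped in as the stream source.

    import time
    from typing import Iterator

    def fake_stream_generate(prompt_tokens: int, max_new_tokens: int) -> Iterator[str]:
        """Stand-in for a real backend's streaming generate (mlx, llama.cpp, vLLM, ...).
        Sleeps to mimic prefill latency and per-token decode latency."""
        time.sleep(0.5)                      # pretend prefill of the whole prompt
        for _ in range(max_new_tokens):
            time.sleep(0.005)                # pretend per-token decode work
            yield "tok"

    def benchmark(stream, prompt_tokens: int, max_new_tokens: int) -> dict:
        """Derive TTFT, prefill tok/s and decode tok/s from a token stream."""
        start = time.perf_counter()
        first_token_at = None
        generated = 0
        for _ in stream(prompt_tokens, max_new_tokens):
            now = time.perf_counter()
            if first_token_at is None:
                first_token_at = now         # time to first token: prefill + first decode step
            generated += 1
        end = time.perf_counter()

        ttft = first_token_at - start
        decode_time = end - first_token_at
        return {
            "ttft_ms": round(ttft * 1000, 1),
            "prefill_tok_s": round(prompt_tokens / ttft, 1),   # approximation: TTFT ~ prefill time
            "decode_tok_s": round((generated - 1) / decode_time, 1) if generated > 1 else None,
        }

    if __name__ == "__main__":
        print(benchmark(fake_stream_generate, prompt_tokens=166, max_new_tokens=128))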

M3 Ultra (60-core GPU) + 96GB unified

Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT | Run
chat-short | mlx@0.31.3 | | 152.7 | 239.9 | 692 ms | r_3ijun8ltjnb
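As a rough consistency check on how these columns relate (a sketch that assumes TTFT is dominated by prefill and ignores sampling and setup overhead; the 256-token reply length is hypothetical):

    # Back-of-envelope relations between the columns above
    # (assumes TTFT ~ prefill time; ignores sampling and setup overhead).
    prefill_tok_s = 239.9
    decode_tok_s = 152.7
    ttft_s = 0.692

    prompt_tokens = prefill_tok_s * ttft_s        # ~166 prompt tokens processed before the first output token
    reply_tokens = 256                            # hypothetical reply length
    decode_time_s = reply_tokens / decode_tok_s   # ~1.68 s of pure decoding
    total_s = ttft_s + decode_time_s              # ~2.37 s end to end for this example

    print(f"prompt ~ {prompt_tokens:.0f} tok, decode ~ {decode_time_s:.2f} s, total ~ {total_s:.2f} s")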

Community folklore

52 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.
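A minimal sketch of how throughput claims like the ones below can be mined from comment text. The regex heuristic, the flat 60% confidence, and the field names are illustrative assumptions, not the site's actual extraction pipeline.

    import re

    # Matches figures like "221 tok/s", "~103 tok/s", "70–120 tokens/sec" in free-form comments.
    TOKS_RE = re.compile(
        r"~?\s*(\d+(?:\.\d+)?)(?:\s*[–-]\s*(\d+(?:\.\d+)?))?\s*(?:tok|tokens?)\s*/?\s*(?:s|sec)",
        re.IGNORECASE,
    )

    def extract_claims(comment: str, source: str) -> list[dict]:
        claims = []
        for m in TOKS_RE.finditer(comment):
            low = float(m.group(1))
            high = float(m.group(2)) if m.group(2) else low
            claims.append({
                "tok_s": high,              # report the upper end of a quoted range
                "confidence": 0.6,          # unverified community report
                "quote": comment[max(0, m.start() - 40): m.end() + 40],
                "source": source,
            })
        return claims

    print(extract_claims(
        "ran gpt-oss-20b q4/q8 on a 3090 and saw ~70–120 tokens/sec depending on threads",
        "Reddit · u/SimpleMundane5291",
    ))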

  • community · confidence 60%

    221.0 tok/s GPT-OSS-20B on RTX 5090 via lm-studio

    GPT-OSS-20B on RTX 5090 – 221 tok/s in LM Studio (default settings + FlashAttention) Just tested **GPT-OSS-20B** locally using **LM Studio v0.3.21-b4** on my machine with an

    source: Reddit · u/Spiritual_Tie_5574 · 2025-08-05

  • community · confidence 60%

    72.0 tok/s GPT OSS 20B on M4 Pro via mlx

    edge frozen as of 2024-06, handled rescaling and converting a recipe into metric that Qwen 3 30B 3AB 2507 completely hosed, churns at about 72 tok/s on a MacBook M4 Pro Max 4 bit MLX. And it ALMOST got a side scrolling shooter working, whereas the 120B didn't, and Qwen 3 didn't …

    source: Reddit · u/cspenn · 2025-08-05

  • community · confidence 60%

    103.0 tok/s gpt-oss-20b on M4 Max via mlx

    /gpt-oss-20b-GGUF]() (thanks [unsloth]() 🙌) The qualities are roughly the same for our use case, but performance: On an M4 Max, MLX hit \~103 tok/s, about 25% faster than GGUF. # Quick tip You can paste any Hugging Face repo name into the CLI and pull it directly: nexa …

    source: Reddit · u/AlanzhuLy · 2025-08-09


  • community · confidence 60%

    180.0 tok/s gpt oss 20b on Radeon RX 9070 XT via llama.cpp

    hanics) gpt OSs 20v won by a metric mile. It did run faster than gpt OSs though on llama.cpp (9070xt) with gpt OSs at 140 tok/s and lfm at 180 tok/s

    source: Reddit · u/ConversationOver9445 · 2026-02-25

  • community · confidence 60%

    140.0 tok/s gpt oss 20b on Radeon RX 9070 XT via llama.cpp

    cs and structural mechanics) gpt OSs 20v won by a metric mile. It did run faster than gpt OSs though on llama.cpp (9070xt) with gpt OSs at 140 tok/s and lfm at 180 tok/s

    source: Reddit · u/ConversationOver9445 · 2026-02-25


  • community · confidence 60%

    120.0 tok/s gpt-oss-20b on RTX 3090 via llama.cpp

    i haven't run 120b, but ran gpt-oss-20b q4/q8 on a 3090 nd saw ~70–120 tokens/sec depending on threads, context length and quant. 395 will fit bigger layouts but limited memory bandwidth often cuts sustained tps, so repro

    source: Reddit · u/SimpleMundane5291 · 2025-09-07


See all 52 claims for gpt-oss-20b-MXFP4-Q4

gpt-oss-20b-MXFP4-Q4 on hardware