llm-speed

RTX 5090 (32GB) — LLM benchmarks

6 workload results across 1 model.

Fastest known config on RTX 5090 (32GB)

69.9 decode tok/s

Qwen3.6-27B-Q4_K_M.gguf via llama.cpp (see full run)

Qwen3.6-27B-Q4_K_M.gguf

| Workload   | Backend   | Quant | Decode tok/s | Prefill tok/s | TTFT     | Run           |
|------------|-----------|-------|--------------|---------------|----------|---------------|
| chat-short | llama.cpp | n/a   | 67.28        | no data       | 553 ms   | r_1pww-w7p8sd |
| chat-short | llama.cpp | n/a   | 69.89        | no data       | 3,995 ms | r_bqsunbd6xa8 |
| chat-short | llama.cpp | n/a   | 47.75        | no data       | 2,833 ms | r_kj4fh_mmzj9 |
| chat-short | llama.cpp | n/a   | 45.92        | no data       | 3,089 ms | r_4u7250hj28o |
| chat-short | llama.cpp | n/a   | 39.61        | no data       | 227 ms   | r__b89kg2iica |
| chat-short | llama.cpp | n/a   | 66.28        | no data       | 353 ms   | r_79bwm4mq_4l |
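The fastest-decode row is not automatically the fastest overall: the row with the highest decode rate (69.89 tok/s) also has the worst TTFT (3,995 ms), while the lowest-TTFT row (227 ms) decodes at only 39.61 tok/s. A rough sketch of the trade-off, using only numbers from the table and the usual approximation that total response time is TTFT plus steady-state decode time:

```python
def total_latency_ms(ttft_ms: float, decode_tok_s: float, output_tokens: int) -> float:
    """Approximate end-to-end response time: TTFT covers the first token,
    then the remaining tokens arrive at the steady decode rate."""
    return ttft_ms + (output_tokens - 1) / decode_tok_s * 1000.0

# Fastest-decode run (69.89 tok/s, 3,995 ms TTFT) vs. lowest-TTFT run
# (39.61 tok/s, 227 ms TTFT) for a 256-token reply:
fast_decode = total_latency_ms(3995, 69.89, 256)  # ≈ 7644 ms
low_ttft = total_latency_ms(227, 39.61, 256)      # ≈ 6665 ms
```

For short replies like this, the low-TTFT run finishes first despite its much slower decode rate; for long generations the ranking flips.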

Community folklore on RTX 5090 (32GB)

134 unverified claims extracted from Reddit/HN comments. These carry lower trust than the signed runs above; every row links to its source.

  • community · confidence 75%

    10.00 tok/s Qwen3-Coder-Next on RTX 5090 via llama.cpp Q4_K_S

    “Hey all, Just a quick one in case it saves someone else a headache. I was getting really poor throughput (~10 tok/sec) with Qwen3-Coder-Next-Q4_K_S.gguf on llama.cpp, like ‘this can’t be right’ levels, and eventually found a set of args that fix…”

    source: Reddit · u/Spiritual_Tie_5574 · 2026-02-06

  • community · confidence 75%

    26.00 tok/s Qwen3-Coder-Next on RTX 5090 via llama.cpp Q4_K_S

    “~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp)” https://preview.redd.it/9gfytpz5srhg1.png?width=692&format=png&auto=w

    source: Reddit · u/Spiritual_Tie_5574 · 2026-02-06


  • community · confidence 75%

    207.9 tok/s Qwen3-Coder on RTX 5090 via sglang AWQ

    “…Choosing the Framework: RTX 5090 — Qwen3-Coder-30B-A3B-Instruct-AWQ

    | Metric            | vLLM            | SGLang       |
    | Output throughput | **555.82 tok/s** | 207.93 tok/s |
    | Mean TTFT         | **549 ms**       | 1,558 ms     |
    | Median TPOT       | **7.06 ms**      | 18.84 ms     |

    vLLM wins by 2.7x. SGLang requires `--quantization moe_…”

    source: Reddit · u/NoVibeCoding · 2026-03-06

  • community · confidence 75%

    555.8 tok/s Qwen3-Coder on RTX 5090 via vLLM AWQ

    “…Latency? 1. Choosing the Framework: RTX 5090 — Qwen3-Coder-30B-A3B-Instruct-AWQ

    | Metric            | vLLM            | SGLang       |
    | Output throughput | **555.82 tok/s** | 207.93 tok/s |
    | Mean TTFT         | **549 ms**       | 1,558 ms     |
    | Median TPOT       | **7.06 ms**      | 18.84 ms     |

    vLLM wins by 2.7x. SGLang requires `--qu…”

    source: Reddit · u/NoVibeCoding · 2026-03-06
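Note that the 555.82 tok/s figure is aggregate throughput across concurrent requests, not single-stream speed. The quoted TPOT (time per output token) lets you back out the per-stream rate and the implied concurrency; a rough back-of-envelope using only numbers from the quote:

```python
# Figures from the quoted vLLM column (assumed: TPOT is per-stream
# time per output token, throughput is summed over all streams).
vllm_tpot_ms = 7.06        # median time per output token, ms
vllm_throughput = 555.82   # aggregate output tok/s

per_stream_tok_s = 1000.0 / vllm_tpot_ms                 # ≈ 142 tok/s per request
implied_concurrency = vllm_throughput / per_stream_tok_s  # ≈ 4 concurrent streams
```

So the quoted benchmark likely ran roughly four requests in parallel; a single chat session on the same setup would see something closer to the per-stream rate.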


See all 134 claims for RTX 5090 (32GB)

Models measured on RTX 5090 (32GB)

Common questions about RTX 5090 (32GB)

Direct Q&A drawn from the runs above: fastest LLM, supported model classes, backend rankings, quantization guidance.

Read the RTX 5090 (32GB) FAQ →