RTX 5090 (32GB) — LLM benchmarks
6 workload results across 1 model.
Fastest known config on RTX 5090 (32GB)
69.9 decode tok/s
Qwen3.6-27B-Q4_K_M.gguf via llama.cpp — see full run
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | llama.cpp | — | 67.28 | no data | 553 | r_1pww-w7p8sd |
| chat-short | llama.cpp | — | 69.89 | no data | 3,995 | r_bqsunbd6xa8 |
| chat-short | llama.cpp | — | 47.75 | no data | 2,833 | r_kj4fh_mmzj9 |
| chat-short | llama.cpp | — | 45.92 | no data | 3,089 | r_4u7250hj28o |
| chat-short | llama.cpp | — | 39.61 | no data | 227 | r__b89kg2iica |
| chat-short | llama.cpp | — | 66.28 | no data | 353 | r_79bwm4mq_4l |
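The table's columns follow the conventional serving-metric definitions: TTFT is the time from request submission to the first output token, and decode throughput counts tokens emitted after the first one over the time taken to emit them. The site does not publish its exact formulas, so this is a sketch of the usual ones; all names here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class GenTrace:
    """Timestamps (seconds) collected around one generation request."""
    request_start: float  # when the prompt was submitted
    first_token: float    # when the first output token arrived
    last_token: float     # when the final output token arrived
    output_tokens: int    # number of tokens generated

def ttft_ms(t: GenTrace) -> float:
    """Time to first token, in milliseconds (prompt-processing latency)."""
    return (t.first_token - t.request_start) * 1000.0

def decode_tok_s(t: GenTrace) -> float:
    """Decode throughput: tokens after the first, per second of decoding."""
    span = t.last_token - t.first_token
    return (t.output_tokens - 1) / span if span > 0 else float("nan")

# Example: 256 tokens, first token after 0.55 s, generation done at 4.2 s.
trace = GenTrace(request_start=0.0, first_token=0.55, last_token=4.2, output_tokens=256)
print(round(ttft_ms(trace)))          # 550
print(round(decode_tok_s(trace), 1))  # 69.9
```

Note that averaging decode tok/s across runs with very different TTFTs (553 ms vs 3,995 ms above) mixes two separate effects; the two metrics are best read independently.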
Community folklore on RTX 5090 (32GB)
134 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.
- community confidence: 75%
10.00 tok/s — Qwen3-Coder-Next on RTX 5090 via llama.cpp Q4_K_S
“Hey all, Just a quick one in case it saves someone else a headache. I was getting really poor throughput (~10 tok/sec) with Qwen3-Coder-Next-Q4_K_S.gguf on llama.cpp, like ‘this can’t be right’ levels, and eventually found a set of args that fix…”
- community confidence: 75%
26.00 tok/s — Qwen3-Coder-Next on RTX 5090 via llama.cpp Q4_K_S
“~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) https://preview.redd.it/9gfytpz5srhg1.png?width=692&format=png&auto=w”
- community confidence: 75%
207.9 tok/s — Qwen3-Coder on RTX 5090 via sglang AWQ
“…Choosing the Framework: RTX 5090 — Qwen3-Coder-30B-A3B-Instruct-AWQ

| Metric | vLLM | SGLang |
|:-|:-|:-|
| Output throughput | **555.82 tok/s** | 207.93 tok/s |
| Mean TTFT | **549 ms** | 1,558 ms |
| Median TPOT | **7.06 ms** | 18.84 ms |

vLLM wins by 2.7x. SGLang is required `--quantization moe_…”
- community confidence: 75%
555.8 tok/s — Qwen3-Coder on RTX 5090 via vLLM AWQ
“…Latency? # 1. Choosing the Framework…” — same source comment as the row above; its table lists 555.82 tok/s as the vLLM output throughput versus 207.93 tok/s for SGLang.
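The folklore rows repeat the same source comments, one row per extracted number, so raw counts overstate the evidence. A quick dedupe-and-spread pass makes the real signal visible; the figures below are the ones quoted above, and the `source_id` values are stand-ins for the linked comments (an assumption, since the site's internal IDs are not shown):

```python
from collections import defaultdict
from statistics import median

# (model, backend, tok_s, source_id); duplicates share a source_id.
claims = [
    ("Qwen3-Coder-Next", "llama.cpp", 10.0, "c1"),
    ("Qwen3-Coder-Next", "llama.cpp", 26.0, "c2"),
    ("Qwen3-Coder-Next", "llama.cpp", 10.0, "c1"),  # repeated extraction
    ("Qwen3-Coder", "sglang", 207.93, "c3"),
    ("Qwen3-Coder", "vllm", 555.82, "c3"),
]

# Deduplicate on the full tuple: repeated extractions of one comment
# must not count as independent evidence.
unique = sorted(set(claims))

# Group the surviving numbers per (model, backend) and summarize the spread.
by_key = defaultdict(list)
for model, backend, tok_s, _src in unique:
    by_key[(model, backend)].append(tok_s)

for key, vals in sorted(by_key.items()):
    print(key, "median:", median(vals), "spread:", max(vals) - min(vals))
```

For the llama.cpp Q4_K_S claims this yields a median of 18.0 tok/s with a 16.0 tok/s spread, which is exactly why these rows carry a confidence tag rather than being merged with the signed runs.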
Models measured on RTX 5090 (32GB)
- Qwen3.6-27B-Q4_K_M.gguf (llama.cpp; all six signed runs above)
Common questions about RTX 5090 (32GB)
Direct Q&A drawn from the runs above: fastest LLM, supported model classes, backend rankings, quantization guidance.