RTX 5090 (32GB) — LLM benchmarks
6 workload results across 1 model.
Fastest known config on RTX 5090 (32GB)
69.9 decode tok/s
Qwen3.6-27B-Q4_K_M.gguf via llama.cpp — see full run
| Workload | Backend | Quant | Decode (tok/s) | Prefill (tok/s) | TTFT (ms) | Run |
|---|---|---|---|---|---|---|
| chat-short | llama.cpp | — | 67.28 | no data | 553 | r_1pww-w7p8sd |
| chat-short | llama.cpp | — | 69.89 | no data | 3,995 | r_bqsunbd6xa8 |
| chat-short | llama.cpp | — | 47.75 | no data | 2,833 | r_kj4fh_mmzj9 |
| chat-short | llama.cpp | — | 45.92 | no data | 3,089 | r_4u7250hj28o |
| chat-short | llama.cpp | — | 39.61 | no data | 227 | r__b89kg2iica |
| chat-short | llama.cpp | — | 66.28 | no data | 353 | r_79bwm4mq_4l |
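The table's columns follow the conventional serving-metric definitions: TTFT is the time from request submission to the first output token, and decode throughput counts tokens emitted after the first one over the time taken to emit them. The site does not publish its exact formulas, so this is a sketch of the usual ones; all names here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class GenTrace:
    """Timestamps (seconds) collected around one generation request."""
    request_start: float  # when the prompt was submitted
    first_token: float    # when the first output token arrived
    last_token: float     # when the final output token arrived
    output_tokens: int    # number of tokens generated

def ttft_ms(t: GenTrace) -> float:
    """Time to first token, in milliseconds (prompt-processing latency)."""
    return (t.first_token - t.request_start) * 1000.0

def decode_tok_s(t: GenTrace) -> float:
    """Decode throughput: tokens after the first, per second of decoding."""
    span = t.last_token - t.first_token
    return (t.output_tokens - 1) / span if span > 0 else float("nan")

# Example: 256 tokens, first token after 0.55 s, generation done at 4.2 s.
trace = GenTrace(request_start=0.0, first_token=0.55, last_token=4.2, output_tokens=256)
print(round(ttft_ms(trace)))          # 550
print(round(decode_tok_s(trace), 1))  # 69.9
```

Note that averaging decode tok/s across runs with very different TTFTs (553 ms vs 3,995 ms above) mixes two separate effects; the two metrics are best read independently.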
Community folklore on RTX 5090 (32GB)
134 unverified claims extracted from Reddit/HN comments. Lower trust than signed runs above — every row links to the source.
- community confidence: 75%
10.00 tok/s — Qwen3-Coder-Next on RTX 5090 via llama.cpp Q4_K_S
“Hey all, Just a quick one in case it saves someone else a headache. I was getting really poor throughput (~10 tok/sec) with Qwen3-Coder-Next-Q4_K_S.gguf on llama.cpp, like ‘this can’t be right’ levels, and eventually found a set of args that fix…”
- community confidence: 75%
26.00 tok/s — Qwen3-Coder-Next on RTX 5090 via llama.cpp Q4_K_S
“~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) https://preview.redd.it/9gfytpz5srhg1.png?width=692&format=png&auto=w”
- community confidence: 75%
207.9 tok/s — Qwen3-Coder on RTX 5090 via sglang AWQ
“…Choosing the Framework: RTX 5090 — Qwen3-Coder-30B-A3B-Instruct-AWQ

| Metric | vLLM | SGLang |
|:-|:-|:-|
| Output throughput | **555.82 tok/s** | 207.93 tok/s |
| Mean TTFT | **549 ms** | 1,558 ms |
| Median TPOT | **7.06 ms** | 18.84 ms |

vLLM wins by 2.7x. SGLang is required `--quantization moe_…”
- community confidence: 75%
555.8 tok/s — Qwen3-Coder on RTX 5090 via vLLM AWQ
“…Latency? # 1. Choosing the Framework…” — same source comment as the row above; its table lists 555.82 tok/s as the vLLM output throughput versus 207.93 tok/s for SGLang.
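The folklore rows repeat the same source comments, one row per extracted number, so raw counts overstate the evidence. A quick dedupe-and-spread pass makes the real signal visible; the figures below are the ones quoted above, and the `source_id` values are stand-ins for the linked comments (an assumption, since the site's internal IDs are not shown):

```python
from collections import defaultdict
from statistics import median

# (model, backend, tok_s, source_id); duplicates share a source_id.
claims = [
    ("Qwen3-Coder-Next", "llama.cpp", 10.0, "c1"),
    ("Qwen3-Coder-Next", "llama.cpp", 26.0, "c2"),
    ("Qwen3-Coder-Next", "llama.cpp", 10.0, "c1"),  # repeated extraction
    ("Qwen3-Coder", "sglang", 207.93, "c3"),
    ("Qwen3-Coder", "vllm", 555.82, "c3"),
]

# Deduplicate on the full tuple: repeated extractions of one comment
# must not count as independent evidence.
unique = sorted(set(claims))

# Group the surviving numbers per (model, backend) and summarize the spread.
by_key = defaultdict(list)
for model, backend, tok_s, _src in unique:
    by_key[(model, backend)].append(tok_s)

for key, vals in sorted(by_key.items()):
    print(key, "median:", median(vals), "spread:", max(vals) - min(vals))
```

For the llama.cpp Q4_K_S claims this yields a median of 18.0 tok/s with a 16.0 tok/s spread, which is exactly why these rows carry a confidence tag rather than being merged with the signed runs.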
Models measured on RTX 5090 (32GB)
- Qwen3.6-27B-Q4_K_M.gguf (llama.cpp; all six signed runs above)
Common questions about RTX 5090 (32GB)
Direct Q&A drawn from the runs above: fastest LLM, supported model classes, backend rankings, quantization guidance.