RTX 3090 Ti — LLM benchmarks
No RTX 3090 Ti benchmarks yet.
Run llm-speed on your own hardware to populate this page:
$ pipx install llm-speed && llm-speed bench
Community folklore on RTX 3090 Ti
Unverified claims extracted from Reddit and Hacker News comments. These carry lower trust than signed benchmark runs; every row links to its source.
- 30.00 tok/s — Qwen-3 Coder 32B via Ollama, FP16 (the source ran a dual RTX 5090 + RTX 3090 Ti setup) · community confidence 75%
  “on: **FP16** * Hardware: **RTX 5090 + RTX 3090 Ti** * Task: code generation **Results:** * **llama.cpp:** ~52 tokens/sec * **Ollama:** ~30 tokens/sec Both runs use the same model weights and hardware. The gap is ~70% in favor of llama.cpp. Has anyone dug into why this happ…”
- 52.00 tok/s — Qwen-3 Coder 32B via llama.cpp, FP16 (same dual RTX 5090 + RTX 3090 Ti setup; see the reproduction sketch after this list) · community confidence 75%
  “**Qwen-3 Coder 32B** * Precision: **FP16** * Hardware: **RTX 5090 + RTX 3090 Ti** * Task: code generation **Results:** * **llama.cpp:** ~52 tokens/sec * **Ollama:** ~30 tokens/sec Both runs use the same model weights and hardware. The gap is ~70% in favor of llama.cpp. Has…”
- 55.00 tok/s — Llama 3.1 Instruct Q8 (15 GB) via LM Studio (source reports 51–55 tok/s, with the model split across two GPUs) · community confidence 75%
  “s on both of the GPUs, so for example Mistral Nemo 13B instruct in FP16 and get 21.5 tokens/sec. Llama 3.1 Instruct Q8 15Gb results in 51 - 55 tokens per second. I also tried running Llama 3.1 Q5 KS, 49GB model which worked but was extremely slow due to some of the model going to…”
- 21.50 tok/s — Mistral Nemo 13B Instruct via LM Studio, FP16 (model split across two GPUs per the source) · community confidence 75%
  “AI and LLM models, now I can completely offload latest models on both of the GPUs, so for example Mistral Nemo 13B instruct in FP16 and get 21.5 tokens/sec. Llama 3.1 Instruct Q8 15Gb results in 51 - 55 tokens per second. I also tried running Llama 3.1 Q5 KS, 49GB model which wor…”
- 46.00 tok/s — Llama 3 8B (16-bit, unquantized) via Ollama on RTX 3090 Ti · community confidence 60%
  “del locally. I have since tried both mlc-llm as well as ollama (based on llama.cpp). mlc-llm is slightly faster (~51 tok/s) vs ollama (~46 tok/s) for running the 16 bit unquantized version of Llama 3 8B on my RTX 3090 Ti. However, mlc-llm uses about 2GB of VRAM more than the …”
- 51.00 tok/s — Llama 3 8B (16-bit, unquantized) via mlc-llm on RTX 3090 Ti · community confidence 60%
  “s the way to run the model locally. I have since tried both mlc-llm as well as ollama (based on llama.cpp). mlc-llm is slightly faster (~51 tok/s) vs ollama (~46 tok/s) for running the 16 bit unquantized version of Llama 3 8B on my RTX 3090 Ti. However, mlc-llm uses about 2GB…”
- ≈2,800 tok/s generated (alongside ≈41,500 tok/s prompt intake) — Llama 3 8B W8A8 via vLLM 0.8.3 on 2× RTX 3090 Ti; this is aggregate batch throughput, not single-stream chat speed (see the vLLM sketch after this list) · community confidence 60%
  “10 mins I can go through 24.9M prompt tokens and 1.69M generated tokens on 2x rtx 3090 Ti and Llama 3 8B W8A8 in vLLM 0.8.3 V1, so about 41,500 t/s input and 2800 t/s output, both at the same time. I went through like 8B+ tokens on single card this way before I put in the second …”
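Several rows above compare decode speed across backends on the same weights. Both llama.cpp and Ollama report this number directly, so claims like the ~52 vs ~30 tok/s gap are cheap to re-test. A minimal sketch, assuming a local GGUF file and an Ollama model tag (both are placeholders, not taken from the original comments):

```bash
# llama.cpp: llama-bench reports prompt-processing (pp) and token-generation (tg)
# throughput. -ngl 99 offloads all layers to the GPU. A 32B FP16 model will not
# fit in 24 GB, so this assumes a multi-GPU build or a quantized GGUF.
llama-bench -m ./qwen3-coder-32b-f16.gguf -ngl 99 -p 512 -n 128

# Ollama: --verbose prints timing stats after the reply, including
# "eval rate" in tokens/sec. The model tag is a placeholder.
ollama run qwen3-coder --verbose "Write a binary search in Python."
```

Comparing the `tg` figure against the "eval rate" on identical weights reproduces the backend gap the first two rows describe.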
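The vLLM row measures something different: aggregate throughput across many concurrent batched requests, which is why its numbers dwarf single-stream chat speeds. A hedged sketch using the throughput script bundled in the vLLM repository (the Hugging Face model ID and request sizes below are assumptions; the commenter ran a W8A8-quantized Llama 3 8B, so substitute an INT8 checkpoint to match):

```bash
# Fetch vLLM's benchmark scripts (flags below follow the repo's
# benchmark_throughput.py; adjust if your vLLM version differs).
git clone https://github.com/vllm-project/vllm
cd vllm

# Offline batch throughput, tensor-parallel across both GPUs.
# Output is aggregate input/output tokens per second, the same
# units as the 41,500 / 2,800 t/s figures quoted above.
python benchmarks/benchmark_throughput.py \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 2 \
  --input-len 1024 \
  --output-len 128 \
  --num-prompts 512
```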
Common questions about RTX 3090 Ti
Direct Q&A drawn from the claims above: fastest model, supported model classes, backend rankings, and quantization guidance.