# RTX 5090 for Local LLM Inference
The NVIDIA RTX 5090 is the current peak of consumer-grade GPU capability for local LLM inference. At Blue Note Logic, we use the RTX 5090 as the primary inference GPU in our Gilligan.TECH development environment.
## Hardware Specifications

| Specification | RTX 5090 |
|---|---|
| GPU Architecture | Blackwell |
| CUDA Cores | 21,760 |
| VRAM | 32 GB GDDR7 |
| Memory Bandwidth | 1,792 GB/s |
| TDP | 575 W |
| FP16 Performance | ~209 TFLOPS |
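The memory-bandwidth figure matters most for single-stream inference: decoding is typically bandwidth-bound, since every generated token must stream the full set of model weights from VRAM at least once. A minimal back-of-envelope sketch (the ~4.4 GB weight size for a Q4_K_M 7B model is an assumed round number, and the ceiling ignores KV-cache reads, compute, and scheduling overhead, so real throughput lands well below it):

```python
# Rough upper bound on single-stream decode throughput for a
# bandwidth-bound GPU: each token reads all weights once, so the
# ceiling is (memory bandwidth) / (weight footprint).
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# RTX 5090 (1,792 GB/s) with an assumed ~4.4 GB Q4_K_M 7B model:
ceiling = max_tokens_per_sec(1792, 4.4)
print(f"~{ceiling:.0f} tok/s theoretical ceiling")
```

The measured numbers below sit far under this ceiling, as expected once attention, KV-cache traffic, and launch overhead are accounted for.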
## Inference Benchmarks (7B Models)

| Model / Quant | Tokens/sec | VRAM Used | Time to First Token |
|---|---|---|---|
| Qwen 2.5 7B Q4_K_M | 52 tok/s | 5.9 GB | 0.8 s |
| dobetter-norge-v2 Q5_K_M | 45 tok/s | 6.6 GB | 0.9 s |
| Qwen 2.5 7B Q8_0 | 33 tok/s | 9.2 GB | 1.1 s |
| Llama 3.1 8B Q5_K_M | 42 tok/s | 7.1 GB | 1.0 s |
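Reproducing tokens/sec figures like these only needs a wall-clock timer around a generation call. A minimal, backend-agnostic sketch (the callable interface is our own convention, not a library API):

```python
import time

def measure_tps(generate, prompt: str, n_tokens: int) -> float:
    """Time one generation call and return tokens per second.

    `generate(prompt, n_tokens)` is any callable that emits exactly
    `n_tokens` tokens -- e.g. a wrapper around a local llama.cpp model.
    """
    start = time.perf_counter()
    generate(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

With llama-cpp-python, for example (model path hypothetical; `n_gpu_layers=-1` offloads all layers to the GPU), something like `llm = Llama(model_path="qwen2.5-7b-q4_k_m.gguf", n_gpu_layers=-1)` followed by `measure_tps(lambda p, n: llm(p, max_tokens=n), "Write a haiku.", 128)` gives a comparable number; average several runs after a warm-up call.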
## Inference Benchmarks (13B–70B Models)

| Model / Quant | Tokens/sec | VRAM Used | Notes |
|---|---|---|---|
| Qwen 2.5 14B Q4_K_M | 28 tok/s | 9.8 GB | Fits comfortably |
| Llama 3.1 70B Q2_K | 8 tok/s | 28.1 GB | Near VRAM limit |
| Llama 3.1 70B Q4_K_M | — | >32 GB | Does not fit |
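The "does not fit" row can be predicted from a back-of-envelope estimate: parameter count times average bits per weight, divided by eight. The bits-per-weight values used below are approximate averages for llama.cpp's mixed-precision K-quants, not exact format widths:

```python
# Back-of-envelope weight footprint for a quantized model.
# params in billions * (bits per weight / 8) = gigabytes of weights.
def quantized_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    # weights only; the runtime adds KV cache and CUDA context on top
    return params_billions * bits_per_weight / 8

print(quantized_weight_gb(7, 4.8))    # 7B  @ ~4.8 bpw: ~4.2 GB
print(quantized_weight_gb(70, 4.8))   # 70B @ ~4.8 bpw: ~42 GB, over 32 GB
print(quantized_weight_gb(70, 2.6))   # 70B @ ~2.6 bpw: ~23 GB, fits
```

The gap between these weight-only estimates and the measured "VRAM Used" column is the KV cache and runtime overhead, which grow with context length.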
## Recommendations

- 7B models: Any quantization fits easily. Q5_K_M is the sweet spot.
- 13B–14B models: Q4_K_M or Q5_K_M fit comfortably with good performance.
- 70B models: Only Q2_K/Q3_K quantizations fit in 32 GB of VRAM. Consider cloud GPUs or a multi-GPU setup for higher-precision quantizations.
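These recommendations can be sketched as a small helper that picks the highest-precision quant whose estimated weight footprint, plus a flat allowance for KV cache and context, fits the VRAM budget. This is a hypothetical illustration, not part of any library; the bits-per-weight figures are approximate averages for the llama.cpp formats, and the 2 GB overhead allowance is an assumption:

```python
# Approximate average bits per weight, ordered highest precision first.
QUANT_BITS = {
    "Q8_0":   8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K":   2.6,
}

def pick_quant(params_billions: float, vram_gb: float,
               overhead_gb: float = 2.0):
    """Return the first (highest-precision) quant that fits, else None."""
    for name, bits in QUANT_BITS.items():
        if params_billions * bits / 8 + overhead_gb <= vram_gb:
            return name
    return None  # nothing fits: offload layers, shard, or use the cloud

print(pick_quant(7, 32))   # -> Q8_0
print(pick_quant(70, 32))  # -> Q2_K
```

For the 32 GB budget of the RTX 5090, this reproduces the table above: a 7B model runs at full Q8_0 precision, while a 70B model drops all the way to Q2_K.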