# RTX 5090 for Local LLM Inference
The NVIDIA RTX 5090 is the current peak of consumer-grade GPU capability for local LLM inference. At Blue Note Logic, we use the RTX 5090 as the primary inference GPU in our Gilligan.TECH development environment.
## Hardware Specifications

| Specification | RTX 5090 |
|---|---|
| GPU Architecture | Blackwell |
| CUDA Cores | 21,760 |
| VRAM | 32 GB GDDR7 |
| Memory Bandwidth | 1,792 GB/s |
| TDP | 575 W |
| FP16 Performance | ~209 TFLOPS |
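The memory-bandwidth figure matters most for single-stream inference: decoding is typically bandwidth-bound, since every generated token must stream the full set of model weights from VRAM at least once. A minimal back-of-envelope sketch (the ~4.4 GB weight size for a Q4_K_M 7B model is an assumed round number, and the ceiling ignores KV-cache reads, compute, and scheduling overhead, so real throughput lands well below it):

```python
# Rough upper bound on single-stream decode throughput for a
# bandwidth-bound GPU: each token reads all weights once, so the
# ceiling is (memory bandwidth) / (weight footprint).
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# RTX 5090 (1,792 GB/s) with an assumed ~4.4 GB Q4_K_M 7B model:
ceiling = max_tokens_per_sec(1792, 4.4)
print(f"~{ceiling:.0f} tok/s theoretical ceiling")
```

The measured numbers below sit far under this ceiling, as expected once attention, KV-cache traffic, and launch overhead are accounted for.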
## Inference Benchmarks (7B Models)

| Model / Quant | Tokens/sec | VRAM Used | Time to First Token |
|---|---|---|---|
| Qwen 2.5 7B Q4_K_M | 52 tok/s | 5.9 GB | 0.8 s |
| dobetter-norge-v2 Q5_K_M | 45 tok/s | 6.6 GB | 0.9 s |
| Qwen 2.5 7B Q8_0 | 33 tok/s | 9.2 GB | 1.1 s |
| Llama 3.1 8B Q5_K_M | 42 tok/s | 7.1 GB | 1.0 s |
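Reproducing tokens/sec figures like these only needs a wall-clock timer around a generation call. A minimal, backend-agnostic sketch (the callable interface is our own convention, not a library API):

```python
import time

def measure_tps(generate, prompt: str, n_tokens: int) -> float:
    """Time one generation call and return tokens per second.

    `generate(prompt, n_tokens)` is any callable that emits exactly
    `n_tokens` tokens -- e.g. a wrapper around a local llama.cpp model.
    """
    start = time.perf_counter()
    generate(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

With llama-cpp-python, for example (model path hypothetical; `n_gpu_layers=-1` offloads all layers to the GPU), something like `llm = Llama(model_path="qwen2.5-7b-q4_k_m.gguf", n_gpu_layers=-1)` followed by `measure_tps(lambda p, n: llm(p, max_tokens=n), "Write a haiku.", 128)` gives a comparable number; average several runs after a warm-up call.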
## Inference Benchmarks (13B–70B Models)

| Model / Quant | Tokens/sec | VRAM Used | Notes |
|---|---|---|---|
| Qwen 2.5 14B Q4_K_M | 28 tok/s | 9.8 GB | Fits comfortably |
| Llama 3.1 70B Q2_K | 8 tok/s | 28.1 GB | Near VRAM limit |
| Llama 3.1 70B Q4_K_M | — | >32 GB | Does not fit |
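The "does not fit" row can be predicted from a back-of-envelope estimate: parameter count times average bits per weight, divided by eight. The bits-per-weight values used below are approximate averages for llama.cpp's mixed-precision K-quants, not exact format widths:

```python
# Back-of-envelope weight footprint for a quantized model.
# params in billions * (bits per weight / 8) = gigabytes of weights.
def quantized_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    # weights only; the runtime adds KV cache and CUDA context on top
    return params_billions * bits_per_weight / 8

print(quantized_weight_gb(7, 4.8))    # 7B  @ ~4.8 bpw: ~4.2 GB
print(quantized_weight_gb(70, 4.8))   # 70B @ ~4.8 bpw: ~42 GB, over 32 GB
print(quantized_weight_gb(70, 2.6))   # 70B @ ~2.6 bpw: ~23 GB, fits
```

The gap between these weight-only estimates and the measured "VRAM Used" column is the KV cache and runtime overhead, which grow with context length.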
## Recommendations

- 7B models: Any quantization fits easily. Q5_K_M is the sweet spot.
- 13B–14B models: Q4_K_M or Q5_K_M fit comfortably with good performance.
- 70B models: Only Q2_K/Q3_K quantizations fit in 32 GB of VRAM. Consider cloud GPUs or a multi-GPU setup for higher-precision quantizations.
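These recommendations can be sketched as a small helper that picks the highest-precision quant whose estimated weight footprint, plus a flat allowance for KV cache and context, fits the VRAM budget. This is a hypothetical illustration, not part of any library; the bits-per-weight figures are approximate averages for the llama.cpp formats, and the 2 GB overhead allowance is an assumption:

```python
# Approximate average bits per weight, ordered highest precision first.
QUANT_BITS = {
    "Q8_0":   8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K":   2.6,
}

def pick_quant(params_billions: float, vram_gb: float,
               overhead_gb: float = 2.0):
    """Return the first (highest-precision) quant that fits, else None."""
    for name, bits in QUANT_BITS.items():
        if params_billions * bits / 8 + overhead_gb <= vram_gb:
            return name
    return None  # nothing fits: offload layers, shard, or use the cloud

print(pick_quant(7, 32))   # -> Q8_0
print(pick_quant(70, 32))  # -> Q2_K
```

For the 32 GB budget of the RTX 5090, this reproduces the table above: a 7B model runs at full Q8_0 precision, while a 70B model drops all the way to Q2_K.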