LLAMA-SERVER | NEURAL INFERENCE

Configuration Module

Select Model Architecture

› Gemma 2 9B

VRAM: 4.5GB | Speed: 50 tok/s

RECOMMENDED

› Qwen2.5 7B

VRAM: 3.5GB | Speed: 55 tok/s

STABLE

› Llama 3.1 8B

VRAM: 4GB | Speed: 50 tok/s

STABLE

› Phi-3 Medium 14B

VRAM: 7GB | Speed: 35 tok/s

ADVANCED

› Mistral 7B v0.3

VRAM: 3.5GB | Speed: 50 tok/s

STABLE

Context Window (Tokens)

Server Port

✓ Optimized for GTX 1650 + 36GB RAM
✓ Zero-crash configuration
✓ Extended chat stability

GPU Layers Offload (-ngl)

Lower values = more CPU, higher values = more GPU

Batch Size

KV Cache Type

Host Memory Cache (MB)

Command & Monitoring

Est. VRAM

4.5GB

Tokens/Sec

Start Command

./llama-server \
  --model gemma2-9b-q4_k_m.gguf \
  --ctx-size 2048 \
  -np 1 \
  --batch-size 512 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -ngl 20 \
  --port 8080
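The start command can also be assembled programmatically, which helps when switching models or layer counts. The sketch below is a minimal illustration: `build_llama_server_cmd` is a hypothetical helper name, the defaults mirror the values shown on this page, and the flags use llama.cpp's short spellings `-np` and `-ngl` (long forms `--parallel` and `--n-gpu-layers`). Whether `--flash-attn` takes an `on` value depends on the llama.cpp build, so treat that argument as an assumption.

```python
import shlex


def build_llama_server_cmd(model_path, ctx_size=2048, parallel=1,
                           batch_size=512, gpu_layers=20,
                           cache_type="q8_0", port=8080):
    """Return the llama-server argument list for the given settings."""
    return [
        "./llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "-np", str(parallel),            # parallel slots
        "--batch-size", str(batch_size),
        "--flash-attn", "on",            # assumed to be accepted by your build
        "--cache-type-k", cache_type,
        "--cache-type-v", cache_type,
        "-ngl", str(gpu_layers),         # layers offloaded to the GPU
        "--port", str(port),
    ]


if __name__ == "__main__":
    cmd = build_llama_server_cmd("gemma2-9b-q4_k_m.gguf")
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

The list form can be passed straight to `subprocess.Popen` without shell quoting concerns.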

Ollama Shortcut

ollama run gemma2:9b

⚙ CRITICAL SETTINGS FOR STABILITY:

Context: 2048 (prevents memory bloat)
Parallel slots: 1 (no fragmentation)
KV quantization: q8_0 (roughly halves cache memory vs. f16)
No Gemma 4 (known crash bug)
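To see why the context size and KV quantization matter, the cache footprint can be estimated from the model shape. This is a rough sketch: the architecture numbers in the example are hypothetical placeholders, not the values of any model listed here, and the per-element sizes assume llama.cpp's block layout (f16 at 2 bytes per value, q8_0 at 34 bytes per block of 32 values).

```python
# Bytes per cached element. q8_0 stores 32 int8 values plus one
# 2-byte scale per block, so 34 bytes cover 32 elements.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, cache_type="f16"):
    """Estimate the combined K and V cache size for one sequence slot."""
    elements = 2 * n_layers * n_kv_heads * head_dim * ctx_size  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # Hypothetical architecture values, for illustration only.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, ctx_size=2048)
    f16 = kv_cache_bytes(cache_type="f16", **shape)
    q8 = kv_cache_bytes(cache_type="q8_0", **shape)
    print(f"f16:  {f16 / 2**20:.0f} MiB")
    print(f"q8_0: {q8 / 2**20:.0f} MiB")
    print(f"saving: {1 - q8 / f16:.1%}")
```

Under these assumptions the saving is a little under 47%, and the cache grows linearly with the context size, which is why a 2048-token window keeps it small.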

Model Performance Matrix

Model	Parameters	VRAM (Q4)	Speed (tok/s)	Stable
Gemma 2	9B	4.5 GB	50	✓
Qwen2.5	7B	3.5 GB	55	✓
Llama 3.1	8B	4 GB	50	✓
Phi-3 Medium	14B	7 GB	35	⚠
Mistral 7B	7B	3.5 GB	50	✓

Hardware Specifications

GPU:

NVIDIA GTX 1650

VRAM:

4 GB GDDR5

System RAM:

36 GB

Connectivity:

PCIe 3.0 x16

Offloading Strategy

• GPU: 4 GB model weights + active KV cache
• RAM: 32 GB remaining layers + history
• Speed: 20-50 tok/sec (acceptable)
• Stability: Zero crashes with q8_0 quantization
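The split between GPU and RAM can be approximated before launching. The sketch below assumes every layer is the same size and ignores embeddings and the output layer, so it only gives a starting point for the layer-offload setting. The function name, the layer count in the example, and the 1 GB headroom reserved for the KV cache and driver overhead are all illustrative assumptions.

```python
def layers_that_fit(model_gb, n_layers, vram_gb, reserved_gb=1.0):
    """Return how many whole layers fit in VRAM after reserving headroom."""
    per_layer_gb = model_gb / n_layers             # assumes uniform layer size
    usable_gb = max(vram_gb - reserved_gb, 0.0)    # never go negative
    return min(n_layers, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    # Hypothetical: a 4.5 GB model with 40 layers on a 4 GB card.
    print(layers_that_fit(model_gb=4.5, n_layers=40, vram_gb=4.0))
```

Starting a few layers below the estimate and raising the value while watching VRAM usage is the safer direction.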

Quick Reference

Size Calculation

10B parameters × ~0.5 GB per billion parameters ≈ 5 GB at Q4 (4 bits per weight; Q4_K_M files run somewhat larger)
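The same rule of thumb as a small function, a sketch assuming exactly 4 bits per weight and 1 GB = 10^9 bytes. The function name is hypothetical, and real quantized files are usually somewhat larger because some tensors are kept at higher precision.

```python
def estimate_weights_gb(params_billion, bits_per_weight=4.0):
    """Approximate weight size in GB for a given parameter count."""
    # 1e9 parameters * (bits / 8) bytes each = (bits / 8) GB per billion.
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for size in (7, 9, 10, 14):
        print(f"{size}B -> {estimate_weights_gb(size):.1f} GB")
```

Passing a higher `bits_per_weight` gives a more conservative estimate for mixed-precision quantizations.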

Download Models

ollama pull gemma2:9b
or
huggingface-cli download

Monitor Memory

Linux: watch -n 1 'nvidia-smi && free -h'
Windows: tasklist /v | find "llama"
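For logging VRAM usage from a script rather than watching a terminal, `nvidia-smi` can emit machine-readable CSV. The sketch below is one possible approach and assumes `nvidia-smi` is on the PATH; the function names are illustrative.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one line per GPU."""
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def read_gpu_memory():
    """Run nvidia-smi once and return a list of (used_mib, total_mib)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    try:
        for index, (used, total) in enumerate(read_gpu_memory()):
            print(f"GPU {index}: {used} / {total} MiB ({used / total:.0%})")
    except (FileNotFoundError, subprocess.CalledProcessError) as error:
        print(f"nvidia-smi unavailable: {error}")
```

Calling `read_gpu_memory()` in a loop with a short sleep gives a simple usage log while the server is under load.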

Key Flags

--ctx-size 2048 → Prevent bloat
-np 1 → No fragmentation
-ngl 20 → GPU/RAM balance
--cache-type-k q8_0 → Save memory