SERVER OFFLINE
LLAMA-SERVER | NEURAL INFERENCE
00:00:00 UTC
LLAMA-SERVER
Local AI Inference Protocol | RTX 1650 Optimized
Configuration Module
Quick Setup
Advanced
Select Model Architecture
› Gemma 2 9B
VRAM: 4.5GB | Speed: 50 tok/s
RECOMMENDED
› Qwen2.5 7B
VRAM: 3.5GB | Speed: 55 tok/s
STABLE
› Llama 3.2 8B
VRAM: 4GB | Speed: 50 tok/s
STABLE
› Phi-3 Medium 14B
VRAM: 7GB | Speed: 35 tok/s
ADVANCED
› Mistral 7B v0.3
VRAM: 3.5GB | Speed: 50 tok/s
STABLE
Context Window (Tokens)
1024 - Conservative
2048 - Recommended
4096 - Advanced
Server Port
Copy Command
Full Guide
✓ Optimized for RTX 1650 + 36GB RAM
✓ Zero-crash configuration
✓ Extended chat stability
GPU Layers Offload (--ngl)
Lower values = more CPU, higher values = more GPU
Batch Size
KV Cache Type
q8_0 - Quantized (Recommended)
f16 - Full Precision
Host Memory Cache (MB)
Update Command
Command & Monitoring
Est. VRAM
4.5GB
Tokens/Sec
50
Start Command
COPY
./llama-server \ --model gemma2-9b-q4_k_m.gguf \ --ctx-size 2048 \ --np 1 \ --batch-size 512 \ --flash-attn on \ --cache-type-k q8_0 \ --cache-type-v q8_0 \ --ngl 20 \ --port 8080
Ollama Shortcut
COPY
ollama run gemma2:9b
⚙ CRITICAL SETTINGS FOR STABILITY:
Context: 2048 (prevents memory bloat)
Parallel slots: 1 (no fragmentation)
KV quantization: q8_0 (saves 50% cache)
No Gemma 4 (known crash bug)
Model Performance Matrix
Model
Gemma 2 9B
VRAM (Q4)
4.5 GB
Speed
50 tok/s
Model
Parameters
VRAM
Speed
Stable
Gemma 2
9B
4.5GB
50
✓
Qwen2.5
7B
3.5GB
55
✓
Llama 3.2
8B
4GB
50
✓
Phi-3 Medium
14B
7GB
35
⚠
Mistral 7B
7B
3.5GB
50
✓
Hardware Specifications
GPU:
NVIDIA RTX 1650
VRAM:
4 GB GDDR5
System RAM:
36 GB
Connectivity:
PCIe 3.0 x16
Offloading Strategy
•
GPU:
4 GB model weights + active KV cache
•
RAM:
32 GB remaining layers + history
•
Speed:
20-50 tok/sec (acceptable)
•
Stability:
Zero crashes with q8_0 quantization
Quick Reference
Size Calculation
10B model × 0.5 GB/B = 5 GB at Q4
Download Models
ollama pull gemma2:9b
or
huggingface-cli download
Monitor Memory
Linux: watch -n 1 'nvidia-smi && free -h'
Windows: tasklist /v | find "llama"
Key Flags
--ctx-size 2048 → Prevent bloat
--np 1 → No fragmentation
--ngl 20 → GPU/RAM balance
--cache-type-k q8_0 → Save memory