Run language, vision, speech, image and embedding models locally. CPU, GPU, NPU. Powered by upstream llama.cpp, whisper.cpp and Piper — Day 0 support for every new model that ships.
## Features
- **Automatic optimization.** DeepNetz selects the best backend for your hardware and optimizes memory usage to run models that otherwise wouldn't fit.
- **Six inference backends.** llama.cpp, Hugging Face Transformers, ExLlamaV2, vLLM, CTranslate2, and ONNX Runtime. DeepNetz picks the optimal one automatically.
- **KV cache compression.** Intelligent key-value cache management significantly reduces VRAM usage, enabling larger context windows on constrained hardware.
- **Built-in Web UI.** A clean, responsive chat interface ships out of the box. No separate frontend setup needed: just run and open your browser.
- **Tool calling.** Native function/tool calling support compatible with OpenAI-style tool definitions. Build agents that interact with external APIs.
- **Hardware auto-detection.** Detects your GPU, VRAM, CPU cores, and RAM at startup, then automatically configures quantization level, batch size, and layer offloading.
- **OpenAI-compatible API.** Drop-in replacement for the OpenAI API. Use your existing code and tools; just change the base URL to your local DeepNetz server.
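Because the server speaks the OpenAI chat-completions protocol, existing client code only needs its base URL repointed at the local server. A minimal sketch using only the standard library to build such a request (the endpoint path and payload schema are the standard OpenAI ones; port 8000 matches the default used elsewhere in this README):

```python
import json
from urllib import request

# Standard OpenAI-style chat-completions payload; nothing here is
# DeepNetz-specific except the locally served model name.
payload = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = request.Request(
    "http://localhost:8000/v1/chat/completions",  # local server instead of api.openai.com
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# request.urlopen(req) would return a normal chat-completion response
# once a server is running; the call is omitted so this snippet runs offline.
```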
## Benchmarks
Measured on consumer hardware. No cherry-picked results. Quality delta is vs. uncompressed FP16 baseline.
| Model | Parameters | Quality Delta | Throughput | VRAM Used |
|---|---|---|---|---|
| Llama 3.2 | 3B | +0.4% | 42.1 tok/s | 2.8 GB |
| Gemma 2 | 27B | +2.0% | 8.7 tok/s | 14.2 GB |
| Qwen 2.5 | 35B | +2.7% | 6.3 tok/s | 18.1 GB |
| Command R+ | 122B | — | 1.3 tok/s | 48 GB (CPU offload) |
Benchmarked on a single RTX 4090 (24 GB VRAM) with 64 GB DDR5 RAM. The 122B model uses partial CPU offloading. Quality delta is measured on an MMLU subset; positive values mean the quantized model scored slightly above the FP16 baseline, likely because quantization noise acts as a mild regularizer.
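As a sanity check on the VRAM column, the raw weight footprint of a quantized model is roughly parameters × bits per weight / 8; the KV cache and activations account for the remainder. A back-of-envelope sketch (the 4 bits/weight figure is an assumption typical of 4-bit quantizations, not a published DeepNetz number):

```python
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Estimate raw weight memory in GB: parameters * bits / 8."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9  # simplifies to params_b * bits / 8

# Assuming ~4 bits/weight (typical for 4-bit quants):
print(weight_gb(3, 4.0))   # 3B  -> 1.5 GB of weights
print(weight_gb(27, 4.0))  # 27B -> 13.5 GB of weights
print(weight_gb(35, 4.0))  # 35B -> 17.5 GB of weights
```

The gap between these estimates and the table's VRAM figures (e.g. 17.5 GB vs. 18.1 GB for the 35B model) is plausibly KV cache plus activation overhead.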
## Comparison
Different tools for different needs. Here's where DeepNetz fits.
| Feature | DeepNetz | Ollama | LM Studio | vLLM |
|---|---|---|---|---|
| Free Tier | ✓ Apache-2.0 | ✓ MIT | ✕ Proprietary | ✓ Apache-2.0 |
| Multiple Backends | ✓ 6 backends | ✕ llama.cpp only | ✕ llama.cpp only | ✕ vLLM only |
| Hardware Auto-Detect | ✓ Full | ○ Basic | ○ Basic | ✕ Manual |
| KV Cache Compression | ✓ | ✕ | ✕ | ○ Experimental |
| Tool Calling | ✓ Native | ✓ | ○ Limited | ✓ |
| Web UI | ✓ Built-in | ✕ CLI only | ✓ Desktop app | ✕ |
| CPU + GPU Offloading | ✓ Auto | ○ Manual | ○ Manual | ✕ GPU only |
| Python Library | ✓ pip install | ✕ Binary | ✕ Binary | ✓ pip install |
| Production Batching | ○ Basic | ✕ | ✕ | ✓ Continuous |
## Quick Start
Install, load a model, generate. That's it.
```bash
# Install
pip install deepnetz
```

```python
# Python API
from deepnetz import DeepNetz

# Initialize — hardware is auto-detected
dn = DeepNetz()

# Load any supported model (GGUF, HF, GPTQ, AWQ, EXL2)
dn.load("meta-llama/Llama-3.2-3B-Instruct")

# Generate
response = dn.chat("Explain quantum computing in simple terms.")
print(response)

# Or start the server with Web UI
dn.serve(port=8000)  # Opens http://localhost:8000
```
```bash
# Start a server (OpenAI-compatible API)
deepnetz serve --model meta-llama/Llama-3.2-3B-Instruct --port 8000

# Interactive chat in your terminal
deepnetz chat --model meta-llama/Llama-3.2-3B-Instruct

# Benchmark a model on your hardware
deepnetz bench --model meta-llama/Llama-3.2-3B-Instruct
```
## Architecture
DeepNetz sits between your code and the inference backend. It handles hardware detection, model loading, and optimization automatically.
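The selection step can be pictured as a simple capability match. This is a hypothetical sketch of the idea, not DeepNetz's actual logic; the backend names and thresholds are illustrative:

```python
def pick_backend(vram_gb: float, has_gpu: bool, model_gb: float) -> str:
    """Hypothetical sketch of hardware-aware backend selection.

    Illustrates the decision shape only: fits in VRAM -> GPU backend,
    partial fit -> offloading backend, no GPU -> CPU inference.
    """
    if not has_gpu:
        return "llama.cpp (CPU)"
    if model_gb <= vram_gb:
        return "ExLlamaV2"  # fully GPU-resident
    return "llama.cpp (GPU + CPU offload)"  # split across VRAM and RAM

# A 48 GB model on a 24 GB GPU lands in the offloading branch,
# matching the Command R+ row in the benchmark table.
print(pick_backend(vram_gb=24, has_gpu=True, model_gb=48))
```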