One App. Every Model. Every Machine.

The universal AI runtime
for your hardware

Run language, vision, speech, image, and embedding models locally. CPU, GPU, NPU. Powered by upstream llama.cpp, whisper.cpp, and Piper — Day 0 support for every new model that ships.

10
Model Types
130+
Curated Models
Day 0
Upstream Tracking

Built for real-world inference

DeepNetz automatically selects the best backend for your hardware and optimizes memory usage to run models that otherwise wouldn't fit.

6 Inference Backends

llama.cpp, HuggingFace Transformers, ExLlamaV2, vLLM, CTranslate2, and ONNX Runtime. DeepNetz picks the optimal one automatically.
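The selection logic isn't spelled out here, so as an illustrative sketch only — the `pick_backend` helper and its heuristics are assumptions, not DeepNetz's actual code — a router might weigh model format and hardware like this:

```python
# Illustrative sketch of automatic backend selection. The heuristics
# below are assumptions for illustration, not DeepNetz's real logic.

def pick_backend(model_format: str, has_gpu: bool, vram_gb: float) -> str:
    """Map a model format plus detected hardware to an inference backend."""
    if model_format == "gguf":
        return "llama.cpp"          # GGUF is llama.cpp's native format
    if model_format == "exl2":
        return "exllamav2"          # EXL2 quants target CUDA GPUs
    if model_format == "onnx":
        return "onnxruntime"
    if has_gpu and vram_gb >= 16:
        return "vllm"               # high-throughput serving on big GPUs
    if has_gpu:
        return "transformers"       # general-purpose GPU fallback
    return "ctranslate2"            # fast CPU inference

print(pick_backend("gguf", has_gpu=False, vram_gb=0))   # llama.cpp
print(pick_backend("safetensors", True, 24.0))          # vllm
```

The real router presumably also weighs quantization level and context length, but the shape — format first, then hardware — is the common pattern.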

KV Cache Compression

Intelligent key-value cache management reduces VRAM usage significantly, enabling larger context windows on constrained hardware.
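To make the idea concrete, here is a toy sketch of one common compression technique — storing cached values in 8 bits with a per-block scale. This is an illustration of the general approach, not DeepNetz's actual algorithm:

```python
# Toy sketch of 8-bit KV cache quantization: store each cached value
# in one byte instead of two (FP16) or four (FP32), plus one scale.
# Illustrative only -- not DeepNetz's actual compression scheme.

def quantize_q8(values):
    """Symmetric 8-bit quantization with a single float scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize_q8(quantized, scale):
    return [q * scale for q in quantized]

kv = [0.82, -1.54, 0.03, 2.10, -0.66]       # a few cache entries
q, scale = quantize_q8(kv)
restored = dequantize_q8(q, scale)

# Memory drops ~2x vs FP16 at the cost of a bounded rounding error:
err = max(abs(a - b) for a, b in zip(kv, restored))
print(err < scale)  # True: error is within one quantization step
```

Halving each cache entry is what frees the VRAM for longer context windows on the same card.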

Built-in Web UI

A clean, responsive chat interface ships out of the box. No separate frontend setup needed — just run and open your browser.

Tool Calling

Native function/tool calling support compatible with OpenAI-style tool definitions. Build agents that interact with external APIs.
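The tool-definition schema below is the OpenAI convention the text refers to; the small dispatcher around it is a sketch of how an agent loop can execute the calls a model emits (the `get_weather` stub and registry are illustrative, not part of DeepNetz):

```python
import json

# An OpenAI-style tool definition (this JSON-schema shape is the
# OpenAI convention; the surrounding dispatcher is illustrative).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub; a real tool would call an API

REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> str:
    """Execute one tool call of the shape OpenAI-style APIs return."""
    fn = REGISTRY[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    return fn(**args)

# A tool call as a model would emit it:
call = {"function": {"name": "get_weather",
                     "arguments": '{"city": "Oslo"}'}}
print(dispatch(call))  # Sunny in Oslo
```

Note that `arguments` arrives as a JSON string, not a dict — forgetting the `json.loads` is the classic first bug in any tool-calling loop.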

Hardware Auto-Detection

Detects your GPU, VRAM, CPU cores, and RAM at startup. Automatically configures quantization level, batch size, and layer offloading.
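As a rough sketch of what startup detection can look like using only the standard library — DeepNetz's actual detector is not shown here and presumably also queries CUDA/ROCm for GPU model and VRAM:

```python
import os
import platform
import shutil

# Minimal hardware probe using only the standard library. Illustrative:
# a real detector would also query the GPU driver for VRAM and device name.

def detect_hardware() -> dict:
    return {
        "platform": platform.system(),            # e.g. "Linux", "Windows"
        "machine": platform.machine(),            # e.g. "x86_64", "arm64"
        "cpu_cores": os.cpu_count() or 1,
        "has_nvidia_smi": shutil.which("nvidia-smi") is not None,
    }

hw = detect_hardware()
print(hw["cpu_cores"] >= 1)  # True on any machine
```

Values like these are what feed the downstream choices the text describes: quantization level, batch size, and how many layers to offload.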

OpenAI-Compatible API

Drop-in replacement for the OpenAI API. Use your existing code and tools — just change the base URL to your local DeepNetz server.
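The request shape below follows the OpenAI chat-completions convention; the URL assumes a local DeepNetz server on port 8000 as in the quickstart:

```python
import json
import urllib.request

# Builds an OpenAI-style chat request against a local server. The
# endpoint path and payload shape are the OpenAI convention; the
# base URL assumes a DeepNetz server started on port 8000.

def build_chat_request(base_url: str, model: str, prompt: str):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer not-needed-locally"},
    )

req = build_chat_request("http://localhost:8000",
                         "meta-llama/Llama-3.2-3B-Instruct", "Hello!")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
# urllib.request.urlopen(req) sends it once the server is running.
```

With the official `openai` Python package, the same switch is one line: `OpenAI(base_url="http://localhost:8000/v1", api_key="none")` — the rest of your code stays unchanged.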

Honest numbers, real hardware

Measured on consumer hardware, with no cherry-picked results. Quality delta is relative to an uncompressed FP16 baseline.

| Model | Parameters | Quality Delta | Throughput | VRAM Used |
|---|---|---|---|---|
| Llama 3.2 | 3B | +0.4% | 42.1 tok/s | 2.8 GB |
| Gemma 2 | 27B | +2.0% | 8.7 tok/s | 14.2 GB |
| Qwen 2.5 | 35B | +2.7% | 6.3 tok/s | 18.1 GB |
| Command R+ | 122B | | 1.3 tok/s | 48 GB (CPU offload) |

Benchmarked on a single RTX 4090 (24 GB VRAM), 64 GB DDR5 RAM. The 122B model uses partial CPU offloading. Quality delta measured on MMLU subset. Positive values = better than baseline (due to quantization noise regularization).

How DeepNetz compares

Different tools for different needs. Here's where DeepNetz fits.

| Feature | DeepNetz | Ollama | LM Studio | vLLM |
|---|---|---|---|---|
| Free Tier | ✓ Apache-2.0 | ✓ MIT | ✕ Proprietary | ✓ Apache-2.0 |
| Multiple Backends | ✓ 6 backends | ✕ llama.cpp only | ✕ llama.cpp only | ✕ vLLM only |
| Hardware Auto-Detect | ✓ Full | ○ Basic | ○ Basic | ✕ Manual |
| KV Cache Compression | ○ Experimental | | | |
| Tool Calling | ✓ Native | ○ Limited | | |
| Web UI | ✓ Built-in | ✕ CLI only | ✓ Desktop app | |
| CPU + GPU Offloading | ✓ Auto | ○ Manual | ○ Manual | ✕ GPU only |
| Python Library | ✓ pip install | ✕ Binary | ✕ Binary | ✓ pip install |
| Production Batching | ○ Basic | | | ✓ Continuous |

Up and running in 30 seconds

Install, load a model, generate. That's it.

Python
# Install
pip install deepnetz

# Python API
from deepnetz import DeepNetz

# Initialize — hardware is auto-detected
dn = DeepNetz()

# Load any supported model (GGUF, HF, GPTQ, AWQ, EXL2)
dn.load("meta-llama/Llama-3.2-3B-Instruct")

# Generate
response = dn.chat("Explain quantum computing in simple terms.")
print(response)

# Or start the server with Web UI
dn.serve(port=8000)  # Opens http://localhost:8000
CLI
# Start a server (OpenAI-compatible API)
$ deepnetz serve --model meta-llama/Llama-3.2-3B-Instruct --port 8000

# Interactive chat in your terminal
$ deepnetz chat --model meta-llama/Llama-3.2-3B-Instruct

# Benchmark a model on your hardware
$ deepnetz bench --model meta-llama/Llama-3.2-3B-Instruct

How it works

DeepNetz sits between your code and the inference backend. It handles hardware detection, model loading, and optimization automatically.

Your Application
Python API · CLI · Web UI · REST API
DeepNetz Core Engine
Hardware Detector
  • GPU / VRAM
  • CPU / RAM
  • Platform
Model Router
  • Format Detection
  • Backend Selection
  • Auto-Download
Optimization
  • KV Cache Compression
  • Layer Offloading
  • Quant Selection
llama.cpp (GGUF)
Ollama (API)
vLLM (Server)
HuggingFace (Transformers)
LM Studio (Local)
Remote (API)