Run language, vision, speech, image and embedding models locally. CPU, GPU, NPU. Powered by upstream llama.cpp, whisper.cpp and Piper — Day 0 support for every new model that ships.
## Features
- **Automatic optimization.** DeepNetz selects the best backend for your hardware and optimizes memory usage to run models that otherwise wouldn't fit.
- **Six inference backends.** llama.cpp, Hugging Face Transformers, ExLlamaV2, vLLM, CTranslate2, and ONNX Runtime. DeepNetz picks the optimal one automatically.
- **KV cache compression.** Intelligent key-value cache management significantly reduces VRAM usage, enabling larger context windows on constrained hardware.
- **Built-in Web UI.** A clean, responsive chat interface ships out of the box. No separate frontend setup needed: just run and open your browser.
- **Tool calling.** Native function/tool calling support compatible with OpenAI-style tool definitions. Build agents that interact with external APIs.
- **Hardware auto-detection.** Detects your GPU, VRAM, CPU cores, and RAM at startup, then automatically configures quantization level, batch size, and layer offloading.
- **OpenAI-compatible API.** Drop-in replacement for the OpenAI API. Use your existing code and tools; just change the base URL to your local DeepNetz server.
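Because the server speaks the OpenAI chat-completions protocol, existing client code only needs its base URL repointed at the local server. A minimal sketch using only the standard library to build such a request (the endpoint path and payload schema are the standard OpenAI ones; port 8000 matches the default used elsewhere in this README):

```python
import json
from urllib import request

# Standard OpenAI-style chat-completions payload; nothing here is
# DeepNetz-specific except the locally served model name.
payload = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = request.Request(
    "http://localhost:8000/v1/chat/completions",  # local server instead of api.openai.com
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# request.urlopen(req) would return a normal chat-completion response
# once a server is running; the call is omitted so this snippet runs offline.
```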
## Benchmarks
Measured on consumer hardware. No cherry-picked results. Quality delta is vs. uncompressed FP16 baseline.
| Model | Parameters | Quality Delta | Throughput | VRAM Used |
|---|---|---|---|---|
| Llama 3.2 | 3B | +0.4% | 42.1 tok/s | 2.8 GB |
| Gemma 2 | 27B | +2.0% | 8.7 tok/s | 14.2 GB |
| Qwen 2.5 | 35B | +2.7% | 6.3 tok/s | 18.1 GB |
| Command R+ | 122B | — | 1.3 tok/s | 48 GB (CPU offload) |
Benchmarked on a single RTX 4090 (24 GB VRAM) with 64 GB DDR5 RAM. The 122B model uses partial CPU offloading. Quality delta is measured on an MMLU subset; positive values mean the quantized model scored slightly above the FP16 baseline, likely because quantization noise acts as a mild regularizer.
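As a sanity check on the VRAM column, the raw weight footprint of a quantized model is roughly parameters × bits per weight / 8; the KV cache and activations account for the remainder. A back-of-envelope sketch (the 4 bits/weight figure is an assumption typical of 4-bit quantizations, not a published DeepNetz number):

```python
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Estimate raw weight memory in GB: parameters * bits / 8."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9  # simplifies to params_b * bits / 8

# Assuming ~4 bits/weight (typical for 4-bit quants):
print(weight_gb(3, 4.0))   # 3B  -> 1.5 GB of weights
print(weight_gb(27, 4.0))  # 27B -> 13.5 GB of weights
print(weight_gb(35, 4.0))  # 35B -> 17.5 GB of weights
```

The gap between these estimates and the table's VRAM figures (e.g. 17.5 GB vs. 18.1 GB for the 35B model) is plausibly KV cache plus activation overhead.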
## Comparison
Different tools for different needs. Here's where DeepNetz fits.
| Feature | DeepNetz | Ollama | LM Studio | vLLM |
|---|---|---|---|---|
| Free Tier | ✓ Apache-2.0 | ✓ MIT | ✕ Proprietary | ✓ Apache-2.0 |
| Multiple Backends | ✓ 6 backends | ✕ llama.cpp only | ✕ llama.cpp only | ✕ vLLM only |
| Hardware Auto-Detect | ✓ Full | ○ Basic | ○ Basic | ✕ Manual |
| KV Cache Compression | ✓ | ✕ | ✕ | ○ Experimental |
| Tool Calling | ✓ Native | ✓ | ○ Limited | ✓ |
| Web UI | ✓ Built-in | ✕ CLI only | ✓ Desktop app | ✕ |
| CPU + GPU Offloading | ✓ Auto | ○ Manual | ○ Manual | ✕ GPU only |
| Python Library | ✓ pip install | ✕ Binary | ✕ Binary | ✓ pip install |
| Production Batching | ○ Basic | ✕ | ✕ | ✓ Continuous |
## Quick Start
Install, load a model, generate. That's it.
```bash
# Install
pip install deepnetz
```

```python
# Python API
from deepnetz import DeepNetz

# Initialize — hardware is auto-detected
dn = DeepNetz()

# Load any supported model (GGUF, HF, GPTQ, AWQ, EXL2)
dn.load("meta-llama/Llama-3.2-3B-Instruct")

# Generate
response = dn.chat("Explain quantum computing in simple terms.")
print(response)

# Or start the server with Web UI
dn.serve(port=8000)  # Opens http://localhost:8000
```
```bash
# Start a server (OpenAI-compatible API)
deepnetz serve --model meta-llama/Llama-3.2-3B-Instruct --port 8000

# Interactive chat in your terminal
deepnetz chat --model meta-llama/Llama-3.2-3B-Instruct

# Benchmark a model on your hardware
deepnetz bench --model meta-llama/Llama-3.2-3B-Instruct
```
## Architecture
DeepNetz sits between your code and the inference backend. It handles hardware detection, model loading, and optimization automatically.
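The selection step can be pictured as a simple capability match. This is a hypothetical sketch of the idea, not DeepNetz's actual logic; the backend names and thresholds are illustrative:

```python
def pick_backend(vram_gb: float, has_gpu: bool, model_gb: float) -> str:
    """Hypothetical sketch of hardware-aware backend selection.

    Illustrates the decision shape only: fits in VRAM -> GPU backend,
    partial fit -> offloading backend, no GPU -> CPU inference.
    """
    if not has_gpu:
        return "llama.cpp (CPU)"
    if model_gb <= vram_gb:
        return "ExLlamaV2"  # fully GPU-resident
    return "llama.cpp (GPU + CPU offload)"  # split across VRAM and RAM

# A 48 GB model on a 24 GB GPU lands in the offloading branch,
# matching the Command R+ row in the benchmark table.
print(pick_backend(vram_gb=24, has_gpu=True, model_gb=48))
```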