Atenia Engine
LLM Inference Runtime — Rust — Apache 2.0

Run models on hardware
where other engines give up.

Atenia Engine is a from-scratch LLM inference runtime written in Rust. It runs Llama 2 13B Chat on a laptop with 8 GB of VRAM and 32 GB of RAM — spreading the model automatically across VRAM, RAM, and NVMe so it simply fits, with nothing for you to configure.

Seven model families. One interactive command. Every certified model measured against a double-precision reference.

Try it

Talk to a model in one command.

Download any supported checkpoint, point Atenia at the folder, and start chatting. No Python environment, no inference server, no configuration — just the model and a prompt.

atenia chat --model ./models/llama-3.2-1b-instruct

An interactive, multi-turn chat in your terminal. Prefer a single answer? Use atenia generate --prompt "...". Not sure your machine is ready? Run atenia doctor first — it checks your CPU, RAM, GPU and build in one line.

Where the project stands

Milestone | What it delivered | Result
M4.7 | Beyond-VRAM execution: Llama 2 13B on 8 GB VRAM | [PASS] ✓ bit-exact pre/post spill
M5 | Tokenizer + KV cache + text generation | atenia generate: coherent chat output
M6–M8 | Tier-aware loader + BF16-resident VRAM kernels | 1.31×–1.46× faster, bit-identical output
M8.7 | Disk → GPU JIT streaming pipeline | 154 weights/forward · 98.7% prefetch hit rate
Multi-family | Llama, Qwen, Gemma, Phi, Mistral, SmolLM, Falcon3 | 7 families validated
Adapter Toolkit v2 | Describe a model in a small YAML file, no Rust required | load · inspect · debug
CLI | Human-readable errors, logging, diagnostics, interactive chat | generate · chat · doctor · diagnose

The demo that started it

Llama 2 13B Chat on hardware where it barely fits.

The model is 26 GB. The VRAM is 8 GB. The RAM is 32 GB. Most engines stop here. Atenia moves parameters between tiers as execution proceeds — no intervention required.
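To make the idea of tiered placement concrete, here is a minimal sketch of one possible policy: a greedy first-fit that puts each tensor in the fastest tier that still has room. This page does not describe Atenia's actual placement logic, so the `Tier` enum, the `place` function and the budget parameters are hypothetical names used only for illustration.

```rust
/// Storage tiers, ordered from fastest to slowest.
/// (Hypothetical type for illustration, not Atenia's API.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Vram,
    Ram,
    Nvme,
}

/// Greedy first-fit placement: each tensor goes to the fastest tier
/// that still has enough free space. NVMe is treated as unbounded.
/// Sizes and budgets share one unit (bytes, MB, GB, ...).
fn place(sizes: &[u64], vram_budget: u64, ram_budget: u64) -> Vec<Tier> {
    let mut vram_left = vram_budget;
    let mut ram_left = ram_budget;
    sizes
        .iter()
        .map(|&size| {
            if size <= vram_left {
                vram_left -= size;
                Tier::Vram
            } else if size <= ram_left {
                ram_left -= size;
                Tier::Ram
            } else {
                Tier::Nvme
            }
        })
        .collect()
}
```

Calling `place(&[4, 4, 4, 20, 10], 8, 24)` puts the first two tensors in VRAM, the next two in RAM, and the last one on NVMe. A real runtime would also have to decide when to move tensors during execution, which this static sketch leaves out.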

166 s: load 363 parameter tensors (26 GB)
23 s: post-spill forward pass, same argmax as pre-spill
BIT-EXACT: spill transparency, argmax(pre) == argmax(post)

git clone https://github.com/AteniaEngine/ateniaengine.git
cd ateniaengine
cargo install --path .

huggingface-cli download meta-llama/Llama-2-13b-chat-hf \
    --local-dir ./models/llama-2-13b-chat

atenia run --mode c \
    --model ./models/llama-2-13b-chat \
    --cache-dir ./atenia-cache

Expected output: [PASS] ✓ argmax(pre-spill) == argmax(post-spill) bit-exactly, in approximately 7 minutes on a machine with 32 GB of RAM and 8 GB of VRAM.
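The check reported by the demo can be pictured with a short sketch. It assumes the forward pass yields a slice of `f32` logits before and after the spill. The function names below are hypothetical and this is not the code the `atenia run` command executes.

```rust
/// Index of the largest logit, or None for an empty slice.
/// Ties resolve to the last maximal element (behaviour of `max_by`).
fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(index, _)| index)
}

/// True when both runs pick the same token.
fn same_argmax(pre: &[f32], post: &[f32]) -> bool {
    argmax(pre).is_some() && argmax(pre) == argmax(post)
}

/// Stricter check: every logit has the same bit pattern in both runs.
/// Comparing bits (not values) also distinguishes 0.0 from -0.0.
fn bit_identical(pre: &[f32], post: &[f32]) -> bool {
    pre.len() == post.len()
        && pre.iter().zip(post).all(|(a, b)| a.to_bits() == b.to_bits())
}
```

Bitwise equality implies the same argmax, but not the other way round, so a test that asserts both is checking the stronger property.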

The models you already use

Bring your own checkpoint — Atenia probably already runs it.

Seven model families have been validated end-to-end: they load, they generate coherent text, and they stop cleanly. Both HuggingFace safetensors and quantised GGUF files work through the same path.

Family | Examples | Status
Llama | TinyLlama · Llama 2 · Llama 3.1 / 3.2 | validated
Qwen | Qwen 2.5 · Qwen3 | validated
Gemma | Gemma 2 · Gemma 3 (text) | validated
Phi | Phi-3 · Phi-3.5 · Phi-4 mini | validated
Mistral | Mistral 7B v0.2 / v0.3 | validated
SmolLM | SmolLM · SmolLM2 | validated
Falcon3 | Falcon3 1B / 3B / 7B | validated

Not sure a specific model will work? atenia diagnose --model ./your-model checks it in seconds, before you ever load it.
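For readers curious how a single loading path can accept both file formats mentioned above, one common first step is to sniff the file header. The sketch below relies only on the published formats: GGUF files start with the ASCII magic `GGUF`, and safetensors files start with an 8-byte little-endian header length followed by a JSON object. The names `CheckpointFormat` and `sniff_format` are hypothetical, and nothing here is taken from Atenia's loader or from `atenia diagnose`.

```rust
/// Checkpoint container formats named on this page.
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CheckpointFormat {
    Gguf,
    Safetensors,
}

/// Guess the format from the first bytes of a file.
/// Returns None when the header matches neither layout.
fn sniff_format(head: &[u8]) -> Option<CheckpointFormat> {
    // GGUF: 4-byte ASCII magic at offset 0.
    if head.starts_with(b"GGUF") {
        return Some(CheckpointFormat::Gguf);
    }
    // safetensors: u64 little-endian header length, then JSON text
    // that must begin with '{'.
    if head.len() >= 9 {
        let mut len_bytes = [0u8; 8];
        len_bytes.copy_from_slice(&head[..8]);
        let header_len = u64::from_le_bytes(len_bytes);
        if header_len > 0 && head[8] == b'{' {
            return Some(CheckpointFormat::Safetensors);
        }
    }
    None
}
```

In practice the first few bytes can be read with `std::fs::File` and `std::io::Read::read` and passed to `sniff_format`. A full validator would go on to parse the header rather than trust the first bytes alone.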

Mathematically correct by design

Validated against exact math — not against other frameworks.

Most inference engines run in BF16 by default. It is fast and memory-efficient — and it drifts from the true answer in ways that are rarely measured. Atenia measures every certified model against an F64 (double-precision) reference and publishes the result with the model.

Same checkpoints, same weights, same inputs. Every number below is reproducible on your own machine with one cargo test command.

Model | Atenia F32 | PyTorch BF16 | BF16 vs. F32
TinyLlama 1.1B | 0.000141 | 0.732 | 5,198× worse
SmolLM2 1.7B | 0.001446 | 14.01 | 9,692× worse
Qwen 2.5 1.5B | 0.000346 | 1.531 | 4,420× worse
Llama 3.2 1B | 0.000132 | 0.539 | 4,096× worse

Lower is closer to the true answer. Each certified checkpoint ships a signed numcert.json that records its measured drift — no other inference runtime publishes this.
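The page does not state how drift is computed, so the following is only a sketch under stated assumptions: drift is taken to be the maximum absolute difference between candidate outputs and an F64 reference, and BF16 is emulated by rounding an `f32` to its top 16 bits (round to nearest, ties to even). Both function names are hypothetical, and NaN inputs are not handled.

```rust
/// Round an f32 to the nearest BF16 value (ties to even) and return it
/// widened back to f32. BF16 keeps the sign, the 8 exponent bits and
/// the top 7 mantissa bits of an f32. NaN is not special-cased here.
fn to_bf16(x: f32) -> f32 {
    let bits = x.to_bits();
    // Add 0x7FFF plus the lowest kept bit, then drop the low 16 bits.
    let rounding_bias = 0x7FFF + ((bits >> 16) & 1);
    f32::from_bits(bits.wrapping_add(rounding_bias) & 0xFFFF_0000)
}

/// Assumed drift metric: largest absolute gap between the candidate
/// values and the double-precision reference, element by element.
fn max_abs_drift(reference: &[f64], candidate: &[f32]) -> f64 {
    reference
        .iter()
        .zip(candidate)
        .map(|(r, c)| (r - f64::from(*c)).abs())
        .fold(0.0, f64::max)
}
```

Rounding the same reference values once to F32 and once to BF16, then passing each result to `max_abs_drift`, shows the gap in a single rounding step: BF16 keeps 8 significant bits where F32 keeps 24. The table above reports end-to-end model outputs, where such rounding errors accumulate across many operations.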

Nothing here is a promise

Every claim on this page is backed by a test you can run.

The repository ships a large suite of executable tests covering tensor math, model loading, generation, numeric drift and the CLI itself — not benchmarks on ideal hardware, not cherry-picked prompts. Clone the repo and check for yourself:

cargo test --lib

What Atenia Engine is not

  • Not a machine learning framework
  • Not a compiler or graph optimizer
  • Not a performance-at-all-costs system
  • Not a heuristic tuning layer

Atenia does not modify model semantics. It makes smarter decisions about where parameters live and when they move — and keeps those decisions visible and auditable.

References

Project, research and authorship