Roadmap – Atenia Engine

Roadmap — APX v12 → v25

What is built, what is being built,
and what comes next.

Only capabilities backed by executable tests are documented as completed. Future directions are non-binding and subject to the same standard: observable behavior, reproducible tests, stability under reality.

Completed

Version	What it closed	Key result
v13–v19	Execution scaffolding — policy layer, guards, contracts, memory telemetry, sensor-to-decision pipeline	Live hardware feedback loop
M4–M4.6	Safetensors loader, TinyLlama end-to-end, Llama-family expansion (4 models F64-validated)	4,096×–9,692× closer to F64 truth than PyTorch BF16
M4.7	Beyond-VRAM execution — Llama 2 13B on 8 GB VRAM + 32 GB RAM + NVMe	[PASS] ✓ argmax bit-exact pre/post spill
M4.8	AVX2+FMA matmul dispatcher — SIMD + matrixmultiply	49.5× on production shape · 13B: 18.75 min → 5.38 min
M4.9 · M5	Public CLI · tokenizer + KV cache + autoregressive generation	`atenia run` · `atenia generate` — coherent output
M6–M7	Tier-aware GPU loader — VRAM → RAM → NVMe automatic placement	1.46× on 7B · 13B on 32 GB without BSOD
M8	BF16-resident VRAM kernels — double effective VRAM capacity	1.31× on 7B · 1.36× on 13B · ADR-004 preserved
M8.7	Disk → GPU JIT streaming pipeline — async NVMe read + PCIe upload + GPU matmul	154 weights/forward · 98.7 % prefetch hit rate
Multi-family	Llama, Qwen, Gemma, Phi, Mistral, SmolLM and Falcon3 validated end-to-end	7 families · safetensors + GGUF
Adapter Toolkit v2	Declarative YAML adapter specs — describe a model without writing Rust	`load · inspect · debug` · v1 compatibility preserved
CLI	Human errors, stable exit codes, logging levels, host/model diagnostics, interactive chat	`generate · chat · doctor · diagnose · capabilities`
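The tier-aware placement in the M6–M7 row can be pictured with a small sketch. The page does not describe the loader's actual policy, so the following is only an assumed greedy strategy: walk the tensors in order and put each one in the fastest tier that still has room. `Tier`, `place`, and the byte budgets are hypothetical names, not part of the Atenia API.

```rust
/// Storage tiers, fastest first. Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Vram,
    Ram,
    Nvme,
}

/// Greedy placement sketch: each tensor goes to the fastest tier with enough
/// remaining budget. NVMe is treated as unbounded here (an assumption).
pub fn place(tensor_bytes: &[u64], vram_budget: u64, ram_budget: u64) -> Vec<Tier> {
    let mut vram_left = vram_budget;
    let mut ram_left = ram_budget;
    tensor_bytes
        .iter()
        .map(|&size| {
            if size <= vram_left {
                vram_left -= size;
                Tier::Vram
            } else if size <= ram_left {
                ram_left -= size;
                Tier::Ram
            } else {
                Tier::Nvme
            }
        })
        .collect()
}
```

Calling `place(&[4, 4, 4, 4], 8, 4)` fills VRAM with the first two tensors, RAM with the third, and spills the fourth to NVMe. A real loader would likely also weigh access frequency and transfer cost, which this sketch ignores.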

M8 — BF16 VRAM kernels (detail)

Storing weights as BF16 in VRAM (2 bytes/element instead of 4) doubles effective VRAM capacity: 82 projection weights fit in 8 GB instead of 38. The first implementation failed F64 validation because truncating both weights and activations to BF16 let drift cascade across 40 layers. The fix was to upcast each BF16 weight to a transient F32 copy per matmul and keep activations in F32, which dropped drift from 2.33 to 7.31e-4 (a 3,190× improvement on SmolLM2).

7B result: 6.26 s/token vs 8.22 s/token (M6 baseline), 1.31× faster

13B result: 27.0 s/token vs 36.6 s/token (M7.3 baseline), 1.36× faster

ADR-004: 4/4 models pass · max drift 2.4e-2 · margin 21× over threshold
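The difference between the failed first attempt and the fix can be shown on a CPU-side toy. This is a minimal sketch, not the engine's GPU kernel: it assumes BF16 is held as raw `u16` bits with round-to-nearest-even conversion, and all function names are hypothetical.

```rust
/// Round an f32 to BF16 (upper 16 bits), using round-to-nearest-even.
pub fn f32_to_bf16(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep NaN a NaN by forcing a mantissa bit on.
        return ((bits >> 16) as u16) | 0x0040;
    }
    let rounding_bias = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(rounding_bias) >> 16) as u16
}

/// Upcast BF16 bits back to f32; this direction is exact.
pub fn bf16_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

/// The fix described above: weights stay BF16 in storage and are upcast
/// transiently per multiply, while activations `x` stay in F32.
pub fn matvec_bf16_weights(w: &[u16], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(&wb, &xv)| bf16_to_f32(wb) * xv)
                .sum::<f32>()
        })
        .collect()
}

/// The failed variant: activations are also rounded to BF16 before the
/// multiply, so every layer discards activation precision.
pub fn matvec_bf16_both(w: &[u16], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    let x_rounded: Vec<f32> = x.iter().map(|&v| bf16_to_f32(f32_to_bf16(v))).collect();
    matvec_bf16_weights(w, rows, cols, &x_rounded)
}
```

With a single weight of 1.0 and an activation of 1.01171875, the weights-only variant returns the activation unchanged, while the both-truncated variant returns 1.015625. That per-element rounding is the kind of error that compounds when repeated layer after layer.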

Multi-family support (detail)

Each model family — Llama, Qwen, Gemma, Phi, Mistral, SmolLM, Falcon3 — lives in its own adapter module. Family-specific behaviour (Phi-3’s LongRoPE, Gemma 2’s dual-norm and soft-cap, Qwen’s QKV biases, GGUF’s fused-weight conventions) stays contained there; the execution core never learns which family it is running. Adding a new family is a contained change, not a core modification.

Adapter Toolkit v2 takes this one step further: a model can be described in a small YAML file and validated with `atenia load`, with no Rust and no recompilation. Classic Falcon, mixture-of-experts and multimodal models are explicitly out of scope and fail loudly rather than producing wrong output.
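One way to picture the adapter boundary is a trait that reports family quirks as plain data, so the core branches on capabilities rather than on family names. The sketch below is an assumption about the shape of such a design, not the engine's real module layout: `FamilyAdapter`, `FamilyQuirks`, `adapter_for`, and `finalize_logits` are hypothetical, and the soft-cap value of 30.0 is illustrative.

```rust
/// Family-specific switches the core consumes. Hypothetical type.
#[derive(Debug, Clone, PartialEq)]
pub struct FamilyQuirks {
    pub qkv_bias: bool,
    pub logit_soft_cap: Option<f32>,
}

/// Each family implements this in its own module (assumed design).
pub trait FamilyAdapter {
    fn name(&self) -> &'static str;
    fn quirks(&self) -> FamilyQuirks;
}

pub struct Qwen;
pub struct Gemma2;

impl FamilyAdapter for Qwen {
    fn name(&self) -> &'static str { "qwen" }
    fn quirks(&self) -> FamilyQuirks {
        FamilyQuirks { qkv_bias: true, logit_soft_cap: None }
    }
}

impl FamilyAdapter for Gemma2 {
    fn name(&self) -> &'static str { "gemma2" }
    fn quirks(&self) -> FamilyQuirks {
        // Illustrative cap value, labeled as an assumption.
        FamilyQuirks { qkv_bias: false, logit_soft_cap: Some(30.0) }
    }
}

/// Unsupported families return an error instead of a best-effort guess.
pub fn adapter_for(family: &str) -> Result<Box<dyn FamilyAdapter>, String> {
    match family {
        "qwen" => Ok(Box::new(Qwen)),
        "gemma2" => Ok(Box::new(Gemma2)),
        other => Err(format!("unsupported model family: {other}")),
    }
}

/// Core-side step: it sees only the quirks, never the family name.
pub fn finalize_logits(quirks: &FamilyQuirks, logits: &mut [f32]) {
    if let Some(cap) = quirks.logit_soft_cap {
        for l in logits.iter_mut() {
            *l = cap * (*l / cap).tanh();
        }
    }
}
```

Under this shape, adding a family means adding one `impl` and one match arm, and a request for an out-of-scope architecture surfaces as an `Err` at load time.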

Try it yourself

git clone https://github.com/AteniaEngine/ateniaengine.git
cd ateniaengine
cargo install --path .

# Check your machine is ready
atenia doctor

# Download a small model and chat with it
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
    --local-dir ./models/llama-3.2-1b-instruct
atenia chat --model ./models/llama-3.2-1b-instruct

Current focus

In progress

Performance optimization

With seven model families validated, a declarative adapter toolkit, and a full command-line interface in place, the engine is functionally broad. The current focus shifts to making it fast: tightening the beyond-VRAM execution path, the disk-to-GPU streaming pipeline, and the per-tier kernel dispatch.

Correctness stays locked first. Every optimization is measured against the same F64 reference and the same executable tests — speed is never bought with silent numeric drift.

Target: lower per-token latency on 13B beyond-VRAM execution

Constraint: ADR-004 numeric contract preserved, no exceptions
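A check of the kind described here, comparing an optimized output against an F64 reference, might look like the following sketch. It assumes the contract is expressed as a maximum absolute drift plus an unchanged argmax; the real ADR-004 definition and threshold are not given on this page, so `max_drift` is a caller-supplied, hypothetical parameter.

```rust
/// Largest absolute difference between an F32 result and its F64 reference.
pub fn max_abs_drift(candidate: &[f32], reference: &[f64]) -> f64 {
    assert_eq!(candidate.len(), reference.len());
    candidate
        .iter()
        .zip(reference)
        .map(|(&c, &r)| (c as f64 - r).abs())
        .fold(0.0, f64::max)
}

/// Index of the largest value, or None for an empty slice.
pub fn argmax(values: &[f64]) -> Option<usize> {
    values
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Passes only if drift stays within the bound and the predicted token
/// (argmax of the logits) is unchanged.
pub fn passes_numeric_contract(candidate: &[f32], reference: &[f64], max_drift: f64) -> bool {
    let widened: Vec<f64> = candidate.iter().map(|&c| c as f64).collect();
    max_abs_drift(candidate, reference) <= max_drift && argmax(&widened) == argmax(reference)
}
```

Running the same check before and after each optimization gives a regression gate: a faster kernel that moves the argmax or exceeds the drift bound is rejected regardless of its speedup.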

Future directions — v21 through v25

v21 Production execution guards. Adaptive memory-pressure thresholds calibrated against real workload envelopes. Verdict stability under noisy signals. Structured logging and replay harnesses.
v22 Multi-vendor backend foundation. Vendor-neutral abstraction for hardware probes and kernel compilation. First target: NVIDIA discrete + Intel iGPU coexistence — a common laptop configuration other runtimes treat as single-backend.
v23 AMD ROCm backend. Substantial differences in driver model, memory management, and sync primitives — scoped as its own milestone.
v24 Apple Metal backend. Unified memory model, Metal Shading Language, Xcode-centric toolchain.
v25 Distributed execution. Multi-host execution reasoning. Out of scope until single-host execution is mature across vendors.

Explicit non-goals

These are not missing features. They are deliberate boundaries.

Embedding machine learning into the execution control path
Modifying model semantics, numerical results, or training dynamics
Competing as a replacement for major ML frameworks
Performance-at-all-costs optimization that sacrifices stability
Opaque, black-box adaptation mechanisms
Mixture-of-experts, multimodal and encoder-decoder architectures

The capabilities listed as completed are backed by executable tests in the public repository. Any future direction must meet the same standard: observable execution behavior, reproducible tests, and stability under reality.
