Roadmap — Atenia Engine

What is built, what is being built, and what comes next.

Only capabilities backed by executable tests are documented as completed. Future directions are non-binding and subject to the same standard: observable behavior, reproducible tests, stability under reality.

Version | What it closed | Key result
v13–v19 | Execution scaffolding: policy layer, guards, contracts, memory telemetry, sensor-to-decision pipeline | Live hardware feedback loop
M4–M4.6 | Safetensors loader, TinyLlama end-to-end, Llama-family expansion (4 models F64-validated) | 4,096×–9,692× closer to F64 truth than PyTorch BF16
M4.7 | Beyond-VRAM execution: Llama 2 13B on 8 GB VRAM + 32 GB RAM + NVMe | [PASS] ✓ argmax bit-exact pre/post spill
M4.8 | AVX2+FMA matmul dispatcher: SIMD + matrixmultiply | 49.5× on production shape · 13B: 18.75 min → 5.38 min
M4.9 · M5 | Public CLI · tokenizer + KV cache + autoregressive generation | atenia run · atenia generate: coherent output
M6–M7 | Tier-aware GPU loader: VRAM → RAM → NVMe automatic placement | 1.46× on 7B · 13B on 32 GB without BSOD
M8 | BF16-resident VRAM kernels: double effective VRAM capacity | 1.31× on 7B · 1.36× on 13B · ADR-004 preserved
M8.7 | Disk → GPU JIT streaming pipeline: async NVMe read + PCIe upload + GPU matmul | 154 weights/forward · 98.7 % prefetch hit rate
Multi-family | Llama, Qwen, Gemma, Phi, Mistral, SmolLM and Falcon3 validated end-to-end | 7 families · safetensors + GGUF
Adapter Toolkit v2 | Declarative YAML adapter specs: describe a model without writing Rust | load · inspect · debug · v1 compatibility preserved
CLI | Human errors, stable exit codes, logging levels, host/model diagnostics, interactive chat | generate · chat · doctor · diagnose · capabilities
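
The tier-aware placement in the M6–M7 row (VRAM → RAM → NVMe) can be pictured as a greedy fill. The following is a minimal sketch under stated assumptions, not the engine's loader: the `Tier`, `Budget` and `place` names are hypothetical, and the real placement policy may weigh more than tensor size.

```rust
/// Storage tiers, fastest first. The tier names mirror the roadmap;
/// every type and function here is a hypothetical illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Vram,
    Ram,
    Nvme,
}

/// Remaining capacity of each tier, in bytes.
pub struct Budget {
    pub vram: u64,
    pub ram: u64,
    pub nvme: u64,
}

/// Greedily assign each tensor (given by its size in bytes, in load order)
/// to the fastest tier that still has room. Returns `None` if some tensor
/// fits nowhere, so the caller can refuse to run instead of overcommitting.
pub fn place(sizes: &[u64], mut budget: Budget) -> Option<Vec<Tier>> {
    let mut plan = Vec::with_capacity(sizes.len());
    for &size in sizes {
        let tier = if size <= budget.vram {
            budget.vram -= size;
            Tier::Vram
        } else if size <= budget.ram {
            budget.ram -= size;
            Tier::Ram
        } else if size <= budget.nvme {
            budget.nvme -= size;
            Tier::Nvme
        } else {
            return None;
        };
        plan.push(tier);
    }
    Some(plan)
}
```

With four 4-byte tensors and budgets of 8, 4 and 100 bytes, the first two land in VRAM, the third in RAM and the fourth on NVMe.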

M8 — BF16 VRAM kernels (detail)

Storing weights as BF16 in VRAM (2 bytes per element instead of 4) doubles effective VRAM capacity: 82 projection weights fit in 8 GB instead of 38. The first implementation failed F64 validation because truncating both weights and activations to BF16 let drift cascade across 40 layers. The fix was to upcast each BF16 weight to F32 transiently per matmul and keep activations in F32, which cut drift from 2.33 to 7.31e-4 (a 3,190× improvement on SmolLM2).

7B result: 6.26 s/token vs 8.22 s/token (M6 baseline), 1.31×
13B result: 27.0 s/token vs 36.6 s/token (M7.3 baseline), 1.36×
ADR-004: 4/4 models pass · max drift 2.4e-2 · margin 21× over threshold
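
To make the difference between the rejected approach and the fix concrete, here is a small CPU-only sketch. It is an illustration under assumptions, not the engine's GPU kernel: all function names are hypothetical, and the conversion uses plain truncation, which may differ from the rounding mode the real kernels use.

```rust
/// Truncate an f32 to bfloat16 by keeping only the top 16 bits
/// (sign, 8 exponent bits, 7 mantissa bits). Truncation rather than
/// round-to-nearest is an assumption of this sketch.
pub fn f32_to_bf16(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

/// Upcast a stored bfloat16 value back to f32. This direction is exact.
pub fn bf16_to_f32(h: u16) -> f32 {
    f32::from_bits((h as u32) << 16)
}

/// Dot product with BF16-resident weights upcast transiently to F32 and
/// activations kept in F32 throughout (the shape of the fix).
pub fn dot_weights_bf16(weights: &[u16], activations: &[f32]) -> f32 {
    weights
        .iter()
        .zip(activations)
        .map(|(&w, &a)| bf16_to_f32(w) * a)
        .sum()
}

/// Dot product where activations are also truncated to BF16 first
/// (the shape of the first, rejected implementation).
pub fn dot_both_bf16(weights: &[u16], activations: &[f32]) -> f32 {
    weights
        .iter()
        .zip(activations)
        .map(|(&w, &a)| bf16_to_f32(w) * bf16_to_f32(f32_to_bf16(a)))
        .sum()
}

/// Bytes needed to keep `elements` weights resident at a given width.
pub fn resident_bytes(elements: u64, bytes_per_element: u64) -> u64 {
    elements * bytes_per_element
}
```

In the second function every activation loses its low 16 mantissa bits before the multiply. That per-element loss is what can compound when one layer's output feeds the next, while the first function pays only the one-time cost of storing the weights in BF16.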

Multi-family support (detail)

Each model family — Llama, Qwen, Gemma, Phi, Mistral, SmolLM, Falcon3 — lives in its own adapter module. Family-specific behaviour (Phi-3’s LongRoPE, Gemma 2’s dual-norm and soft-cap, Qwen’s QKV biases, GGUF’s fused-weight conventions) stays contained there; the execution core never learns which family it is running. Adding a new family is a contained change, not a core modification.
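
One way to picture this containment is a trait boundary. The sketch below is hypothetical throughout: `FamilyAdapter`, `LayerPlan` and the struct names are invented for illustration and are not taken from the Atenia codebase. It only shows how family quirks can be translated into a family-agnostic description before the core sees them.

```rust
/// Family-agnostic description handed to the core (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
pub struct LayerPlan {
    pub qkv_bias: bool,
    pub attn_logit_softcap: Option<f32>,
}

/// What the core requires from any model family (hypothetical trait).
pub trait FamilyAdapter {
    fn name(&self) -> &'static str;
    /// Translate family quirks into a family-agnostic plan.
    fn layer_plan(&self) -> LayerPlan;
}

pub struct Qwen;

pub struct Gemma2 {
    /// Soft-cap value read from the model config, not hard-coded here.
    pub softcap: f32,
}

impl FamilyAdapter for Qwen {
    fn name(&self) -> &'static str {
        "qwen"
    }
    fn layer_plan(&self) -> LayerPlan {
        LayerPlan { qkv_bias: true, attn_logit_softcap: None }
    }
}

impl FamilyAdapter for Gemma2 {
    fn name(&self) -> &'static str {
        "gemma2"
    }
    fn layer_plan(&self) -> LayerPlan {
        LayerPlan { qkv_bias: false, attn_logit_softcap: Some(self.softcap) }
    }
}

/// Stand-in for the core: it sees only the trait object and the plan it
/// returns, never a concrete family type.
pub fn describe(adapter: &dyn FamilyAdapter) -> String {
    let plan = adapter.layer_plan();
    format!(
        "{}: bias={} softcap={:?}",
        adapter.name(),
        plan.qkv_bias,
        plan.attn_logit_softcap
    )
}
```

Under this layout, adding a family means adding one more `impl FamilyAdapter` block; `describe` and anything else written against the trait stays untouched.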

Adapter Toolkit v2 takes this one step further: a model can be described in a small YAML file and validated with atenia load — no Rust, no recompilation. Classic Falcon, mixture-of-experts and multimodal models are explicitly out of scope and fail loud rather than producing wrong output.
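
The "fail loud" behaviour can be sketched as an explicit scope check that returns an error instead of attempting a best-effort load. The enum variants, error type and function name below are assumptions made for illustration; the toolkit's real validation and its YAML schema are not shown here.

```rust
use std::fmt;

/// Coarse architecture classes (hypothetical; chosen to mirror the
/// out-of-scope list in the text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Architecture {
    DenseDecoder,
    MixtureOfExperts,
    Multimodal,
}

/// Error returned when a model is outside the supported scope.
#[derive(Debug, PartialEq, Eq)]
pub struct Unsupported(pub Architecture);

impl fmt::Display for Unsupported {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unsupported architecture: {:?}", self.0)
    }
}

impl std::error::Error for Unsupported {}

/// Accept only architectures known to be handled; reject everything else
/// up front rather than risk producing wrong output later.
pub fn check_scope(arch: Architecture) -> Result<(), Unsupported> {
    match arch {
        Architecture::DenseDecoder => Ok(()),
        other => Err(Unsupported(other)),
    }
}
```

The design choice is that the rejection happens before any weights are touched, so an unsupported model never reaches the execution core.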

Try it yourself

git clone https://github.com/AteniaEngine/ateniaengine.git
cd ateniaengine
cargo install --path .

# Check your machine is ready
atenia doctor

# Download a small model and chat with it
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
    --local-dir ./models/llama-3.2-1b-instruct
atenia chat --model ./models/llama-3.2-1b-instruct

In progress

Performance optimization

With seven model families validated, a declarative adapter toolkit, and a full command-line interface in place, the engine is functionally broad. The current focus shifts to making it fast: tightening the beyond-VRAM execution path, the disk-to-GPU streaming pipeline, and the per-tier kernel dispatch.
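
The general idea behind a streaming pipeline, overlapping the load of the next weight with compute on the current one, can be sketched on the CPU with a bounded channel. This is only an analogy under assumptions: `load_weight` and `consume` are hypothetical stand-ins, and the engine's actual path (NVMe read, PCIe upload, GPU matmul) is not reproduced here.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Stand-in for reading one weight tensor from disk (hypothetical).
fn load_weight(index: usize) -> Vec<f32> {
    vec![index as f32; 4]
}

/// Stand-in for the compute step that consumes a weight (hypothetical).
fn consume(weight: &[f32]) -> f32 {
    weight.iter().sum()
}

/// Run `count` weights through a two-stage pipeline. The bounded channel
/// holds at most `depth` loaded weights, so the loader can run ahead of
/// the consumer without unbounded memory growth.
pub fn run_pipeline(count: usize, depth: usize) -> Vec<f32> {
    let (tx, rx) = sync_channel::<Vec<f32>>(depth);
    let loader = thread::spawn(move || {
        for i in 0..count {
            // `send` blocks while the queue is full.
            if tx.send(load_weight(i)).is_err() {
                break; // the consumer went away
            }
        }
        // `tx` is dropped here, which ends the consumer's iteration.
    });
    let results: Vec<f32> = rx.iter().map(|w| consume(&w)).collect();
    loader.join().expect("loader thread panicked");
    results
}
```

The `depth` parameter is the knob such a design exposes: a deeper queue hides more load latency at the cost of holding more weights in memory at once.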

Correctness stays locked first. Every optimization is measured against the same F64 reference and the same executable tests — speed is never bought with silent numeric drift.
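
A drift check of this kind can be as small as a maximum absolute difference against the reference. The sketch below assumes that metric and takes the threshold as a parameter; the actual metric and threshold defined by ADR-004 are not stated in this document, and both function names are hypothetical.

```rust
/// Largest absolute difference between candidate outputs (f32) and an
/// F64 reference, computed in f64.
pub fn max_abs_drift(candidate: &[f32], reference: &[f64]) -> f64 {
    assert_eq!(candidate.len(), reference.len(), "length mismatch");
    candidate
        .iter()
        .zip(reference)
        .map(|(&c, &r)| (c as f64 - r).abs())
        .fold(0.0, f64::max)
}

/// True when the drift stays at or below `threshold` (a caller-supplied,
/// hypothetical value in this sketch).
pub fn within_threshold(candidate: &[f32], reference: &[f64], threshold: f64) -> bool {
    max_abs_drift(candidate, reference) <= threshold
}
```

Running the same check before and after each optimization turns "no silent numeric drift" into a pass/fail gate rather than a judgment call.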

Target: Lower per-token latency on 13B beyond-VRAM execution
Constraint: ADR-004 numeric contract preserved, no exceptions

What comes next

  • v21 Production execution guards. Adaptive memory-pressure thresholds calibrated against real workload envelopes. Verdict stability under noisy signals. Structured logging and replay harnesses.
  • v22 Multi-vendor backend foundation. Vendor-neutral abstraction for hardware probes and kernel compilation. First target: NVIDIA discrete + Intel iGPU coexistence — a common laptop configuration other runtimes treat as single-backend.
  • v23 AMD ROCm backend. Substantial differences in driver model, memory management, and sync primitives — scoped as its own milestone.
  • v24 Apple Metal backend. Unified memory model, Metal Shading Language, Xcode-centric toolchain.
  • v25 Distributed execution. Multi-host execution reasoning. Out of scope until single-host execution is mature across vendors.

Out of scope

The items below are not missing features. They are deliberate boundaries.

  • Embedding machine learning into the execution control path
  • Modifying model semantics, numerical results, or training dynamics
  • Competing as a replacement for major ML frameworks
  • Performance-at-all-costs optimization that sacrifices stability
  • Opaque, black-box adaptation mechanisms
  • Mixture-of-experts, multimodal and encoder-decoder architectures

The capabilities listed as completed are backed by executable tests in the public repository. Any future direction must meet the same standard: observable execution behavior, reproducible tests, and stability under reality.
