
Inference

Chat, completion, embeddings, vision, streaming, and model backends

Backends

A3S Power uses a pluggable Backend trait; three implementations are available.
Chat Completion

# Non-streaming
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Rust?"}
    ]
  }'

# Streaming (token-by-token SSE)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Streaming delivers each token as it is generated via SSE (data: {...}). The connection closes with data: [DONE] when generation completes.
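Consuming the stream amounts to reading `data:` lines, stopping at `[DONE]`, and extracting each chunk's content delta. A minimal sketch, assuming the chunks follow the usual OpenAI-style `choices[0].delta.content` shape (the helper name is illustrative, not part of Power's API):

```python
import json

def parse_sse_stream(lines):
    """Yield content deltas from an OpenAI-style SSE stream.

    Each event line looks like 'data: {...}'; the stream ends
    with 'data: [DONE]'.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank lines / keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# Replaying a captured stream:
events = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    'data: [DONE]',
]
text = "".join(parse_sse_stream(events))  # "Hello!"
```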

Text Completion

curl http://localhost:11434/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "prompt": "The quick brown fox",
    "max_tokens": 50
  }'

Embeddings

# Single input
curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-embed", "input": "Hello world"}'

# Batch input
curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-embed", "input": ["Hello world", "Goodbye world"]}'

Register an embedding model with format=huggingface (see Models).

Vision / Multimodal

Register a vision model with format=vision, then pass images in the request:

# Base64 image via images field
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava:7b",
    "messages": [{"role": "user", "content": "What is in this image?"}],
    "images": ["iVBORw0KGgoAAAANSUhEUgAA..."]
  }'

# OpenAI-style image_url content parts
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava:7b",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
      ]
    }]
  }'

Supported image formats: JPEG, PNG, WebP.

Generation Parameters

Generation parameters are passed directly in the request body.
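As an illustration only, these are the sampling parameters commonly accepted by OpenAI-compatible endpoints; which of them Power actually supports is an assumption, not something this page confirms:

```json
{
  "model": "llama3.2:3b",
  "messages": [{"role": "user", "content": "Hello!"}],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 256,
  "stop": ["\n\n"],
  "seed": 42
}
```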

Chat Templates

Power renders chat templates using Jinja2 (via minijinja). Templates are read from GGUF metadata or can be overridden in the model manifest via template_override.

Supported formats: ChatML, Llama 3, Mistral, Gemma, Phi, and custom Jinja2 templates.
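To make the template step concrete, here is what ChatML rendering produces, sketched with plain string formatting standing in for the Jinja2 template (the function name is hypothetical; the `<|im_start|>`/`<|im_end|>` markers are the standard ChatML delimiters):

```python
def render_chatml(messages):
    """Render a message list in ChatML format.

    Each turn is wrapped in <|im_start|>role ... <|im_end|> markers,
    and a trailing assistant header cues the model to respond.
    """
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    out.append("<|im_start|>assistant\n")
    return "".join(out)

prompt = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Rust?"},
])
```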

KV Cache Reuse

When a new request shares a prefix with a previous request (same system prompt + conversation history), the cached KV state is reused. This provides significant speedup for multi-turn conversations and repeated system prompts.
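The core of prefix reuse is comparing the incoming token sequence against what is already cached and re-running the forward pass only on the new suffix. A minimal sketch of that bookkeeping (names and token values are illustrative):

```python
def shared_prefix_len(cached_tokens, new_tokens):
    """Length of the longest common prefix between two token sequences."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# A multi-turn request repeats the system prompt and earlier turns,
# so only the new suffix needs a fresh prefill.
cached = [1, 2, 3, 4, 5]        # tokens whose KV state is already cached
request = [1, 2, 3, 4, 9, 10]   # same prefix, new user turn
reuse = shared_prefix_len(cached, request)
to_prefill = request[reuse:]    # only these tokens must be recomputed
```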

Thinking / Reasoning Models

DeepSeek-R1, QwQ, and other reasoning models emit <think>...</think> blocks. Power's streaming parser separates thinking content from the final response, forwarding each as distinct chunks.
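The separation can be sketched as a small state machine that routes text to a "thinking" or "answer" channel as `<think>` tags toggle. This simplified version assumes each tag arrives unsplit within a chunk; a production streaming parser must also buffer partial tags across chunk boundaries:

```python
class ThinkSplitter:
    """Split generated text into 'thinking' and 'answer' channels."""

    def __init__(self):
        self.in_think = False

    def feed(self, chunk):
        events = []
        while chunk:
            # Look for the tag that would flip the current state.
            tag = "</think>" if self.in_think else "<think>"
            head, sep, chunk = chunk.partition(tag)
            if head:
                channel = "thinking" if self.in_think else "answer"
                events.append((channel, head))
            if sep:
                self.in_think = not self.in_think
        return events

s = ThinkSplitter()
out = s.feed("<think>Let me reason.</think>The answer is 4.")
```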

Layer-Streaming Inference (picolm)

Traditional LLM inference loads the entire model into RAM before generating a single token. A 7B Q4_K_M model needs ~4 GB. Inside a TEE, the Enclave Page Cache (EPC) is often limited to 512 MB–1 GB. The model simply doesn't fit.

picolm solves this with layer-streaming: instead of loading all weights at once, it memory-maps the GGUF file and processes one transformer layer at a time. Only the current layer's weights occupy physical RAM.

How It Works

The implementation has two components:

gguf_stream.rs — Zero-Copy GGUF Parser: Opens the GGUF file via mmap(PROT_READ). Parses the header and tensor descriptors but loads no weight data. When picolm requests a layer's weights, tensor_bytes(name) returns a &[u8] slice directly into the mmap region — zero copy, zero allocation. The OS pages in data on demand and reclaims it under memory pressure.

picolm.rs — Layer-Streaming Forward Pass: Iterates through each transformer layer (blk.0.* to blk.{n-1}.*), applying weights to the hidden state, then moving on. After processing layer N, its weight pages are no longer referenced and the OS can reclaim them before layer N+1 is paged in.
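The interplay of the two components can be sketched with Python's stdlib mmap. The toy file format below stands in for GGUF, and all names are illustrative; only the mechanism (map read-only, take zero-copy views, touch one layer at a time, release) mirrors the description above:

```python
import mmap, os, tempfile

# Build a toy "model file": 4 layers of 1 KiB of weights each.
n_layers, layer_size = 4, 1024
path = os.path.join(tempfile.mkdtemp(), "toy.bin")
with open(path, "wb") as f:
    for i in range(n_layers):
        f.write(bytes([i]) * layer_size)

# Memory-map the file read-only: no weight data is loaded up front.
f = open(path, "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def tensor_bytes(layer):
    """Zero-copy view into the mapped file, like gguf_stream's tensor_bytes."""
    return memoryview(mm)[layer * layer_size:(layer + 1) * layer_size]

# Layer-streaming "forward pass": touch one layer's weights at a time.
checksum = 0
for layer in range(n_layers):
    w = tensor_bytes(layer)  # pages are faulted in on demand
    checksum += w[0]         # stand-in for applying the weights
    w.release()              # drop the view; the OS may reclaim the pages
```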

Memory Comparison

| Model               | Traditional | picolm  | Reduction |
|---------------------|-------------|---------|-----------|
| 3B Q4_K_M (~2 GB)   | ~2 GB       | ~60 MB  | 33×       |
| 7B Q4_K_M (~4 GB)   | ~4 GB       | ~120 MB | 33×       |
| 13B Q4_K_M (~7 GB)  | ~7 GB       | ~200 MB | 35×       |
| 70B Q4_K_M (~40 GB) | ~40 GB      | ~1.1 GB | 36×       |

Encrypted Model Support

For encrypted models (.enc), LayerStreamingDecryptedModel decrypts one chunk at a time. Each chunk is wrapped in Zeroizing<Vec<u8>> — automatically zeroed when dropped. Only one layer's plaintext weights exist in RAM at any moment.
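The wipe-on-drop discipline can be sketched as a context manager. This is only an analogy for Rust's `Zeroizing`: Python cannot guarantee no copies of the plaintext were made, so the sketch mirrors the intent (zero the buffer the moment it goes out of scope), not the guarantee:

```python
import contextlib

@contextlib.contextmanager
def zeroizing(buf: bytearray):
    """Yield a mutable buffer and zero it on exit, like Rust's Zeroizing."""
    try:
        yield buf
    finally:
        for i in range(len(buf)):
            buf[i] = 0  # wipe the plaintext in place

chunk = bytearray(b"decrypted layer weights")  # stand-in for one chunk
with zeroizing(chunk) as w:
    first = w[0]  # use the plaintext while it exists
# After the block, chunk contains only zeros.
```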

Current Status

The layer-streaming infrastructure (mmap parser, layer iteration, sampling, encrypted streaming) is production-ready. The forward pass currently uses stub arithmetic — real transformer matrix operations are the next milestone.

Auto-Loading

When a request arrives for a model that isn't loaded, Power automatically loads it. If max_loaded_models is reached, the least-recently-used model is evicted first.
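The load-on-demand plus LRU-eviction policy described above can be sketched with an ordered map (class and parameter names are illustrative, not Power's actual internals):

```python
from collections import OrderedDict

class ModelPool:
    """Auto-loading model pool with least-recently-used eviction."""

    def __init__(self, max_loaded_models, load_fn):
        self.max = max_loaded_models
        self.load_fn = load_fn
        self.loaded = OrderedDict()  # name -> model, oldest first

    def get(self, name):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as recently used
            return self.loaded[name]
        if len(self.loaded) >= self.max:
            self.loaded.popitem(last=False)  # evict the LRU model
        self.loaded[name] = self.load_fn(name)  # auto-load on first request
        return self.loaded[name]

pool = ModelPool(2, load_fn=lambda n: f"<{n} weights>")
pool.get("llama3.2:3b")
pool.get("qwen3-embed")
pool.get("llama3.2:3b")  # refreshes recency
pool.get("llava:7b")     # pool is full: evicts qwen3-embed, the LRU entry
```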
