A3S Power

Models

Model formats, storage, HuggingFace Hub pull, and lifecycle management


Supported Formats

The registration examples below cover four formats: GGUF (no format field needed), plus the explicit safetensors, vision, and huggingface values (the last used here for an embedding model).
Register a Model

# GGUF model
curl -X POST http://localhost:11434/v1/models \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3.2:3b", "path": "/models/llama3.2-3b-q4_k_m.gguf"}'

# SafeTensors chat model (ISQ quantization on load)
curl -X POST http://localhost:11434/v1/models \
  -H "Content-Type: application/json" \
  -d '{
    "name": "qwen2.5:7b",
    "path": "/models/Qwen2.5-7B-Instruct",
    "format": "safetensors",
    "default_parameters": {"isq": "Q4K"}
  }'

# Vision model
curl -X POST http://localhost:11434/v1/models \
  -H "Content-Type: application/json" \
  -d '{"name": "llava:7b", "path": "/models/llava-7b", "format": "vision"}'

# Embedding model
curl -X POST http://localhost:11434/v1/models \
  -H "Content-Type: application/json" \
  -d '{"name": "qwen3-embed", "path": "/models/Qwen3-Embedding", "format": "huggingface"}'

Pull from HuggingFace Hub

Requires the hf feature (included in default builds).

# By quantization tag — resolves filename via HF API
curl -N http://localhost:11434/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"name": "bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M"}'

# By exact filename
curl -N http://localhost:11434/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"name": "bartowski/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_M.gguf"}'

# Private/gated model with HF token
curl -N http://localhost:11434/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"name": "meta-llama/Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Q4_K_M.gguf", "token": "hf_..."}'

# Force re-download
curl -N http://localhost:11434/v1/models/pull \
  -H "Content-Type: application/json" \
  -d '{"name": "bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M", "force": true}'

Set the HF_TOKEN environment variable as an alternative to passing the token in the request body.
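For example (the variable must be visible to the server process, not the curl client; the token value below is a placeholder):

```shell
# Placeholder value; real tokens come from your huggingface.co account settings.
export HF_TOKEN="hf_xxxxxxxxxxxx"

# Start (or restart) the server from this shell so it inherits the variable;
# pull requests can then omit the "token" field from the JSON body.
echo "token set: $(printf '%s' "$HF_TOKEN" | cut -c1-3)..."
```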

SSE Progress Events

data: {"status":"resuming","offset":104857600,"total":2147483648}
data: {"status":"downloading","completed":524288000,"total":2147483648}
data: {"status":"verifying"}
data: {"status":"success","id":"bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M","object":"model"}

Interrupted downloads resume automatically — the partial file is identified by a SHA-256 of the download URL and resumed via HTTP Range requests.
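The lookup can be reproduced by hand. A sketch, assuming a resolve-style download URL (the exact URL the server constructs is an implementation detail, so the one below is illustrative):

```shell
# Hash the download URL the same way the server does to locate the
# partial-download state file under ~/.a3s/power/pulls/.
url="https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf"
key=$(printf '%s' "$url" | sha256sum | cut -d' ' -f1)
echo "~/.a3s/power/pulls/${key}.json"
```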

Check Pull Status

# Get persisted pull progress (survives server restarts)
curl http://localhost:11434/v1/models/pull/bartowski%2FLlama-3.2-3B-Instruct-GGUF:Q4_K_M/status
{"name": "bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M", "status": "pulling", "completed": 524288000, "total": 2147483648}
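The completed and total fields are plain byte counts, so progress percentage is simple integer arithmetic (values below are copied from the example response; in practice you would read them from the endpoint):

```shell
# Byte counters from the example status response above.
completed=524288000
total=2147483648
pct=$(( completed * 100 / total ))
echo "${pct}% pulled"   # → 24% pulled
```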

List and Inspect Models

# List all registered models
curl http://localhost:11434/v1/models

# Get a specific model
curl http://localhost:11434/v1/models/llama3.2:3b

Delete a Model

curl -X DELETE http://localhost:11434/v1/models/llama3.2:3b

Storage Layout

~/.a3s/power/
├── models/
│   ├── manifests/           # JSON metadata per model
│   │   ├── llama3.2-3b.json
│   │   └── qwen2.5-7b.json
│   └── blobs/               # Content-addressed by SHA-256
│       ├── sha256-a1b2c3...  # Model weights
│       └── sha256-d4e5f6...
└── pulls/                   # Pull progress state (resume)
    └── <sha256-of-url>.json

Model blobs are stored by SHA-256 hash — identical weights shared across model names are stored only once.
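Content addressing is easy to verify locally. A sketch (the files and paths are illustrative; the blob filename prefix follows the layout above):

```shell
# Two registrations whose weight bytes are identical hash to the same
# blob name, so the store keeps a single copy.
printf 'identical weight bytes' > /tmp/a.bin
printf 'identical weight bytes' > /tmp/b.bin
ha=$(sha256sum /tmp/a.bin | cut -d' ' -f1)
hb=$(sha256sum /tmp/b.bin | cut -d' ' -f1)
[ "$ha" = "$hb" ] && echo "both names map to blobs/sha256-${ha}"
```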

ISQ Quantization Types (SafeTensors)

ISQ (in-situ quantization) quantizes SafeTensors weights as they are loaded. Select the type through default_parameters, e.g. {"isq": "Q4K"} as in the registration example above.

Model Lifecycle

Register/Pull → Auto-load on first request → Serve → Idle → Evict (LRU)
                        ↑                                        │
                        └────────────────────────────────────────┘
  • Models load automatically on first inference request
  • Idle models unload after keep_alive duration (default 5m)
  • When max_loaded_models is reached, the least-recently-used model is evicted
