Multi-Machine
Placing orchestration steps across machines via the AgentExecutor seam
A3S Code runs multi-agent orchestration as a grammar expressed in code, then
places the resulting steps wherever you want them to run. The split is drawn
along the framework / host boundary, introduced in [3.4.0]:
- The framework owns the orchestration grammar and the serializable data contracts. It never decides where a step runs.
- The host owns placement, transport, and scheduling — which node executes a step, how the spec gets there, and how concurrency maps to a cluster.
The single point of contact between the two is one trait, AgentExecutor.
The framework / host boundary
The framework's contract is two serializable types:
- AgentStepSpec: what to run, independent of where. Fields: task_id, agent, description, prompt, and optional max_steps, parent_session_id, output_schema.
- StepOutcome: the result of running one spec. Fields: task_id, session_id, agent, output, success, and optional structured.
Both serialize cleanly, so a host may ship a spec to another node and persist an
outcome in a checkpoint. The combinators that compose specs are written purely
against the AgentExecutor trait and never observe where a step ran, so the
same orchestration scales from one process to a cluster without changing.
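
To make the serialization point concrete, the two contracts can be modeled as plain records that round-trip through JSON. This is a minimal sketch, not the SDK's own types: the dataclasses below only mirror the field lists above, and the helpers `to_wire` and `from_wire` are hypothetical names introduced for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class AgentStepSpec:
    """Hypothetical mirror of the spec fields listed above: what to run."""
    task_id: str
    agent: str
    description: str
    prompt: str
    max_steps: Optional[int] = None
    parent_session_id: Optional[str] = None
    output_schema: Optional[dict] = None


@dataclass
class StepOutcome:
    """Hypothetical mirror of the outcome fields listed above."""
    task_id: str
    session_id: str
    agent: str
    output: str
    success: bool
    structured: Optional[Any] = None


def to_wire(record) -> str:
    """Serialize a spec or outcome to JSON, e.g. to ship it to another node."""
    return json.dumps(asdict(record))


def from_wire(cls, payload: str):
    """Rebuild a record of type `cls` from its JSON form."""
    return cls(**json.loads(payload))


spec = AgentStepSpec(task_id="a", agent="explore", description="survey",
                     prompt="Map the auth module")
restored = from_wire(AgentStepSpec, to_wire(spec))
```

Because nothing in either record refers to a process, socket, or node, the same JSON payload can be sent over any transport or written into a checkpoint.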
The AgentExecutor seam
AgentExecutor is the boundary between the grammar and the host:
combinators (parallel / pipeline / resumable)
  -> AgentExecutor::execute_step(spec) -> StepOutcome
       ├─ in-box TaskExecutor: runs the step as a local child agent
       └─ host executor: places the step on a remote node

The in-box TaskExecutor runs each step as a child agent locally (in-process,
on Tokio), inheriting the session's agent registry, LLM client, workspace, MCP
tools, and subagent tracker. A host such as a cluster runtime substitutes its
own executor to place steps across a cluster; the combinators are unaffected.
concurrency_hint() is advisory, not a hard local bound. The local default
returns the session's max_parallel_tasks; a scheduler-backed host may return
its cluster-wide target. Because it is a hint rather than a ceiling,
orchestration scales past a single process.
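
The shape of the seam can be sketched in a few lines. The following is a hypothetical Python rendering, not the trait itself: `LocalExecutor`, `SchedulerExecutor`, and `run_each` are illustrative names, and both executors are stubs that return canned outcomes. It shows a combinator written only against `execute_step` and `concurrency_hint`, with the executor swapped underneath it.

```python
import asyncio
from typing import Protocol


class AgentExecutor(Protocol):
    """Hypothetical Python rendering of the seam: two methods, nothing else."""

    async def execute_step(self, spec: dict) -> dict: ...

    def concurrency_hint(self) -> int: ...


class LocalExecutor:
    """Stand-in for the in-box executor: the hint is the session's local limit."""

    def __init__(self, max_parallel_tasks: int) -> None:
        self.max_parallel_tasks = max_parallel_tasks

    async def execute_step(self, spec: dict) -> dict:
        return {"task_id": spec["task_id"], "output": "ran locally", "success": True}

    def concurrency_hint(self) -> int:
        return self.max_parallel_tasks


class SchedulerExecutor:
    """Stand-in for a scheduler-backed host: the hint is a cluster-wide target."""

    def __init__(self, cluster_target: int) -> None:
        self.cluster_target = cluster_target

    async def execute_step(self, spec: dict) -> dict:
        # A real host would ship the spec to a node here; this stub does not.
        return {"task_id": spec["task_id"], "output": "ran on the cluster", "success": True}

    def concurrency_hint(self) -> int:
        return self.cluster_target


async def run_each(executor: AgentExecutor, specs: list) -> list:
    """Toy combinator written only against the seam (sequential for brevity)."""
    return [await executor.execute_step(spec) for spec in specs]


specs = [{"task_id": "a"}, {"task_id": "b"}]
local_outcomes = asyncio.run(run_each(LocalExecutor(max_parallel_tasks=4), specs))
cluster_outcomes = asyncio.run(run_each(SchedulerExecutor(cluster_target=64), specs))
```

`run_each` is identical in both calls; only the object behind the seam changes, along with the value it reports as its hint.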
A session exposes the in-box seam directly:
- AgentSession::agent_executor() returns a session-backed AgentExecutor.
- AgentSession::session_store() returns the session's store (when one is configured), which the resumable combinator needs to journal progress.
The SDK grammar below calls agent_executor() for you; you only reach for these
when implementing or substituting a custom executor.
Parallel: barrier fan-out
execute_steps_parallel fans specs out across the executor and awaits all of
them (a barrier). Results preserve input order, a panicked branch becomes a
failed StepOutcome instead of dropping the batch, and concurrency is bounded
by the executor's concurrency hint.
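
The three guarantees in that paragraph (input order, isolated failures, bounded concurrency) can be modeled with asyncio. This is a sketch of the semantics under stated assumptions, not the framework's implementation: `parallel` and `demo_step` are hypothetical names, and a raised Python exception stands in for a panicked branch.

```python
import asyncio


async def parallel(execute_step, specs, concurrency_hint):
    """Barrier fan-out: bounded, order-preserving, failures isolated per branch."""
    limit = asyncio.Semaphore(max(1, concurrency_hint))

    async def branch(spec):
        async with limit:
            try:
                return await execute_step(spec)
            except Exception as err:  # a crashed branch becomes a failed outcome
                return {"task_id": spec["task_id"], "agent": spec["agent"],
                        "output": f"step failed: {err}", "success": False}

    # gather() waits for every branch (the barrier) and returns results in
    # the order the specs were given, regardless of completion order.
    return await asyncio.gather(*(branch(s) for s in specs))


async def demo_step(spec):
    """Hypothetical executor: step 'b' crashes, step 'a' finishes last."""
    if spec["task_id"] == "b":
        raise RuntimeError("boom")
    await asyncio.sleep(0.02 if spec["task_id"] == "a" else 0)
    return {"task_id": spec["task_id"], "agent": spec["agent"],
            "output": "ok", "success": True}


specs = [{"task_id": t, "agent": "explore"} for t in ("a", "b", "c")]
outcomes = asyncio.run(parallel(demo_step, specs, concurrency_hint=2))
```

Here the outcomes come back in the order a, b, c, with b marked unsuccessful and the other two intact. With the SDK, the same fan-out is a single call on the session: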
// Node
const outcomes = await session.parallel([
  { taskId: 'a', agent: 'explore', description: 'survey', prompt: 'Map the auth module' },
  { taskId: 'b', agent: 'review', description: 'audit', prompt: 'Review error handling' },
]);
for (const o of outcomes) {
  console.log(o.taskId, o.success, o.output);
}

# Python
outcomes = session.parallel([
    {"task_id": "a", "agent": "explore", "description": "survey", "prompt": "Map the auth module"},
    {"task_id": "b", "agent": "review", "description": "audit", "prompt": "Review error handling"},
])
for o in outcomes:
    print(o["task_id"], o["success"], o["output"])

Pipeline: per-item chains, no inter-stage barrier
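execute_pipeline runs each item through a chain of PipelineStages
independently. There is no barrier between stages (item A can be in stage 3
while item B is still in stage 1), so wall-clock time is that of the slowest
single chain, not the sum of the slowest step per stage. For example, if item A
takes 1 s then 5 s and item B takes 5 s then 1 s, both chains finish after 6 s,
where a per-stage barrier would take 5 + 5 = 10 s.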
A stage is a (ctx) => spec | null callback where ctx carries the previous
outcome and the original item. Return a spec to run the next step, or null to
stop that item's chain early; a chain also stops when a step fails (later stages
would only build on a failed result). The bridges fail closed: a stage that
hangs, returns null, or raises stops only its own chain.
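
The chain rules above can be modeled as follows. This is a hedged sketch rather than the framework's code: `run_chain`, `pipeline`, and `demo_step` are hypothetical names, the return shape (one list of outcomes per item) is an assumption, and the hang timeout is left out for brevity.

```python
import asyncio


async def run_chain(execute_step, item, stages):
    """Run one item through its stages; stop early on None, error, or failure."""
    outcomes, previous = [], None
    for stage in stages:
        try:
            spec = stage({"item": item, "previous": previous})
        except Exception:
            spec = None  # fail closed: a raising stage ends only this chain
        if spec is None:
            break
        previous = await execute_step(spec)
        outcomes.append(previous)
        if not previous["success"]:
            break  # later stages would only build on a failed result
    return outcomes


async def pipeline(execute_step, items, stages):
    """Each item gets its own chain; chains run concurrently, no stage barrier."""
    return await asyncio.gather(*(run_chain(execute_step, i, stages) for i in items))


async def demo_step(spec):
    """Hypothetical executor: upper-cases the prompt and always succeeds."""
    return {"task_id": spec["task_id"], "output": spec["prompt"].upper(), "success": True}


def survey(ctx):
    return {"task_id": "s1", "prompt": f"survey {ctx['item']}"}


def review(ctx):
    if ctx["item"] == "src/api":
        raise ValueError("stage bug")  # only this item's chain stops
    return {"task_id": "s2", "prompt": f"review {ctx['previous']['output']}"}


chains = asyncio.run(pipeline(demo_step, ["src/auth", "src/api"], [survey, review]))
```

The `src/auth` chain runs both stages, while the `src/api` chain stops after its first step because its second stage raised; neither chain waits on the other.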
A Node stage callback must not throw: wrap your logic in try/catch and return
null on error (the same constraint as setBudgetGuard). A stage that hangs past
the timeout fails closed for that chain. A Python stage that raises is caught
and treated as null.
// Node
const outcomes = await session.pipeline(['src/auth', 'src/api'], [
  (ctx) => ({ taskId: 's1', agent: 'explore', description: 'survey', prompt: `Survey ${ctx.item}` }),
  (ctx) => ctx.previous?.success
    ? { taskId: 's2', agent: 'review', description: 'review', prompt: `Review: ${ctx.previous.output}` }
    : null,
]);

# Python
def survey(ctx):
    return {"task_id": "s1", "agent": "explore", "description": "survey",
            "prompt": f"Survey {ctx['item']}"}

def review(ctx):
    prev = ctx["previous"]
    if prev and prev["success"]:
        return {"task_id": "s2", "agent": "review", "description": "review",
                "prompt": f"Review: {prev['output']}"}
    return None

outcomes = session.pipeline(["src/auth", "src/api"], [survey, review])

Resumable: cross-node resume
execute_steps_parallel_resumable is parallel plus a journal. At each step
boundary it writes a WorkflowCheckpoint to the SessionStore under a
workflowId. On resume it skips already-completed steps and re-dispatches the
rest. It records only successful steps, so a failed step retries on resume —
its effect did not complete.
A SessionStore is required. The Node bridge rejects with
"parallelResumable requires a sessionStore" when none is configured; Python
raises the equivalent.
// Node; assumes an existing `agent` instance
import { Agent, FileSessionStore } from '@a3s-lab/code';

const session = agent.session('/repo', {
  sessionStore: new FileSessionStore('./.a3s/sessions'),
});
const outcomes = await session.parallelResumable([
  { taskId: 'a', agent: 'explore', description: 'survey', prompt: 'Map the auth module' },
  { taskId: 'b', agent: 'review', description: 'audit', prompt: 'Review error handling' },
], 'release-audit');

# Python; assumes an existing `agent` instance
from a3s_code import Agent, FileSessionStore

session = agent.session("/repo", session_store=FileSessionStore("./.a3s/sessions"))
outcomes = session.parallel_resumable([
    {"task_id": "a", "agent": "explore", "description": "survey", "prompt": "Map the auth module"},
    {"task_id": "b", "agent": "review", "description": "audit", "prompt": "Review error handling"},
], "release-audit")

Because the checkpoint is serializable and the executor is a parameter, a host can resume an interrupted workflow on a different node by passing that node's executor: the framework migrates the what, the host supplies the where.
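
A small model makes the resume rule easier to see. The sketch below is illustrative only: `MemoryStore` stands in for a SessionStore, the checkpoint is simplified to a JSON map of task id to successful outcome (the real WorkflowCheckpoint layout is not shown in this page), and `make_node` is a hypothetical factory for two executors that play the role of two nodes.

```python
import asyncio
import json


class MemoryStore:
    """Hypothetical stand-in for a SessionStore: JSON checkpoints by workflow id."""

    def __init__(self):
        self.blobs = {}

    def load(self, workflow_id):
        return json.loads(self.blobs.get(workflow_id, "{}"))

    def save(self, workflow_id, checkpoint):
        self.blobs[workflow_id] = json.dumps(checkpoint)


async def parallel_resumable(execute_step, specs, workflow_id, store):
    """Skip steps already journaled as successful; run and journal the rest."""
    done = store.load(workflow_id)  # task_id -> successful outcome

    async def branch(spec):
        if spec["task_id"] in done:
            return done[spec["task_id"]]
        outcome = await execute_step(spec)
        if outcome["success"]:  # only successes are recorded, so failures retry
            done[spec["task_id"]] = outcome
            store.save(workflow_id, done)
        return outcome

    return await asyncio.gather(*(branch(s) for s in specs))


def make_node(name, failing=()):
    """Hypothetical executor factory; records which steps it actually ran."""
    ran = []

    async def execute_step(spec):
        ran.append(spec["task_id"])
        ok = spec["task_id"] not in failing
        return {"task_id": spec["task_id"], "output": name, "success": ok}

    return execute_step, ran


store = MemoryStore()
specs = [{"task_id": "a"}, {"task_id": "b"}]
node1, ran1 = make_node("node-1", failing={"b"})
node2, ran2 = make_node("node-2")
first = asyncio.run(parallel_resumable(node1, specs, "release-audit", store))
second = asyncio.run(parallel_resumable(node2, specs, "release-audit", store))
```

On the second run only step b is dispatched, this time through the second node's executor, while step a is served from the checkpoint written during the first run.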
Schema-forced step output
A spec carrying an output_schema (outputSchema in Node) must return a value
conforming to that JSON Schema; the validated object lands in
StepOutcome.structured. The executor reuses the structured-output coercion and
repair machinery. A coercion failure demotes the step to unsuccessful, so a
caller never treats unvalidated text as the promised object.
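
The demotion rule can be illustrated with a deliberately tiny validator. This sketch is an assumption-laden stand-in, not the executor's coercion and repair machinery: `conforms` checks only object type, `required` keys, and string-typed properties (a small subset of JSON Schema), and `finalize` is a hypothetical helper that assumes the step's raw output is JSON text.

```python
import json


def conforms(value, schema):
    """Tiny subset of JSON Schema: object, required keys, string properties."""
    if not isinstance(value, dict):
        return False
    if any(key not in value for key in schema.get("required", [])):
        return False
    for key, rule in schema.get("properties", {}).items():
        if key in value and rule.get("type") == "string" and not isinstance(value[key], str):
            return False
    return True


def finalize(outcome, schema):
    """Attach validated output as `structured`, or demote the step."""
    try:
        value = json.loads(outcome["output"])
    except json.JSONDecodeError:
        value = None
    if value is not None and conforms(value, schema):
        return {**outcome, "structured": value}
    # No repair attempt in this sketch: anything non-conforming is demoted.
    return {**outcome, "success": False, "structured": None}


schema = {
    "type": "object",
    "properties": {"severity": {"type": "string"}, "summary": {"type": "string"}},
    "required": ["severity", "summary"],
}
good = finalize({"task_id": "classify", "success": True,
                 "output": '{"severity": "high", "summary": "null deref"}'}, schema)
bad = finalize({"task_id": "classify", "success": True,
                "output": "looks severe to me"}, schema)
```

In this model `good` keeps `success` and gains a `structured` object, while `bad` is demoted even though the step itself reported success. In the Node SDK the schema rides on the spec as outputSchema: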
const [finding] = await session.parallel([{
  taskId: 'classify', agent: 'review', description: 'classify', prompt: 'Classify this defect',
  outputSchema: {
    type: 'object',
    properties: { severity: { type: 'string' }, summary: { type: 'string' } },
    required: ['severity', 'summary'],
  },
}]);
if (finding.success && finding.structured) {
  console.log(finding.structured.severity);
}

Placing orchestration steps across machines
Lane queues remain a valid transport — but they are now one option behind the
AgentExecutor seam, not the sole integration point. To distribute work, a
host implements AgentExecutor::execute_step over whatever transport fits its
platform (HTTP, message queues, a job system, lane queues, internal RPC), wires
concurrency_hint() to its cluster target, and hands that executor to the
combinators. The grammar — parallel, pipeline, resumable — stays identical; only
placement moves.
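
As a rough sketch of that recipe, a host executor needs little more than a placement policy and a transport. Everything named below is hypothetical: `HostExecutor` uses round-robin placement as one possible policy, `fake_transport` stands in for HTTP, a queue, or RPC, and `slots_per_node` is an invented knob for deriving the cluster-wide hint.

```python
import asyncio
import itertools
import json


class HostExecutor:
    """Hypothetical host-side executor: picks a node, ships the spec as JSON."""

    def __init__(self, nodes, send, slots_per_node=4):
        self.nodes = list(nodes)
        self.send = send  # async (node, payload: str) -> str, the transport
        self.slots_per_node = slots_per_node
        self._next = itertools.cycle(self.nodes)  # round-robin placement

    async def execute_step(self, spec):
        node = next(self._next)
        reply = await self.send(node, json.dumps(spec))
        return json.loads(reply)

    def concurrency_hint(self):
        return len(self.nodes) * self.slots_per_node


async def fake_transport(node, payload):
    """Stands in for a real transport: the 'worker' reports where it ran."""
    spec = json.loads(payload)
    return json.dumps({"task_id": spec["task_id"], "agent": spec["agent"],
                       "output": f"ran on {node}", "success": True})


async def main():
    host = HostExecutor(["node-a", "node-b"], fake_transport)
    specs = [{"task_id": str(i), "agent": "explore"} for i in range(3)]
    return host, [await host.execute_step(s) for s in specs]


host, outcomes = asyncio.run(main())
```

Swapping `fake_transport` for a real client changes how a spec travels, not what the combinators see: they still receive one outcome per spec.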
The result: a coordinator session owns the conversation, final synthesis, and release decision, while its orchestration steps execute wherever the host places them, with cross-node resume carried by the serializable checkpoint.
See Orchestration for the combinator grammar in depth and Persistence for the checkpoint store.