Agent Nodes
This guide is the implementation reference for Agent Node: a self-hosted
agent-runtimes server that registers to Datalayer Runtimes and serves as
a node-local execution endpoint for AI agent workloads.
It is the canonical developer documentation. End-user documentation lives in
the central UI under Docs → Agent Nodes.
Documentation Map
- End-user overview: /docs/agent-nodes
- End-user laptop workflow: /docs/agent-nodes-laptop
- End-user AWS workflow: /docs/agent-nodes-aws
- CloudFormation definitions source: github.com/datalayer/agent-runtimes/tree/main/aws
If you are operating a Dockerized node, the local entrypoint is typically
http://localhost:8765 and resolves to the Agent Node UI
(/html/agent-node.html) in node mode.
Scope and Goals
- End-to-end Agent Node lifecycle: register, heartbeat, health, configure mode, list, inspect, and route chat.
- Inference routing control per Agent Node runtime: `inferenceProvider = local | datalayer`.
- Keep configuration source-of-truth local on each node.
- Mode taxonomy: `private | shared | sleep`.
- Central visibility: own nodes (all modes) + others only in `shared` mode.
- Datalayer Runtimes (not operator) is the control plane and tunnel hub.
- Ship a dockerized runtime and a simple local development workflow.
Architecture Decisions
- Datalayer Runtimes is the system of record for Agent Nodes; `datalayer-operator` is no longer involved in Agent Node lifecycle.
- The central service mints node identity (ULID) on first register. The local node persists whatever id the service returns and reuses it on subsequent calls.
- A node-to-runtimes tunnel is required: UI ↔ Node chat traffic transits through Datalayer Runtimes; nodes never accept inbound traffic from the UI.
- Visibility model in the central UI:
  - show my nodes (all modes) with live status,
  - show other users' `shared` mode nodes only,
  - never expose other users' `private` or `sleep` nodes.
- Agent Node UI is node-local administration; the central UI is read-only for configuration and operational visibility.
- One Docker image launches `agent-runtimes` + Agent Node UI, supporting all three modes.
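
The visibility rule above can be sketched as a simple filter. This is an illustration only: `NodeView`, `visible_nodes`, and the field names are hypothetical, and the per-node sharing policy that further restricts `shared` nodes is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeView:
    """Hypothetical minimal projection of a registry entry."""
    node_id: str
    owner_uid: str
    mode: str  # "private" | "shared" | "sleep"


def visible_nodes(nodes: list[NodeView], viewer_uid: str) -> list[NodeView]:
    """Return the viewer's own nodes (any mode) plus other users' shared nodes."""
    return [
        n for n in nodes
        if n.owner_uid == viewer_uid or n.mode == "shared"
    ]


registry = [
    NodeView("n1", "alice", "private"),
    NodeView("n2", "alice", "sleep"),
    NodeView("n3", "bob", "shared"),
    NodeView("n4", "bob", "private"),
    NodeView("n5", "bob", "sleep"),
]
print([n.node_id for n in visible_nodes(registry, "alice")])  # ['n1', 'n2', 'n3']
```

Other users' `private` and `sleep` nodes never pass the filter, which is the property the policy tests should pin down.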
Modes
| Mode | Purpose |
|---|---|
| `private` | Owner-only execution under a selected billable account. |
| `shared` | Discoverable / usable by others per the sharing policy. |
| `sleep` | Registered and visible but does not accept execution. |
Inference Provider
Agent Node supports two inference providers:
| Provider | Behavior |
|---|---|
| `local` | Uses local model/provider resolution in `agent-runtimes`. |
| `datalayer` | Routes LLM calls through `datalayer-ai-inference` (`/api/ai-inference/v1/*`). |
Effective Default
- In Agent Node mode, the effective default is `datalayer`.
- Outside Agent Node mode, the default remains `local`.
- `AGENT_RUNTIMES_INFERENCE_PROVIDER_OVERRIDE` can pin either value.
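
A minimal sketch of that resolution order, under stated assumptions: the function name and the `agent_node_mode` flag are hypothetical, and ignoring an unrecognized override value is a guess at the behavior, not something this guide specifies.

```python
import os
from typing import Mapping, Optional

VALID_PROVIDERS = ("local", "datalayer")


def effective_inference_provider(
    agent_node_mode: bool,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve the provider: explicit override first, then the mode default."""
    env = os.environ if env is None else env
    override = env.get("AGENT_RUNTIMES_INFERENCE_PROVIDER_OVERRIDE", "").strip().lower()
    if override in VALID_PROVIDERS:
        return override  # pinned by the operator
    # Assumption: any other override value is ignored rather than rejected.
    return "datalayer" if agent_node_mode else "local"


print(effective_inference_provider(True, {}))   # datalayer
print(effective_inference_provider(False, {}))  # local
```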
Runtime Configuration API
The node-local configuration API exposes inference controls:
- `GET /api/v1/configure/inference/provider`
- `PUT /api/v1/configure/inference/provider`
- `GET /api/v1/configure/inference/models`
`/inference/models` is provider-aware:
- For `local`, returns an empty model list.
- For `datalayer`, proxies `datalayer-ai-inference /models`.
- If upstream returns no models (or is unavailable), Agent Node applies a Bedrock Anthropic fallback list derived from `agentspecs` (with env fallback), so the UI can still present usable choices.
- For `datalayer`, the payload may also include `bedrock_anthropic_model_specs` metadata used by the UI to preselect the active default model.
Mode is part of every heartbeat and every health report. Changing the mode in the local UI synchronously triggers an extra health report so the central registry reflects the new state immediately.
Canonical Models
AgentNodeConfiguration
{
"mode": "private", // private | shared | sleep
"node_uid": "01JABCDE...", // ULID assigned by central service
"billable_account_uid": "...",
"billable_account_type": "...",
"billable_account_handle": "...",
"sharing": { /* sharing policy for shared mode */ }
}
AgentNodeHealth
{
"mode": "private",
"hostname": "node-host",
"platform": "Linux",
"platform_release": "6.5.0-...",
"python_version": "3.13.0",
"agent_runtimes_version": "dev",
"cpu_count": 16,
"cpu_percent": 12.3,
"memory_total_mb": 64000,
"memory_available_mb": 41280,
"load_average": [0.32, 0.45, 0.51],
"uptime_seconds": 1234.5,
"reported_at": "...",
"reason": "periodic" // periodic | mode_change | startup
}
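
As a rough illustration, part of this payload can be assembled from the Python standard library alone. The sketch makes several assumptions: `build_health` is a hypothetical name, `uptime_seconds` is taken to mean process uptime, `reported_at` is written as an ISO 8601 UTC timestamp, and `"dev"` is used as the version default. `cpu_percent` and the memory fields are omitted because they need a third-party system-metrics library.

```python
import os
import platform
import time
from datetime import datetime, timezone

_STARTED = time.monotonic()
_REASONS = ("periodic", "mode_change", "startup")


def build_health(mode: str, reason: str = "periodic") -> dict:
    """Build a partial AgentNodeHealth payload from stdlib sources."""
    if reason not in _REASONS:
        raise ValueError(f"unknown reason: {reason!r}")
    health = {
        "mode": mode,
        "hostname": platform.node(),
        "platform": platform.system(),
        "platform_release": platform.release(),
        "python_version": platform.python_version(),
        # Assumption: "dev" when AGENT_RUNTIMES_VERSION is unset.
        "agent_runtimes_version": os.environ.get("AGENT_RUNTIMES_VERSION", "dev"),
        "cpu_count": os.cpu_count(),
        # Assumption: uptime of this process, not of the host.
        "uptime_seconds": round(time.monotonic() - _STARTED, 1),
        # Assumption: ISO 8601 in UTC.
        "reported_at": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    }
    try:
        health["load_average"] = list(os.getloadavg())
    except (AttributeError, OSError):
        pass  # load average is not available on every platform
    return health


snapshot = build_health("private", reason="startup")
print(snapshot["mode"], snapshot["reason"])  # private startup
```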
AgentNodeRecord (central)
Adds owner identity, status, timestamps, capabilities, plus the latest
health snapshot and last_health_at.
Identity (node_uid)
- The central `datalayer-runtimes` service is the sole authority for minting Agent Node identifiers.
- On the first `POST /api/runtimes/v1/agent-nodes/register` from a node, the service mints a ULID and returns it as `agent_node.node_id`.
- The local node persists that id as `configuration.node_uid` in `~/.datalayer/agent-node.json` (override with `AGENT_NODE_STATE_PATH`) and reuses it for all subsequent `/register`, `/heartbeat`, and `/health` calls.
- `AGENT_NODE_ID` remains an operator escape hatch for externally pinned ids. When set, no service-side minting occurs and the value is used verbatim.
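
The persistence rules above might look roughly like this. It is a sketch, not the actual implementation: the helper names are hypothetical, and it assumes the state file holds the configuration object at the top level with a `node_uid` key.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def state_path(env: Mapping[str, str]) -> Path:
    """Resolve the state file, honoring AGENT_NODE_STATE_PATH."""
    override = env.get("AGENT_NODE_STATE_PATH")
    return Path(override) if override else Path.home() / ".datalayer" / "agent-node.json"


def load_node_uid(env: Mapping[str, str]) -> Optional[str]:
    """Return the pinned id, else the persisted id, else None (first register)."""
    pinned = env.get("AGENT_NODE_ID")
    if pinned:
        return pinned  # used verbatim; no service-side minting
    path = state_path(env)
    if not path.exists():
        return None
    return json.loads(path.read_text()).get("node_uid")


def save_node_uid(node_uid: str, env: Mapping[str, str]) -> None:
    """Persist the id returned by the central service, keeping other keys."""
    path = state_path(env)
    path.parent.mkdir(parents=True, exist_ok=True)
    state = json.loads(path.read_text()) if path.exists() else {}
    state["node_uid"] = node_uid
    path.write_text(json.dumps(state, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    demo_env = {"AGENT_NODE_STATE_PATH": os.path.join(tmp, "agent-node.json")}
    first = load_node_uid(demo_env)        # nothing persisted yet
    save_node_uid("01JEXAMPLE", demo_env)  # placeholder, not a real ULID
    second = load_node_uid(demo_env)
    pinned = load_node_uid({**demo_env, "AGENT_NODE_ID": "pinned-id"})

print(first, second, pinned)  # None 01JEXAMPLE pinned-id
```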
Synchronization Loop
The background sync task in agent_runtimes/agent_node_sync.py:
- Registers with the central service. On first call, omits `node_id`; captures the assigned ULID from the response and persists it.
- Sends a `heartbeat` with the current configuration.
- Sends a `health` snapshot:
  - on startup (`reason="startup"`),
  - every `AGENT_NODE_HEALTH_SECONDS` (default `60s`, `reason="periodic"`),
  - immediately on any local mode change (`reason="mode_change"`).
- Waits up to `AGENT_NODE_HEARTBEAT_SECONDS` (default `20s`) before the next tick, waking early on stop or mode-change.
Mode-change triggering uses register_mode_change_callback() exported by
routes/agent_node.py. The callback fires inside set_agent_node_configuration
when the new mode differs from the persisted one.
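
The loop and its early wake-up can be sketched with `asyncio`. This is not the code in `agent_node_sync.py`: `SyncLoop`, `send`, and `notify_mode_change` are hypothetical names, registration is assumed to happen once before the first tick, and the network calls are replaced by a callback so the control flow is visible.

```python
import asyncio
import time
from typing import Callable, Optional


class SyncLoop:
    """Heartbeat every tick; health on startup, on cadence, and on mode change."""

    def __init__(self, send: Callable[[str, Optional[str]], None],
                 heartbeat_s: float = 20.0, health_s: float = 60.0) -> None:
        self.send = send  # hypothetical sink: send(kind, reason)
        self.heartbeat_s = heartbeat_s
        self.health_s = health_s
        self._wake = asyncio.Event()
        self._stopped = False
        self._mode_changed = False

    def notify_mode_change(self) -> None:
        """What a mode-change callback would invoke."""
        self._mode_changed = True
        self._wake.set()

    def stop(self) -> None:
        self._stopped = True
        self._wake.set()

    async def run(self) -> None:
        self.send("register", None)
        reason: Optional[str] = "startup"
        last_health = 0.0
        while not self._stopped:
            self.send("heartbeat", None)
            now = time.monotonic()
            if self._mode_changed:
                reason, self._mode_changed = "mode_change", False
            elif reason is None and now - last_health >= self.health_s:
                reason = "periodic"
            if reason is not None:
                self.send("health", reason)
                last_health, reason = now, None
            try:
                # Sleep up to one heartbeat interval, or wake early.
                await asyncio.wait_for(self._wake.wait(), timeout=self.heartbeat_s)
            except asyncio.TimeoutError:
                pass
            self._wake.clear()


async def _demo() -> list:
    events: list = []
    loop = SyncLoop(lambda kind, reason: events.append((kind, reason)), heartbeat_s=5.0)
    task = asyncio.create_task(loop.run())
    await asyncio.sleep(0.01)  # let the startup tick run
    loop.notify_mode_change()  # wakes the loop before the 5 s timeout
    await asyncio.sleep(0.01)
    loop.stop()
    await task
    return events


events = asyncio.run(_demo())
print(events)
```

In the demo the second heartbeat and the `mode_change` health report arrive after about 10 ms instead of a full heartbeat interval, which is the point of the early wake.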
Endpoints
Local node — agent-runtimes
- `GET /api/v1/agent-node/configuration`: read persisted configuration including `node_uid`.
- `POST /api/v1/agent-node/configuration`: replace configuration. The `node_uid` is owned by the central service and is never overwritten by this endpoint; only `set_agent_node_uid()` updates it.
- `GET /api/v1/configure/inference/provider`: read effective runtime inference provider (`local | datalayer`).
- `PUT /api/v1/configure/inference/provider`: update runtime inference provider for subsequent launches/sessions.
- `GET /api/v1/configure/inference/models`: list available models for the active provider (Bedrock Anthropic focused for `datalayer`).
Central — datalayer-runtimes
- `POST /api/runtimes/v1/agent-nodes/register`: `node_id` optional; ULID minted via `datalayer_common.utils.new_ulid` when missing.
- `POST /api/runtimes/v1/agent-nodes/heartbeat`: `node_id` + current `configuration`.
- `POST /api/runtimes/v1/agent-nodes/configuration`: update configuration.
- `POST /api/runtimes/v1/agent-nodes/health`: periodic / on-demand health snapshot. Also refreshes `last_seen_at` and, if `health.mode` differs from the registered configuration, reconciles the registered mode.
- `GET /api/runtimes/v1/agent-nodes`: list visible nodes (own + others shared).
- `GET /api/runtimes/v1/agent-nodes/{node_id}`: fetch one node.
- Tunnel endpoints: `GET .../tunnel/ws`, `POST .../{node_id}/tunnel/messages`, `GET .../{node_id}/tunnel/events`, `GET .../{node_id}/tunnel/status`.
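
For orientation, a client-side sketch of the register and heartbeat calls using only `urllib`. The request body shapes are assumptions (only `node_id` and `configuration` are named in this guide), the helper names are hypothetical, and `runtimes.example` is a placeholder host. The requests are built but not sent.

```python
import json
import urllib.request
from typing import Optional

PREFIX = "/api/runtimes/v1/agent-nodes"


def build_post(base_url: str, path: str, api_key: str, body: dict) -> urllib.request.Request:
    """Build an authenticated JSON POST against the central registry."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + PREFIX + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def register_request(base_url: str, api_key: str, configuration: dict,
                     node_id: Optional[str] = None) -> urllib.request.Request:
    """Omit node_id on first register so the service mints the ULID."""
    body: dict = {"configuration": configuration}  # assumed body shape
    if node_id is not None:
        body["node_id"] = node_id
    return build_post(base_url, "/register", api_key, body)


def heartbeat_request(base_url: str, api_key: str, node_id: str,
                      configuration: dict) -> urllib.request.Request:
    """Heartbeat carries node_id plus the current configuration."""
    return build_post(base_url, "/heartbeat", api_key,
                      {"node_id": node_id, "configuration": configuration})


req = register_request("https://runtimes.example/", "demo-key", {"mode": "private"})
print(req.get_method(), req.full_url)
# POST https://runtimes.example/api/runtimes/v1/agent-nodes/register
```

Sending would be a call to `urllib.request.urlopen(req)`, after which the caller reads `agent_node.node_id` from the response and persists it.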
Environment Variables
| Variable | Purpose |
|---|---|
| `DATALAYER_RUNTIMES_URL` | Base URL of central runtimes service. |
| `DATALAYER_API_KEY` | Bearer token for register/heartbeat/health. |
| `AGENT_NODE_ID` | Optional externally pinned node id. |
| `AGENT_NODE_NAME` | Optional display name (defaults to hostname). |
| `AGENT_NODE_MODE` | Initial mode (only on first start before any persistence). |
| `AGENT_NODE_STATE_PATH` | Override for the persisted configuration JSON. |
| `AGENT_NODE_HEARTBEAT_SECONDS` | Heartbeat cadence (default `20`). |
| `AGENT_NODE_HEALTH_SECONDS` | Health cadence (default `60`). |
| `AGENT_RUNTIMES_VERSION` | Embedded in heartbeat / health payloads. |
| `AGENT_RUNTIMES_INFERENCE_PROVIDER_OVERRIDE` | Runtime override for `local` / `datalayer`. |
| `DATALAYER_AI_INFERENCE_URL` | Base URL for `datalayer-ai-inference` (default local `:4450`). |
| `DATALAYER_INFERENCE_PROVIDER` | ai-inference backend provider (e.g. `bedrock`, `azure`). |
| `DATALAYER_BEDROCK_*` | Bedrock credentials/region/model inputs used by ai-inference. |
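
One way to read the node-related variables into a typed object is sketched below. `NodeEnv` and `load_node_env` are hypothetical names; the defaults mirror the table, and parsing the cadences as floats is an assumption.

```python
import os
import socket
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class NodeEnv:
    """Hypothetical typed view over the Agent Node environment variables."""
    runtimes_url: Optional[str]
    api_key: Optional[str]
    node_id: Optional[str]
    node_name: str
    initial_mode: Optional[str]
    heartbeat_seconds: float
    health_seconds: float


def load_node_env(env: Optional[Mapping[str, str]] = None) -> NodeEnv:
    """Read the variables from the table above, applying documented defaults."""
    env = os.environ if env is None else env
    return NodeEnv(
        runtimes_url=env.get("DATALAYER_RUNTIMES_URL"),
        api_key=env.get("DATALAYER_API_KEY"),
        node_id=env.get("AGENT_NODE_ID") or None,
        node_name=env.get("AGENT_NODE_NAME") or socket.gethostname(),
        initial_mode=env.get("AGENT_NODE_MODE") or None,
        heartbeat_seconds=float(env.get("AGENT_NODE_HEARTBEAT_SECONDS", "20")),
        health_seconds=float(env.get("AGENT_NODE_HEALTH_SECONDS", "60")),
    )


cfg = load_node_env({"AGENT_NODE_NAME": "lab-node", "AGENT_NODE_HEALTH_SECONDS": "30"})
print(cfg.node_name, cfg.heartbeat_seconds, cfg.health_seconds)  # lab-node 20.0 30.0
```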
UI Flow (Node-Local)
The src/AgentNode.tsx app exposes three steps:
- Authentication: sign in (token or API key). Configuration / Chat header entries are hidden until authenticated.
- Configuration: pick mode, billable account, and sharing policy. Also pick the inference provider (`local` or `datalayer`). When `datalayer` is selected, the UI shows available Bedrock Anthropic models returned by `/api/v1/configure/inference/models` as a read-only selector. The active option is preselected from model metadata/defaults, and model switching is intentionally locked in this view.
- Chat: node-local AG-UI chat. Only enabled when `configuration.mode === 'private'`.
If mode is changed away from private while in the chat view, the user is
returned to the configuration view.
Security and Authorization
- Runtimes is the single public control plane for Agent Nodes.
- The tunnel is authenticated and bound to the owner identity.
- UI requests are validated by user identity and node access policy.
- Shared-mode access follows the explicit sharing policy.
Development Workflow
# Local UI + Python server
make agent-nodes
# Local UI + Python against a local plane stack
make agent-nodes:proxy
# Dockerized runtime
make agent-nodes-docker-build DOCKER_TAG=dev
docker run --rm -p 8765:8765 datalayer/agent-nodes:dev
For local proxy development, make agent-nodes:proxy exports both
DATALAYER_AI_INFERENCE_URL and VITE_DATALAYER_AI_INFERENCE_URL to the
local ai-inference service (http://localhost:4450 by default).
plane local also starts the local ai-inference service and waits for
/api/ai-inference/v1/health readiness.
Risks and Mitigations
- Tunnel instability — reconnect with backoff, heartbeats, session expiry.
- Visibility regressions — explicit policy tests for own/shared-only queries.
- Node-local vs central drift — local configuration and central registry can diverge transiently during network disruption; mitigated by periodic heartbeat/health sync plus immediate mode-change health reports.
- Registry restart loss — v1 in-memory registry is documented as expected; persistence is a follow-up.
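
The "reconnect with backoff" mitigation is not specified further in this guide. One common choice is exponential backoff with full jitter, sketched here with hypothetical parameters (`base`, `cap`) that are assumptions rather than the tunnel's actual settings.

```python
import random
from typing import Callable


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter backoff: a random delay up to min(cap, base * 2**attempt)."""
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


# Upper bounds for the first few reconnect attempts (rng pinned to 1.0).
print([backoff_delay(a, rng=lambda: 1.0) for a in range(8)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Jitter spreads reconnects out so that many nodes dropping at once do not all retry against Datalayer Runtimes at the same instant.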
Acceptance Criteria
- Node can run in `private`, `shared`, and `sleep` modes.
- Node registers and heartbeats directly against the runtimes-owned registry; ULID is minted server-side on first register.
- Health is reported on startup, on a fixed cadence, and on every mode change.
- Tunnel routes chat traffic both ways via runtimes.
- UI Agent Nodes page shows my nodes (all modes) + other users' shared nodes, including node id, mode, status, host, and last-seen.
- UI supports navigation from a node row/detail to chat via the selected node.
- Docker image launches node UI + server with all three modes.
- `make agent-nodes` supports easy local UI + Python development.