hand-waveIntroduction

Nodeoperator AI is an autonomous node operator agent that deploys, manages, and remediates issues with blockchain infrastructure using GitOps as a human-in-the-loop control model.

Why an AI Node Operator?

Running blockchain infrastructure today is manual, fragile, and error-prone. Nodeoperators and Solo Stakers:

  • Manually Track upstream client releases

  • Perform risky upgrades

  • Debug failing nodes under time pressure

  • Maintain complex Kubernetes environments

Nodeoperator AI is designed to reduce this operational burden while keeping humans in control.

Is AI agent safe for critical infrastructure?

We recognize the concerns about placing critical infrastructure under the control of an AI agent:

"Is AI ready? Can it be trusted with critical infra? What about hallucinations and unpredictable execution?. Is this just jumping on another shining new AI tool?"

These questions are valid and we addressed them head on

Nodeoperator AI is built on a constraint-driven model, not open-ended automation:

  • Actions follow deterministic workflows

  • Operational boundaries are explicitly defined

  • Changes are delivered via GitOps (not direct mutations)

  • The agent uses domain-specific infrastructure knowledge

  • Human approval remains part of the control loop

When sandboxed, scoped, and supervised, AI agents can reduce human error and execute repetitive operational tasks with higher consistency than manual workflows.

System Architecture

Nodeoperator AI is built as modular services:

Interfaces

Where operators interact with the system. Designed to fit existing workflows rather than forcing a chat-only model.

Ponos

Ponos is the command interface for Nodeoperator AI.

Ponos (Greek: Πόνος) means toil, labor, or sustained effort. Ponos takes on that toil for node operators.

Available today:

Interface
Use Case

TUI

Interactive terminal UI with workflow progress cards, real-time logs, and natural language input

Slack

Chat interface for team workflows via natural language, slash commands, and threaded conversations

TUI Features:

  • Natural language command input

  • Session history and resume capability

Slack Features:

  • Natural language chat interface

  • Slash commands for common operations

  • Thread-based conversations for follow-ups

  • Alert response integration

  • Team visibility into operations

Planned interfaces:

  • GitHub Actions — Trigger workflows from CI/CD pipelines

  • GitHub Comments — Operate via PR/issue comments

  • Discord — Community and team workflows

Agent Core (Backend)

This is the agent's decision engine, where context logic, safety guardrails, and operational intelligence are enforced.

  • Workflow orchestration — Manages multi-step operations with checkpoints and rollback capability

  • Session management — Maintains conversation context and execution state across interactions

  • LLM integration — Supports Claude and GPT-4 with streaming responses

  • Safety guardrails — Validates actions against operational rules before execution

  • Rulebook engine — Applies team-defined playbooks and constraints to agent decisions

  • Memory system — Stores and retrieves operational knowledge for context-aware responses

MCP Servers

MCP (Model Context Protocol) servers are modular connectors to external systems. They are separated to allow teams to run their own servers, control credentials, and minimize trust assumptions.

Server
Purpose

GitHub MCP

Create PRs, manage issues, fetch releases, repository operations

Kubernetes MCP

Query pods, fetch logs, read deployments, cluster operations

Slack MCP

Read/send messages, manage threads, chat interface integration

Telescope MCP

Privacy-preserving observability for blockchain infrastructure

Blockchain MCP

Protocol-specific tooling for chain interactions

Key design principles:

  • Self-hostable — Run MCP servers in your own environment

  • Credential isolation — Each server manages its own secrets

  • Minimal trust — The agent only has access to what you explicitly connect

  • Auditable — All MCP calls are logged

All MCP servers are open source: https://github.com/blockopsnetwork/mcp-servers

Core Workflows & Capabilities

Ponos supports three core workflows:

1. Upgrade Workflow

Upgrade blockchain clients and infrastructure components with automated changelog analysis.

  • Supported clients: Ethereum execution/consensus clients (Geth, Prysm, Lighthouse, Teku, Nimbus), EVM chains, Polkadot, Cosmos, and Solana (experimental)

  • What it does:

    • Fetches latest releases from upstream repositories

    • Analyzes changelogs and identifies breaking changes

    • Compares current vs target versions

    • Generates upgrade PR with AI-summarized release notes

Example prompts:

  • "Upgrade mainnet Geth to the latest version"

  • "Show me available Lighthouse versions for testnet"

  • "Upgrade all Ethereum clients on holesky to latest stable"

2. Diagnose Workflow

Investigate node failures using logs, metrics, and cluster state to determine root causes.

  • What it does:

    • Collects pod logs and Kubernetes events

    • Queries Prometheus/Grafana metrics

    • Performs root cause analysis (RCA)

    • Creates GitHub issues with findings

    • Generates fix PRs for common issues (e.g., memory limits, config errors)

Example prompts:

  • "Diagnose mainnet Ethereum validators"

  • "Check why Geth pods are failing on testnet"

  • "Investigate high attestation miss rate on validator-01"

Features

GitOps-First Operations

Upgrade Nodes Through Pull Requests Client upgrades are proposed via GitOps with version and release awareness. Every upgrade includes AI-generated changelog summaries, breaking change detection, and rollback instructions.

Operate Through Git, Not Direct Access Infrastructure is never mutated directly — all changes go through reviewable PRs. This provides a complete audit trail, enables team review, and allows easy rollbacks.

Intelligent Diagnostics

Root Cause Analysis (RCA) When nodes fail, the agent correlates logs, metrics, and Kubernetes state to identify the root cause. Findings are documented in GitHub issues with actionable recommendations.

Automated Fix Generation For common issues (OOM kills, resource limits, configuration errors), the agent generates fix PRs automatically. Human approval is still required before changes are applied.

AI Capabilities

Natural Language Interface Describe what you want in plain English. The agent interprets your intent and executes the appropriate workflow.

Context-Aware Sessions The agent remembers conversation context. Follow-up questions like "now do the same for testnet" work without repeating the full context.

Multi-Model Support Works with Claude and GPT-4. Choose the model that fits your needs and budget.

Operational Safety

Keep Secrets Out of Outputs Sensitive values (API keys, passwords, private keys) are automatically redacted and never exposed in logs, PRs, or agent responses.

Enforce Operational Guardrails Actions are validated against safety rules before execution. The agent cannot perform destructive operations without explicit approval.

Rulebooks Define operational playbooks that the agent must follow. Rulebooks encode your team's best practices and constraints.

Observability & Tracking

Real-Time Progress Workflows display live progress in the TUI. See exactly what the agent is doing at each step.

Execution History All sessions are logged with checkpoints. Resume failed workflows or replay past operations.

Session Continuity If a workflow fails, you can resume from the last checkpoint instead of starting over.

Integration & Extensibility

Work From Your Existing Tools Run operations from the Ponos terminal interface, Slack, or automation workflows (GitHub Actions coming soon).

Integrate With Your Stack Connects to GitHub, Kubernetes, Prometheus, Grafana, and blockchain networks via MCP servers.

Run It in Your Own Environment MCP servers are open source and self-hostable. You control credentials, network access, and trust boundaries.

Multi-Chain Support

Ethereum Ecosystem Full support for execution clients (Geth, Nethermind, Besu, Erigon) and consensus clients (Prysm, Lighthouse, Teku, Nimbus, Lodestar).

Other Networks Polkadot, Cosmos, and Solana support (experimental). The architecture is designed to be chain-agnostic.

Last updated