Agent Self-Healing

Self-healing pipeline from alert detection to autonomous repair.

Agent Self-Healing

Overview

The self-healing system automatically detects and resolves infrastructure issues without human intervention. It integrates with Prometheus for monitoring and uses AI agents for diagnosis and repair.

Healing Pipeline

Prometheus Alert → Alertmanager Webhook → Agent Runtime → Responsible Agent → Heal()

Step 1: Detection

Prometheus monitors all managed resources and fires alerts when thresholds are breached:

# Example Prometheus alert rule
  • alert: InstanceDown
expr: up{job="instance-health"} == 0 for: 2m labels: severity: critical resource_kind: Instance annotations: summary: "Instance {{ $labels.instance }} is down"

Step 2: Webhook

Alertmanager sends the alert to the Agent Runtime via webhook:

{
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "InstanceDown",
        "resource_kind": "Instance",
        "resource_id": "inst-abc123",
        "severity": "critical"
      }
    }
  ]
}

Step 3: Dispatch

The Agent Runtime maps the resource_kind to the responsible agent and calls its Heal() method with the issue details.

Step 4: Diagnosis

The agent uses the LLM to diagnose the issue by analyzing available data:

  • Resource status and recent state changes
  • Metrics history (CPU, memory, disk, network)
  • Recent logs and events
  • Provider-level information

Step 5: Repair

Based on the diagnosis, the agent generates and executes a repair plan using its tools.

Issue Types

Issue TypeDescriptionTypical Repair Actions
service_downService is unresponsive or crashedRestart service, reboot instance, failover
disk_fullDisk usage exceeds thresholdClean old logs, rotate files, expand volume
high_cpuSustained high CPU utilizationIdentify runaway process, scale up, add instances
replication_lagDatabase replica significantly behindRestart replication, rebuild replica
connection_timeoutCannot establish connectionCheck security groups, restart networking, DNS check
## Healing Policies

Healing behavior is configurable per resource kind and issue type:

{
  "resource_kind": "Instance",
  "issue_type": "service_down",
  "auto_heal": true,
  "max_retries": 3
}

When auto_heal is true, the agent attempts repair automatically up to max_retries times. After exhausting retries, the issue is escalated for human intervention.

When auto_heal is false, the agent generates a repair plan but waits for human approval before executing it.

Healing Example

Alert: InstanceDown for inst-abc123
Agent: instance-agent
Diagnosis: SSH connection failed. Provider reports instance is running.
           Network interface may be misconfigured.
Plan:
  1. Attempt soft reboot via provider API
  2. Wait 60s for instance to come back
  3. Verify connectivity via health check
Result: Instance recovered after soft reboot. Total healing time: 90s.