Agent Self-Healing

Overview

The self-healing system automatically detects and resolves infrastructure issues without human intervention. It integrates with Prometheus for monitoring and uses AI agents for diagnosis and repair.

Healing Pipeline

Prometheus Alert → Alertmanager Webhook → Agent Runtime → Responsible Agent → Heal()

Step 1: Detection

Prometheus monitors all managed resources and fires alerts when thresholds are breached:

# Example Prometheus alert rule
alert: InstanceDown
  expr: up{job="instance-health"} == 0
  for: 2m
  labels:
    severity: critical
    resource_kind: Instance
  annotations:
    summary: "Instance {{ $labels.instance }} is down"

Step 2: Webhook

Alertmanager sends the alert to the Agent Runtime via webhook:

{
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "InstanceDown",
        "resource_kind": "Instance",
        "resource_id": "inst-abc123",
        "severity": "critical"
      }
    }
  ]
}

Step 3: Dispatch

The Agent Runtime maps the resource_kind to the responsible agent and calls its Heal() method with the issue details.

Step 4: Diagnosis

The agent uses the LLM to diagnose the issue by analyzing available data:

Resource status and recent state changes
Metrics history (CPU, memory, disk, network)
Recent logs and events
Provider-level information

Step 5: Repair

Based on the diagnosis, the agent generates and executes a repair plan using its tools.

Issue Types

Issue Type	Description	Typical Repair Actions
`service_down`	Service is unresponsive or crashed	Restart service, reboot instance, failover
`disk_full`	Disk usage exceeds threshold	Clean old logs, rotate files, expand volume
`high_cpu`	Sustained high CPU utilization	Identify runaway process, scale up, add instances
`replication_lag`	Database replica significantly behind	Restart replication, rebuild replica
`connection_timeout`	Cannot establish connection	Check security groups, restart networking, DNS check

## Healing Policies

Healing behavior is configurable per resource kind and issue type:

{
  "resource_kind": "Instance",
  "issue_type": "service_down",
  "auto_heal": true,
  "max_retries": 3
}

When auto_heal is true, the agent attempts repair automatically up to max_retries times. After exhausting retries, the issue is escalated for human intervention.

When auto_heal is false, the agent generates a repair plan but waits for human approval before executing it.

Healing Example

Alert: InstanceDown for inst-abc123
Agent: instance-agent
Diagnosis: SSH connection failed. Provider reports instance is running.
           Network interface may be misconfigured.
Plan:
  1. Attempt soft reboot via provider API
  2. Wait 60s for instance to come back
  3. Verify connectivity via health check
Result: Instance recovered after soft reboot. Total healing time: 90s.