Incident Debugging - Akua

When asked “why is my app returning 500s?”, an agent runs a full incident triage: it checks pod health, recent Kubernetes events, error logs, and deployment rollout history. The triage is composed across multiple execute calls, and the agent reasons about each result before deciding what to check next.

The triage flow

This isn’t a single code block. It’s how the agent thinks. Each step is one execute call, but the agent decides what to check based on what it finds.

Step 1: Pod health check

async () => {
  const clusterId = "cls_abc123"; // resolved by the agent from conversation
  const namespace = "production"; // resolved by the agent from conversation

  const kube = (path) => cnap.request({
    method: "GET",
    path: `/v1/clusters/${clusterId}/kube_proxy/${path}`,
  }).then(r => r.body);

  const pods = await kube(`api/v1/namespaces/${namespace}/pods`);

  return pods.items.map(p => ({
    name: p.metadata.name,
    phase: p.status.phase,
    restarts: p.status.containerStatuses?.reduce((s, c) => s + c.restartCount, 0) || 0,
    ready: p.status.containerStatuses?.every(c => c.ready) || false,
    containers: p.status.containerStatuses?.map(c => ({
      name: c.name,
      ready: c.ready,
      restarts: c.restartCount,
      state: Object.keys(c.state || {})[0],
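      reason: c.state?.waiting?.reason || c.state?.terminated?.reason || null,
      // Why the previous instance exited (e.g. "OOMKilled"); null if it never restarted
      lastTerminated: c.lastState?.terminated?.reason || null,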
    })),
  }));
}

The agent sees a pod in CrashLoopBackOff with 12 restarts. It decides to check events and logs.
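The decision the agent makes at this point can be pictured as a filter over the Step 1 summary. The sketch below is illustrative only: `findUnhealthyPods` and its `restartThreshold` parameter are hypothetical names, not part of the platform, and the real agent reasons over the result rather than applying a fixed rule.

```javascript
// Hypothetical helper: flag pods from the Step 1 summary that deserve a closer look.
// `pods` is the array returned by the Step 1 snippet.
function findUnhealthyPods(pods, restartThreshold = 5) {
  return pods
    // Completed pods (phase "Succeeded") are never ready, so skip them explicitly.
    .filter(p => p.phase !== "Succeeded" && (!p.ready || p.restarts >= restartThreshold))
    .map(p => ({
      name: p.name,
      restarts: p.restarts,
      // Distinct non-null container reasons, e.g. "CrashLoopBackOff"
      reasons: [...new Set((p.containers || []).map(c => c.reason).filter(Boolean))],
    }));
}
```

Passing the Step 1 result through a filter like this would leave only the pods whose names feed Steps 2 and 3.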

Step 2: Recent events

async () => {
  const clusterId = "cls_abc123"; // resolved by the agent from conversation
  const namespace = "production"; // from step 1
  const podName = "api-proxy-7f8b4c..."; // from step 1 results

  const kube = (path) => cnap.request({
    method: "GET",
    path: `/v1/clusters/${clusterId}/kube_proxy/${path}`,
  }).then(r => r.body);

  const events = await kube(
    `api/v1/namespaces/${namespace}/events?fieldSelector=involvedObject.name=${podName}`
  );

  // Sort newest first; fall back to eventTime when lastTimestamp is unset
  return events.items
    .sort((a, b) => new Date(b.lastTimestamp || b.eventTime) - new Date(a.lastTimestamp || a.eventTime))
    .slice(0, 15)
    .map(e => ({
      type: e.type,
      reason: e.reason,
      message: e.message,
      count: e.count,
      last: e.lastTimestamp,
    }));
}

Events show OOMKilled. The container ran out of memory. The agent checks logs to confirm.
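The Step 2 request interpolates the pod name straight into the query string. A slightly more defensive variant, sketched here, URL-encodes the selector and also scopes it to `involvedObject.kind=Pod` so that other objects with the same name do not match. `eventsPathForPod` is a hypothetical helper, and the sketch assumes the kube proxy forwards the query string unchanged.

```javascript
// Hypothetical helper: build the events path with an encoded, kind-scoped field selector.
function eventsPathForPod(namespace, podName) {
  // Multiple field selector terms are comma-separated and ANDed by the API server.
  const selector = [
    "involvedObject.kind=Pod",
    `involvedObject.name=${podName}`,
  ].join(",");
  return `api/v1/namespaces/${encodeURIComponent(namespace)}/events` +
    `?fieldSelector=${encodeURIComponent(selector)}`;
}
```

The returned string could be passed to the `kube` helper from Step 2 in place of the inline template literal.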

Step 3: Error logs

async () => {
  const clusterId = "cls_abc123"; // resolved by the agent from conversation
  const namespace = "production"; // from step 1
  const podName = "api-proxy-7f8b4c..."; // from step 1 results

  const logs = await cnap.request({
    method: "GET",
    path: `/v1/clusters/${clusterId}/kube_proxy/api/v1/namespaces/${namespace}/pods/${podName}/log`,
    query: { tailLines: "200", previous: "true" },
  }).then(r => r.body);

  // Filter for errors and warnings
  const lines = logs.split("\n");
  const errors = lines.filter(l =>
    /error|fatal|panic|exception|oom|killed/i.test(l)
  );

  return {
    total_lines: lines.length,
    error_lines: errors.length,
    errors: errors.slice(-20),
  };
}

Note the previous: "true" query parameter: it returns logs from the previous (crashed) container instance rather than the one that is currently restarting. The agent finds memory allocation failures in the last 20 error lines.
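The filtering in Step 3 can be factored into a pure function, which makes it easy to check against sample log text. This is a sketch under assumptions: `summarizeLogs` and the `counts` field are hypothetical additions, and unlike the snippet above it drops empty lines before counting.

```javascript
// Hypothetical helper: the Step 3 filter as a pure function, plus a per-keyword tally.
function summarizeLogs(logText, keep = 20) {
  const keywords = ["error", "fatal", "panic", "exception", "oom", "killed"];
  const lines = logText.split("\n").filter(l => l.length > 0);
  const matches = (line, k) => line.toLowerCase().includes(k);
  const errors = lines.filter(l => keywords.some(k => matches(l, k)));

  // Count how many error lines mention each keyword (a line can count for several).
  const counts = {};
  for (const line of errors) {
    for (const k of keywords) {
      if (matches(line, k)) counts[k] = (counts[k] || 0) + 1;
    }
  }

  return {
    total_lines: lines.length,
    error_lines: errors.length,
    counts,
    errors: errors.slice(-keep), // most recent matches only
  };
}
```

A tally like `counts` gives the agent a quick signal (for example, many `oom` hits) before it reads the individual lines.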

Step 4: Deployment rollout history

async () => {
  const clusterId = "cls_abc123"; // resolved by the agent from conversation
  const namespace = "production"; // from step 1
  const deploymentName = "api-proxy"; // from step 1 results

  const kube = (path) => cnap.request({
    method: "GET",
    path: `/v1/clusters/${clusterId}/kube_proxy/${path}`,
  }).then(r => r.body);

  const [deployment, replicaSets] = await Promise.all([
    kube(`apis/apps/v1/namespaces/${namespace}/deployments/${deploymentName}`),
    kube(`apis/apps/v1/namespaces/${namespace}/replicasets`),
  ]);

  // Find ReplicaSets owned by this deployment, newest revision first
  const revision = (rs) =>
    parseInt(rs.metadata.annotations?.["deployment.kubernetes.io/revision"] || "0", 10);
  const owned = replicaSets.items
    .filter(rs => rs.metadata.ownerReferences?.some(
      o => o.kind === "Deployment" && o.name === deploymentName))
    .sort((a, b) => revision(b) - revision(a));

  return {
    current_image: deployment.spec.template.spec.containers[0]?.image,
    current_limits: deployment.spec.template.spec.containers[0]?.resources?.limits,
    revisions: owned.slice(0, 5).map(rs => ({
      revision: rs.metadata.annotations?.["deployment.kubernetes.io/revision"],
      image: rs.spec.template.spec.containers[0]?.image,
      limits: rs.spec.template.spec.containers[0]?.resources?.limits,
      replicas: rs.status.replicas,
      created: rs.metadata.creationTimestamp,
    })),
  };
}

The agent finds that the latest revision changed the image and also removed the memory limits. Root cause identified.
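Spotting what changed between two revisions amounts to diffing the pod templates of the corresponding ReplicaSets. The sketch below compares only the first container's image and resource limits; `diffRevisions` is a hypothetical helper that takes two raw ReplicaSet objects as returned by the Kubernetes API, not something the platform provides.

```javascript
// Hypothetical helper: compare the first container of two ReplicaSet objects.
function diffRevisions(newer, older) {
  const first = rs => rs.spec.template.spec.containers[0] || {};
  const a = first(newer);
  const b = first(older);
  const newLimits = a.resources?.limits || {};
  const oldLimits = b.resources?.limits || {};
  // Union of limit keys so that removed and added limits both show up.
  const keys = [...new Set([...Object.keys(oldLimits), ...Object.keys(newLimits)])];

  return {
    image_changed: a.image !== b.image,
    image: { from: b.image, to: a.image },
    limits_changed: keys
      .filter(k => oldLimits[k] !== newLimits[k])
      .map(k => ({ resource: k, from: oldLimits[k] ?? null, to: newLimits[k] ?? null })),
  };
}
```

A `to: null` entry in `limits_changed` is the signal that a limit present in the older revision is missing from the newer one.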

Why this matters

An SRE doing this manually would run:

kubectl get pods to check status
kubectl describe pod to read events
kubectl logs --previous to check crash logs
kubectl rollout history to check what changed

That’s four separate commands, each producing raw output the SRE has to parse mentally. The agent also makes four execute calls, but each one filters and extracts only what’s relevant, so the LLM reasons about structured findings, not walls of YAML.

More importantly, the agent adapts. It doesn’t run a fixed checklist: it sees OOMKilled and decides to check the previous container’s logs and the deployment history. A traditional MCP tool would need a pre-built “debug pod” tool that tries to anticipate every scenario.

Related topics

Kubernetes access

The kube proxy and exec endpoints used in each triage step.

Security audit

Proactive security checks before incidents occur.

Parallel log analysis

Fetch and count logs across all pods in a single call.

Hosted agents

Ambient agents that start triage automatically on deploy failures.
