Cluster Health Dashboard

The dashboard is the first screen you see after logging into the Skipper web console. It provides a real-time overview of your cluster's health, showing the status of every infrastructure component, node resource usage, and any recent stability issues.

What the dashboard shows

Component status

The dashboard monitors 13 infrastructure components that make up a Skipper cluster:

Component     What it does
------------  -------------------------------------
k3s           Lightweight Kubernetes distribution
Traefik       Ingress controller and reverse proxy
cert-manager  Automatic TLS certificate management
Longhorn      Distributed block storage
KEDA          Event-driven autoscaling
Loki          Log aggregation
Promtail      Log collection agent
Prometheus    Metrics collection and alerting
Grafana       Metrics visualisation
Velero        Cluster backup and restore
Dex           Identity and authentication provider
Console API   Backend API for the web console
Console       The web console frontend

Each component shows a green or red indicator:

  • Green means the component has at least one healthy replica running.
  • Red means the component is not found or has zero available replicas.
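The indicator rule is simple enough to reproduce at the CLI. A minimal sketch, assuming a hypothetical `component_status` helper fed with the standard Deployment field `.status.availableReplicas` (the component and namespace in the comment are examples):

```shell
# Hypothetical helper mirroring the dashboard's indicator rule:
# green if at least one replica is available, red otherwise.
component_status() {
  local available="$1"
  if [ "${available:-0}" -ge 1 ]; then
    echo green
  else
    echo red
  fi
}

# Feed it a replica count, e.g. from:
#   kubectl get deploy traefik -n kube-system \
#     -o jsonpath='{.status.availableReplicas}'
component_status 2   # green
component_status 0   # red
component_status ""  # red (component not found)
```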

Node information

The nodes table shows every node in your cluster with:

  • Status: whether the node is Ready or NotReady.
  • CPU usage: current CPU consumption reported by metrics-server.
  • Memory usage: current memory consumption reported by metrics-server.
  • Disk usage: reported when available, otherwise shown as "n/a".

CPU and memory values require metrics-server to be installed (it is included in a standard Skipper installation). If metrics-server is unavailable, the values display as "n/a".
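The same per-node figures are available at the CLI via kubectl top nodes. A sketch that parses its standard column layout; the sample output below is illustrative, and in practice you would pipe in the live command instead:

```shell
# Sample of the `kubectl top nodes` output format produced by metrics-server.
sample='NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1    250m         12%    1843Mi          48%
node-2    120m         6%     912Mi           23%'

# Extract the percentage columns the dashboard displays
# (skip the header row; columns 3 and 5 are CPU% and MEMORY%).
echo "$sample" | awk 'NR > 1 { print $1, "cpu=" $3, "mem=" $5 }'
```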

OOM-killed pods

An amber warning banner appears if any pod was terminated due to running out of memory (OOMKilled) in the last 24 hours. Each entry shows the pod name, namespace, and the time the kill occurred.

OOM kills typically indicate that an application needs more memory than its resource limit allows. To resolve this:

  1. Open the app's resource settings in the console or run kip resources set.
  2. Increase the memory limit.
  3. Monitor the dashboard to confirm the kills stop.
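The banner's detection can be approximated from the CLI by checking each container's last termination reason. A sketch over sample data; the jsonpath query in the comment shows one way to produce real input (it only inspects the first container of each pod, and the namespaces and pod names here are made up):

```shell
# Sample lines in "namespace/pod reason" form. Real input could come from:
#   kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
sample='apps/web-7f9c OOMKilled
apps/worker-5d2a Completed
monitoring/loki-0 OOMKilled'

# Keep only pods whose last termination was an OOM kill.
echo "$sample" | awk '$2 == "OOMKilled" { print $1 }'
```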

Node resource pressure

Two cards show real-time memory and CPU utilisation for the cluster:

  • Utilisation bar: colour-coded green (<70%), amber (70-85%), or red (>85%)
  • Sparkline chart: a trend line showing the last hour of usage at a glance
  • Totals: current usage vs allocatable capacity

When memory exceeds 80%, the resource controller generates a warning alert with the top consumers and any anomalies. At 90%+, alerts are marked critical.
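The thresholds above can be sketched as a small classifier; the alert_level name is hypothetical, and only the 80%/90% cut-offs come from the controller's documented behaviour:

```shell
# Map a memory-utilisation percentage to the controller's alert level:
# over 80% -> warning, 90% or more -> critical.
alert_level() {
  local pct="$1"
  if [ "$pct" -ge 90 ]; then
    echo critical
  elif [ "$pct" -gt 80 ]; then
    echo warning
  else
    echo ok
  fi
}

alert_level 75   # ok
alert_level 85   # warning
alert_level 92   # critical
```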

Below the resource cards, a table lists every Skipper-managed workload with:

  • Current memory: how much the workload is using right now
  • Sparkline: a mini trend chart of the last hour. Blue means stable, amber means growing, red means anomaly
  • Change indicator: percentage growth compared to ~10 minutes ago. Workloads that grew more than 30% are flagged as anomalies and highlighted in red

Anomalies are sorted to the top of the list, making it easy to spot a workload that is leaking memory or experiencing unexpected load growth before the node runs out of resources.
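The change indicator reduces to a percentage comparison against the reading from roughly ten minutes earlier. A minimal sketch with hypothetical helper names, using integer memory values (e.g. MiB):

```shell
# Integer percentage change from an old reading to the current one.
growth_pct() {
  echo $(( (($2 - $1) * 100) / $1 ))
}

# Flag growth above 30% as an anomaly, per the rule described above.
is_anomaly() {
  if [ "$(growth_pct "$1" "$2")" -gt 30 ]; then
    echo anomaly
  else
    echo ok
  fi
}

is_anomaly 100 120   # ok      (+20%)
is_anomaly 100 140   # anomaly (+40%)
```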

Resource management mode

The dashboard displays the current resource management mode (Auto or Expert) in the header area. In auto mode, the resource controller runs in the background and adjusts CPU/memory for your apps based on actual usage. A summary of recent auto-mode changes is shown on the dashboard when available. See Resource Management for details on how the controller works and how to switch between modes.

Auto-refresh

The dashboard refreshes automatically every 30 seconds. You can also click the Refresh button to fetch the latest data immediately.

Troubleshooting

If a component shows red, check the following:

  1. Verify the component is deployed. Run kubectl get pods -n <namespace> to see if the pod exists.
  2. Check pod logs. Run kubectl logs -n <namespace> <pod-name> for error messages.
  3. Check events. Run kubectl get events -n <namespace> --sort-by=.lastTimestamp for recent issues.
  4. Restart the component. Run kubectl rollout restart deployment/<name> -n <namespace> if the pod is stuck.

If all components show red and you have just installed the cluster, wait a few minutes for everything to start. Initial provisioning can take 2-5 minutes depending on your server.

Released under the Apache 2.0 License.