How We Built a One-Command Health Check for Kubernetes Clusters

In fast-moving environments, it’s easy to assume Kubernetes is fine as long as workloads are running. But when real issues surface—like stale deployments, failed pods, or node reboots—assumptions break down quickly.

At Optimum Partners, we recently tackled this challenge in a single-node Kubernetes cluster running key observability components. The cluster was small, but the risk of invisible failure modes was real.

So we built something deceptively simple: a script.
One command. Full cluster visibility. Designed for humans.

Here’s what we learned.

The Visibility Gap in Kubernetes

Kubernetes gives you tools to see everything—but no default way to see it all at once.

In our setup, developers and support staff needed fast answers to common questions:

  • Which pods are down or in error state?
  • Are services up across all namespaces?
  • Has the node rebooted recently?
  • Are there deprecated resources we missed?

Without dashboards or external tools, that meant jumping between kubectl commands, grepping outputs, and manually correlating data.

We wanted to compress all of that into one clean interface—with context.

What We Built: A Fast, Readable Health Snapshot

We created a Bash script that combines standard kubectl queries with process-level insight from the host node.

The script delivers a live cluster snapshot with:

  • ✅ Verified kubectl connectivity
  • 🧠 Node uptime, version, and readiness
  • 📦 Status of all deployments, pods, services, PVCs, and custom resources
  • 🌐 Ingress configurations
  • 🗂 Recent events across namespaces (last 20–50)
  • 💾 Optional save-to-file with timestamped logs for RCA or audit

Each section is color-coded and well-formatted for scanning large outputs during incident triage.
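As a rough illustration of the connectivity check and the color-coded section headers, here is a minimal sketch. It is not the original script: the helper names `section` and `check_connectivity` are hypothetical, and emitting colors only when stdout is a terminal is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative, not taken from the original script.

# Assumption: emit colors only when stdout is a terminal.
if [ -t 1 ]; then
  BOLD="$(printf '\033[1m')"; GREEN="$(printf '\033[32m')"
  RED="$(printf '\033[31m')"; RESET="$(printf '\033[0m')"
else
  BOLD=''; GREEN=''; RED=''; RESET=''
fi

# Print a visually distinct section header.
section() {
  printf '\n%s=== %s ===%s\n' "$BOLD" "$1" "$RESET"
}

# Verify that kubectl can reach the API server before running other queries.
check_connectivity() {
  if kubectl cluster-info >/dev/null 2>&1; then
    printf '%sOK%s kubectl can reach the cluster\n' "$GREEN" "$RESET"
  else
    printf '%sFAIL%s kubectl cannot reach the cluster\n' "$RED" "$RESET"
    return 1
  fi
}
```

A report would typically start with `section "Connectivity"` followed by `check_connectivity || exit 1`, so later sections never run against an unreachable cluster.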

Technical Highlights

We kept it minimal—but powerful:

  • Used kubectl get + JSONPath to iterate through namespaces and resource types
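  • Pulled node uptime over SSH using systemctl show -p ActiveEnterTimestamp kubelet, which reports when the kubelet last started (this matches boot time unless the kubelet has been restarted since)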
  • Displayed and optionally saved output using tee
  • Structured the report using temporary files for clean export
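The namespace iteration in the first bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the original script: the resource list and the function names `list_namespaces` and `report_resources` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the resource list and function names are assumptions.

RESOURCE_TYPES="deployments pods services pvc ingress"

# List every namespace name, one per line, using JSONPath.
list_namespaces() {
  kubectl get namespaces -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}

# Print each resource type for each namespace, skipping empty results.
report_resources() {
  local ns kind out
  for ns in $(list_namespaces); do
    for kind in $RESOURCE_TYPES; do
      # kubectl writes "No resources found" to stderr, so stdout stays empty.
      out="$(kubectl get "$kind" -n "$ns" --no-headers 2>/dev/null)"
      [ -n "$out" ] || continue
      printf '[%s] %s\n%s\n' "$ns" "$kind" "$out"
    done
  done
}
```

Looping per namespace costs more API calls than a single `--all-namespaces` query, but it keeps each block of output labeled and lets empty namespaces drop out of the report.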

No dependencies beyond kubectl and SSH. No dashboards.
Just structured shell scripting with discipline.

What Changed: Real Outcomes, Not Just Outputs

This wasn’t about better visuals—it was about better operational control. Here’s what improved:

🚀 Faster Triage

Teams can now run one command during incidents and immediately see degraded states, stale resources, or recent restarts.
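For a sense of how degraded pods and recent events might be surfaced, here is a hedged sketch. The function names `unhealthy_pods` and `recent_events` are hypothetical, and the filter assumes the default column layout of `kubectl get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
#!/usr/bin/env bash
# Hypothetical sketch; function names and filter rules are assumptions.

# Show pods that are neither Completed nor fully ready and Running.
# Columns with --all-namespaces: NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  kubectl get pods --all-namespaces --no-headers |
    awk '{ split($3, r, "/") }
         $4 != "Completed" && ($4 != "Running" || r[1] != r[2])'
}

# Show the most recent events across namespaces (default: last 20).
recent_events() {
  kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp |
    tail -n "${1:-20}"
}
```

Comparing the READY column as well as STATUS catches pods that report Running while one of their containers is not ready.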

🙌 Team Autonomy

Developers no longer need DevOps help to validate cluster health. They run checks themselves before escalating.

🕵️ Snapshot-Based RCA

Saved reports act as timestamped snapshots—useful for retrospective analysis, incident reviews, or internal audits.
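A timestamped save step along these lines could be built from `tee` and a temporary file. This is only a sketch: `run_report`, `save_snapshot`, and the `cluster-health-<timestamp>.log` naming scheme are assumptions, and `run_report` is a placeholder for the real report sections.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; function names and the file naming scheme are assumptions.

# Placeholder standing in for the real report sections.
run_report() {
  echo "cluster health report"
}

# Print the report and, when a directory is given, keep a timestamped copy.
save_snapshot() {
  local dir="$1" tmp dest
  tmp="$(mktemp)" || return 1
  # tee shows the report on screen while capturing it in the temp file.
  run_report | tee "$tmp"
  if [ -n "$dir" ]; then
    mkdir -p "$dir"
    dest="$dir/cluster-health-$(date -u +%Y%m%dT%H%M%SZ).log"
    mv "$tmp" "$dest"
    echo "Saved to $dest" >&2
  else
    rm -f "$tmp"
  fi
}
```

Calling `save_snapshot ./reports` keeps a copy under `./reports`, while `save_snapshot ""` only prints. UTC timestamps in the file name keep snapshots sortable and comparable across time zones.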

🔄 Context Around Restarts

By surfacing node start time, we quickly correlate incidents with reboots or kernel-level changes.
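A sketch of how the start times might be gathered over SSH is shown below. The `systemctl` call is the one named earlier in the article; pairing it with `uptime -s` for the actual boot time, and the function names themselves, are assumptions added for illustration. The host argument is a hypothetical SSH target.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; function names and the uptime -s comparison are assumptions.

# When the kubelet service last became active on the given host.
kubelet_started_at() {
  ssh "$1" systemctl show -p ActiveEnterTimestamp kubelet | cut -d= -f2-
}

# When the host itself booted (procps "uptime -s").
node_booted_at() {
  ssh "$1" uptime -s
}

# Print both so a kubelet restart can be told apart from a full reboot.
print_node_timing() {
  printf 'Node booted:     %s\n' "$(node_booted_at "$1")"
  printf 'Kubelet started: %s\n' "$(kubelet_started_at "$1")"
}
```

If the two timestamps are far apart, the kubelet was likely restarted without a reboot, which changes how an incident timeline should be read.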

Key Takeaways

  • Visibility scales with clarity, not just tooling. Even single-node clusters benefit from structured health checks.
  • Command-line output can be just as operational as a dashboard—when it’s clean, contextual, and shareable.
  • Simple tooling frees up DevOps bandwidth and empowers developers to act faster.

This script isn’t a monitoring replacement—it’s a visibility multiplier.

In high-velocity environments, that’s often the edge that matters most.
