E2E Test Race Condition With ClusterResourceSets

by Jule
Running make test-e2e against a Proxmox-provisioned cluster often stalls at control plane boot: Cluster API reports setup as finished before all Proxmox nodes are ready. The root cause is a race condition in ClusterResourceSet initialization, where the system marks nodes as ready too early. This skips critical setup steps such as installing the Calico CNI, leaving the cluster stuck. Here's the breakdown:

  • The cluster deploys infrastructure but fails to spin up all control plane nodes
  • The ClusterResourceSet reports completion prematurely
  • Calico never installs, and e2e tests hang until failure

More broadly, this reflects a familiar tension in distributed systems: speed versus safety. Kubernetes teams prioritize rapid deployment, but in Proxmox environments an early completion flag can mask the fact that nodes are not actually ready. This is not just a bug; it is also a human factor. Teams expect instant results, not sequential handshakes. Think of it like handing off a relay baton too soon.

The easily missed detail is that ClusterResourceSet completion is a flag, not a finish line. It does not verify node health or network stability. Proxmox machines may be composite virtual nodes, which complicates readiness checks. A quick double-check with kubectl get nodes after the resource set reports completion often uncovers unreachable nodes or stalled pods.
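That double-check can be scripted. The following is a minimal sketch, not part of the original test suite: it assumes kubectl is on the PATH and that you have a kubeconfig for the workload cluster. The function names not_ready_nodes and fetch_nodes are hypothetical helpers introduced only for illustration.

```python
import json
import subprocess


def not_ready_nodes(node_list: dict) -> list:
    """Return the names of nodes whose Ready condition is not "True".

    `node_list` is the parsed output of `kubectl get nodes -o json`.
    A node with no Ready condition at all is treated as not ready.
    """
    failing = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            failing.append(node["metadata"]["name"])
    return failing


def fetch_nodes(kubeconfig: str) -> dict:
    """Run `kubectl get nodes -o json` and parse the result.

    `kubeconfig` is the path to the workload cluster's kubeconfig
    (an assumption of this sketch; adapt to how your tests obtain it).
    """
    result = subprocess.run(
        ["kubectl", "--kubeconfig", kubeconfig, "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling not_ready_nodes(fetch_nodes("/path/to/kubeconfig")) right after the resource set reports completion returns the names of any nodes that are not yet Ready; an empty list means every listed node has reported Ready.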

But there is a catch: skipping node readiness verification risks flaky monitoring and false test passes. To stay safe, treat ClusterResourceSet completion as a starting point, not a green light. Always validate node status before proceeding. For better control, consider using a host-network CNI during init, or deploying with an operator that waits for all nodes to finish booting. These tweaks prevent silent hangs and build trust in end-to-end pipelines.
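One way to validate node status before proceeding is a polling gate with a timeout. This is a rough sketch under stated assumptions, not the project's actual fix: wait_for_ready_nodes and count_ready are hypothetical names, and fetch stands for any callable that returns the parsed JSON of kubectl get nodes -o json. The timeout and interval defaults are arbitrary placeholders.

```python
import time
from typing import Callable


def count_ready(node_list: dict) -> int:
    """Count nodes whose Ready condition has status "True"."""
    ready = 0
    for node in node_list.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") == "Ready" and cond.get("status") == "True":
                ready += 1
                break
    return ready


def wait_for_ready_nodes(
    fetch: Callable[[], dict],
    expected: int,
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until at least `expected` nodes are Ready or the timeout expires.

    Returns True when the cluster reached the expected count, False on
    timeout. `fetch` is always called at least once.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            ready = count_ready(fetch())
        except Exception:
            # Broad on purpose in this sketch: the API server may not be
            # reachable yet, so a failed fetch counts as "nothing ready".
            ready = 0
        if ready >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A test harness could call this gate with the expected control plane and worker count before starting the e2e specs, and fail fast with a clear message when it returns False instead of hanging.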

Is your cluster ready before the test starts? The real race is getting the timing right, not just being fast.