Ephemeral Clusters

Ephemeral clusters address many of the challenges ops teams commonly face. The capabilities they enable improve both agility and stability, so it’s worth understanding how they work.

Ephemeral clusters are short-lived clusters, where “short” could be anything from less than an hour to a month or so. To be effective they should be stateless and immutable, at least in terms of their own configuration. So while app teams might deploy onto an ephemeral cluster many times before it’s torn down, security updates and config changes to the cluster itself shouldn’t be applied in place. Instead, a new ephemeral cluster should be created, tested and brought into service, and the old one torn down.
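
The replace-rather-than-modify cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Cluster`, `roll_out` and the `passes_tests` callback are hypothetical names, not part of any real tool, and actual provisioning and teardown are left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)  # frozen: a cluster's config can't be edited in place
class Cluster:
    name: str
    config_version: str


def roll_out(live: Cluster, new_version: str,
             passes_tests: Callable[[Cluster], bool]) -> Tuple[Cluster, Cluster]:
    """Return (cluster that should serve traffic, cluster to tear down).

    Hypothetical sketch: a config change never touches `live`. It always
    produces a brand-new candidate cluster, which is promoted only if it
    passes its tests.
    """
    candidate = Cluster(name=f"prod-{new_version}", config_version=new_version)
    if passes_tests(candidate):
        return candidate, live   # promote the new cluster, retire the old one
    return live, candidate       # keep the old cluster, discard the new one
```

Because `Cluster` is frozen, an attempt to assign to `live.config_version` raises an error, which mirrors the rule that the only way to change a cluster’s configuration is to build a replacement.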

The ability to create multiple, fully-provisioned ephemeral clusters creates some very useful capabilities:

  • Each developer or team could have their own dev cluster, and tear it down when they’re done.
  • Testing/staging clusters can be spun up and torn down, either before each prod deployment or daily/weekly depending on your requirements.
  • Frequently test your disaster recovery processes for no extra effort. If you’re regularly creating and tearing down your clusters, you’ll reduce the risk of uncommitted ad hoc or manual changes tainting them, which gives you more confidence that you could recover from a disaster.
  • Simplify going multi-region - if you’re already deploying your cluster to one cloud region it’s simple to deploy to another if you’ve followed the best practices.
  • Ease cluster upgrades - you could bring up a new instance of your prod cluster in a non-live account to test a Kubernetes rolling upgrade before applying it in prod, or go one step further and use a blue/green release process. In that case you could create an entirely new prod cluster in your prod account and gradually shift more and more traffic onto it. If all goes well you could tear down your old cluster; if not, just redirect all traffic back to the original. Of course, this would require something at the edge, such as a load balancer or weighted DNS records, to control what percentage of traffic is sent to each target cluster.
  • Aid compliance with, for example, PCI requirements by making your deliverable artefact your entire cluster. When dealing with PCI regulations, the less that’s running in your cluster and the less connectivity it has elsewhere, the better. Kapps make it simple to build new clusters with specific software installed, so where necessary you can replicate parts of your architecture instead of sharing them.
  • Test long-lived cluster upgrades. For example, if you have several terabytes of log data in your monitoring cluster, you’d probably want that cluster to be long-lived rather than ephemeral. Ephemeral clusters could still be useful, though, because you could use them to bring up a test cluster with the same versions of software installed. You could then copy over a subset of your logging data and test the update you plan to carry out.
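
To make the blue/green upgrade idea from the list above more concrete, here is a minimal Python sketch of gradually shifting traffic from the old (“blue”) cluster to the new (“green”) one, with a rollback if a health check fails. All names here (`pick_cluster`, `shift_traffic`, `healthy_at`) are hypothetical and chosen for illustration. In practice the weighting would be handled by whatever sits at the edge rather than by application code.

```python
import random
from typing import Callable, Sequence


def pick_cluster(green_percent: int, rng: random.Random) -> str:
    """Route one request: 'green' for roughly green_percent% of calls."""
    # randrange(100) yields 0..99, so 0 never picks green and 100 always does.
    return "green" if rng.randrange(100) < green_percent else "blue"


def shift_traffic(steps: Sequence[int],
                  healthy_at: Callable[[int], bool]) -> int:
    """Walk through increasing green percentages.

    Returns the final percentage of traffic on the green cluster, or 0 if
    a health check failed and all traffic was sent back to blue.
    """
    green_percent = 0
    for green_percent in steps:
        if not healthy_at(green_percent):
            return 0  # roll back: redirect everything to the original cluster
    return green_percent
```

For example, `shift_traffic([10, 50, 100], lambda p: True)` returns 100, at which point the old cluster could be torn down, while a health check that fails at any step returns 0 and leaves the original cluster serving all traffic.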