I was recently reading this post telling everyone (strenuously) to turn off CPU limits in #Kubernetes. I could not disagree more, for most production environments.

There are a few cases where I do think it makes sense. That is when all of your teams:

  • Are extremely aware of the performance characteristics of all of their services.
  • Have benchmarks that will show a CPU performance degradation before a change ships.
  • Have monitoring and alerting for when their services regularly run over their requested CPU, and have processes in place to act on it.
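
To make that last point concrete, here is a minimal sketch of what "regularly running over requested CPU" could look like as a check. The function names, the list-of-samples input, and the 20% threshold are all illustrative assumptions, not taken from any particular monitoring tool.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" into cores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        # Millicores: "500m" means half a core.
        return int(quantity[:-1]) / 1000
    return float(quantity)


def regularly_over_request(usage_cores, cpu_request, max_fraction=0.2):
    """Return True when more than max_fraction of samples exceed the request.

    usage_cores is a list of CPU usage samples in cores (hypothetical input;
    in practice these would come from your metrics system).
    max_fraction is an arbitrary illustrative threshold.
    """
    if not usage_cores:
        return False
    request = parse_cpu(cpu_request)
    over = sum(1 for sample in usage_cores if sample > request)
    return over / len(usage_cores) > max_fraction


# Example: two of four samples are above a 500m request.
print(regularly_over_request([0.2, 0.6, 0.7, 0.3], "500m"))
```

A check like this only helps if someone is alerted and acts on it, which is the process half of that bullet.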

I doubt that most people are in that situation. It might hold in a large company where each team has only 1-2 services to manage.

The problem for the other 95% of us is that allowing services to use "free CPU" means you don't really have any forcing function to catch a change that hogs a bunch of CPU. You also end up with Heisenbugs that only appear when a particular set of services happens to be co-deployed on the same nodes and some particular situation occurs.

27 years in the industry, many of them in ops, tell me that the money saved by widely over-subscribing CPUs is not worth the developer and ops time required to debug and support these things. And most organizations don't have the built-in maturity for that trade-off to make sense. CPU is expensive, but dev time is much more so.

So, "for the love of God" as they say in the post, please use both requests and limits unless you can check off all those points above.
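
If you want a quick way to see where you stand, here is a small sketch that lists the containers in a pod spec that are missing either a CPU request or a CPU limit. It assumes the spec has already been loaded into a plain dict (for example from a YAML manifest). The function name and the sample spec are hypothetical; the `resources.requests.cpu` and `resources.limits.cpu` fields are the standard Kubernetes ones.

```python
def containers_missing_cpu_settings(pod_spec: dict) -> list:
    """Return names of containers lacking a CPU request or a CPU limit."""
    missing = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        limits = resources.get("limits") or {}
        # Flag the container unless both values are written in the manifest.
        if "cpu" not in requests or "cpu" not in limits:
            missing.append(container["name"])
    return missing


# Hypothetical pod spec: "api" sets both values, "worker" sets only a request.
example_spec = {
    "containers": [
        {
            "name": "api",
            "resources": {
                "requests": {"cpu": "250m"},
                "limits": {"cpu": "500m"},
            },
        },
        {
            "name": "worker",
            "resources": {"requests": {"cpu": "250m"}},
        },
    ]
}

print(containers_missing_cpu_settings(example_spec))
```

This only inspects what is written in the manifest, so treat it as a starting point and not a full policy check.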