The Eclipse Foundation jumped on the Kubernetes bandwagon a few years ago, for the same reasons as everyone else. Our need at the time was for a scalable, fault-tolerant solution for our Jenkins-based CI system. We started small by repurposing older hardware, and after some early successes, the cluster grew through a combination of new iron and more repurposed servers.
As cluster usage grew, we started receiving feedback about slow and fluctuating build times. You see, we'd typically purchase hardware for low I/O, parallel operations - lots of CPU cores, lots of RAM, for those hundreds of web requests per second we typically handle. As it turns out, although these machines can handle thousands of simultaneous short-lived connections, they suck at single-threaded operations that run for 20 minutes.
For new hardware, we moved away from the "big iron" model, choosing instead smaller units with fewer but much faster CPU cores, and faster memory buses. To save money up-front, we'd equip them with inexpensive HDDs, with the understanding that local I/O wasn't much of a thing. Now, I know what you're thinking: HDDs? Duh, get SSDs, it's a no-brainer. Our release engineers Mikaël and Fred have been pleading the case for SSDs in build machines for years. I had simply underestimated the impact of local disk I/O - either when multiple disk-intensive pods were scheduled at the same time, or when images were pulled to the local node for spin-up. An 8-minute build could turn into a 27-minute build later on, for no obvious reason.
We've since been retrofitting all our worker nodes with SSDs, obviously with much success. The fast machines are now consistently fast, and the older iron performs adequately. And with some benchmarking (thanks for the data and image, Mikaël Barbero), we're able to identify worker nodes that are simply outclassed, such as third-from-the-left "okdnode-12", which is headed for a permanent retirement.
With the targeted use of labels, we can reserve older hardware for typical website applications where slower core speed is appropriate for those short-lived connections.
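For readers who haven't used labels this way, here is a minimal sketch of the idea, not our actual configuration. It builds a Pod manifest whose `nodeSelector` only matches nodes carrying a given label. The label key `example.org/hardware-class`, the value `legacy`, and the image name are all made up for illustration.

```python
import json

# Hypothetical label key; the real key and values used on the cluster
# are not given in this post.
LABEL_KEY = "example.org/hardware-class"


def pod_manifest(name, image, hardware_class):
    """Return a minimal Pod manifest pinned to nodes with the given label value."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # nodeSelector restricts scheduling to nodes whose labels match.
            "nodeSelector": {LABEL_KEY: hardware_class},
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    # Illustrative values only: a website pod kept on older hardware.
    manifest = pod_manifest("website", "example.org/website:latest", "legacy")
    print(json.dumps(manifest, indent=2))
```

The matching label would first be put on the older nodes, for example with `kubectl label nodes <node-name> example.org/hardware-class=legacy`. Note that a selector like this only keeps the website on the old machines. Keeping builds off them takes a matching selector on the build pods, or taints on the nodes.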
We're doing our best to provide Eclipse projects with a reliable, performant, expandable and consistent platform for running builds, without breaking the bank. It's a learning curve for sure, and we're getting there.