
Anatomy of a server failure

Tuesday, February 11, 2020 - 09:51 by Denis Roy

Last Friday, Feb 7 at around 12:30pm (Ottawa time), I received a notification from Fred Gurr (part of our release engineering team) that something was going on with the infra. The multitude of colours on the Eclipse Service Status page confirmed it -- many of our services and tools were either slow or unresponsive.

After some initial digging, we discovered that the primary backend file server (housing Git, Gerrit, web session data, and a lot of files for our various web properties) was not responding. It was also host to our accounts database -- the center for all user authentication.

Jumping into action

It's a well-rehearsed routine for colleague Matt Ward and me -- he worked on assessing the problem and identifying the fix, while I worked on Plan B: failover to our hot standby. At around 1:35pm, roughly 1 hour into the outage, Matt made the call -- failover was the only option, as a hardware component had failed. 20 minutes later, most services had either recovered or were well on their way.

But the failover is not perfect. Data is sync'ed every 2 hours. Account and authentication info is replicated nightly. This was a deliberate design decision, as it offers us a recovery window in case of data erasure, corruption or unauthorized access.
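To make that trade-off concrete, here is a minimal sketch of that kind of staggered replication. The 2-hour file sync and nightly account replication come from the description above; the host name, paths, and the use of rsync and mysqldump are assumptions for illustration, not a description of our actual setup.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a staggered replication schedule.

The 2-hour file sync and nightly account replication are from the
post; the standby host, paths, and tools are assumptions.
"""
import subprocess

STANDBY = "standby.example.org"  # hypothetical hot-standby host


def sync_files():
    # Mirror file data (Git, Gerrit, web content) to the standby.
    # Run every 2 hours, so up to 2 hours of changes can be lost on
    # failover -- but that same lag is the recovery window against
    # erasure or corruption propagating instantly to the standby.
    subprocess.run(
        ["rsync", "-a", "--delete", "/srv/data/", f"{STANDBY}:/srv/data/"],
        check=True,
    )


def replicate_accounts():
    # Dump the accounts database and ship it to the standby nightly.
    with open("/tmp/accounts.sql", "wb") as dump:
        subprocess.run(["mysqldump", "accounts"], stdout=dump, check=True)
    subprocess.run(
        ["scp", "/tmp/accounts.sql", f"{STANDBY}:/srv/backups/"],
        check=True,
    )


if __name__ == "__main__":
    # In practice these would be driven by cron on their own schedules
    # (every 2 hours / nightly); here we simply run both once.
    sync_files()
    replicate_accounts()
```

The point of the sketch is the lag itself: it is the price of having a usable last-known-good copy, and it is exactly why the failover cannot be perfect.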

Lessons learned

The failed server was put in service in 2011, celebrating its *gasp* ninth year of 24/7 service. That is a few years too many, and although it and its standby counterpart were slated for replacement in 2017, the effort was pushed back to make room for competing priorities. In a moment of bitter irony, the failed hardware was scheduled to be replaced in the second quarter of this year -- mere months away. We gambled with the house, and we lost.

Cleaning up

Today, there is much dust to settle. Our authentication database has some gremlins that we need to fix, and there could be a few commits, pushed after the last sync, that were never replicated to the standby.
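The check for those commits is conceptually simple: compare a known-good copy of a repository (say, a committer's up-to-date clone) against the restored standby and list anything the standby never received. A rough sketch, assuming hypothetical remote names "origin" (known-good) and "standby" (restored copy) and a "master" branch:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: list commits present in one copy of a
repository but missing from another. Remote names and the branch
are assumptions for illustration."""
import subprocess


def missing_commits(repo_path, good_remote="origin",
                    restored_remote="standby", branch="master"):
    # Fetch both remotes so the comparison uses fresh refs.
    for remote in (good_remote, restored_remote):
        subprocess.run(["git", "-C", repo_path, "fetch", remote], check=True)

    # Commits reachable from the known-good copy but not from the
    # restored standby -- i.e. work that fell inside the sync window.
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--oneline",
         f"{restored_remote}/{branch}..{good_remote}/{branch}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for line in missing_commits("/path/to/repo"):
        print(line)
```

Anything such a comparison turns up fell inside the replication window and will need to be pushed again.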

We also need to source replacement hardware for the failed component, so that we can re-enable our hot standby. At the same time, we need to immediately source replacement servers for those 2011 dinosaurs. They've served us well, but their retirement is long overdue.