Facebook’s largest outage in history was caused by an erroneous command, in what the social media giant called “an error of our own making”.
“We’ve done extensive work hardening our systems to prevent unauthorised access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making,” said the post published on Tuesday.
Santosh Janardhan, Facebook’s vice president of engineering and infrastructure, explained in the post why and how the six-hour shutdown occurred and the technical, physical and security challenges the company’s engineers faced in restoring services.
The primary cause of the outage was an erroneous command issued during routine maintenance work, according to Mr Janardhan.
Facebook’s engineers were forced to physically access the data centres that form its “global backbone network” and overcome several hurdles to fix the damage the erroneous command had caused.
Once these errors were fixed, however, they faced another challenge: managing the “surge in traffic” that would follow as services came back online.
Mr Janardhan, in the post, explained how the error was triggered “by the system that manages our global backbone network capacity.”
“The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fibre-optic cables crossing the globe and linking all our data centres,” the post said.
Every Facebook user request, from loading a news feed to accessing messages, travels over this network, which carries traffic between the company’s smaller facilities and its larger data centres.
To effectively manage these centres, engineers perform day-to-day infrastructure maintenance, including taking part of the “backbone” offline, adding more capacity or updating software on routers that manage all the data traffic.
“This was the source of yesterday’s outage,” Mr Janardhan said.
“During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centres globally,” he added.
What complicated matters was that the erroneous command was not caught before it took effect, because a bug in the company’s audit tool prevented it from stopping the command, said the post.
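The role such an audit tool plays can be sketched as a pre-execution check that refuses to run commands capable of severing the whole backbone at once. The function and command names below are invented for illustration; this is not Facebook’s actual tooling, whose bug is what allowed the command through.

```python
def audit_command(command, blocklist=("disconnect-all", "withdraw-routes")):
    """Illustrative pre-execution audit: refuse commands that would
    sever every backbone connection at once. All names here are
    hypothetical, not Facebook's real tool."""
    for dangerous in blocklist:
        if dangerous in command:
            raise PermissionError(f"audit blocked: {command!r}")
    return True

# A routine capacity-assessment command passes the check...
print(audit_command("assess-capacity --region eu"))  # True

# ...while a backbone-severing one is refused -- the safeguard that,
# in Facebook's case, a bug prevented from working.
try:
    audit_command("disconnect-all")
except PermissionError as err:
    print(err)
```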
A “complete disconnection” between Facebook’s data centres and the internet followed, which “caused a second issue that made things worse”.
With the entirety of Facebook’s “backbone” out of operation, the data centre locations designated themselves as “unhealthy”.
“The end result was that our DNS servers became unreachable even though they were still operational,” said the post.
The Domain Name System (DNS) translates the web addresses users type into Internet Protocol (IP) addresses that machines can read.
“This made it impossible for the rest of the internet to find our servers.”
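That translation can be seen in a few lines of Python. The name “localhost” is used here because it resolves locally without a network query; a real domain would require a lookup against DNS servers on the internet, the very servers that, in Facebook’s case, were still running but could no longer be reached.

```python
import socket

# DNS turns a human-readable name into the numeric IP address machines
# use. "localhost" resolves locally, so no external DNS server is needed.
ip = socket.gethostbyname("localhost")
print(ip)  # 127.0.0.1
```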
Mr Janardhan said this gave rise to two challenges. The first was that Facebook’s engineers could not access the data centres through normal means because of the network disruption.
The second was that the internal tools the company normally uses to address such issues were themselves broken by the outage.
The engineers were forced to go onsite to these data centres, where they would have to “debug the issue and restart the systems”.
This, however, was no easy task, because Facebook’s data centres are protected by physical and security safeguards designed to make them “hard to get into”.
Mr Janardhan noted that the company’s routers and hardware are designed to be difficult to modify even with physical access.
“So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online,” he said.
Engineers then faced a final hurdle – they could not simply restore access to all users worldwide, because the surge in traffic could result in more crashes. Reversing the vast dips in power usage by the data centres could also put “everything from electrical systems to caches at risk”.
“Storm drills” previously conducted by the company meant they knew how to bring systems back online slowly and safely, the post said.
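The slow, staged restart that such drills rehearse can be sketched as a ramp-up that admits traffic in increasing percentages, with a health check gating each step. The percentages, function name and always-passing check below are invented for illustration and are not Facebook’s actual procedure:

```python
import time

def restore_traffic(ramp_steps=(1, 5, 25, 50, 100), pause_seconds=0):
    """Illustrative staged ramp-up: admit traffic in growing
    percentages instead of restoring 100% at once, so the surge
    cannot overwhelm power systems and cold caches."""
    admitted = []
    for pct in ramp_steps:
        # In a real system a health check on load, power draw and
        # cache hit rates would gate each step; here it always passes.
        healthy = True
        if not healthy:
            break
        admitted.append(pct)
        time.sleep(pause_seconds)  # let caches warm and load stabilise
    return admitted

print(restore_traffic())  # [1, 5, 25, 50, 100]
```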
“I believe a tradeoff like this is worth it – greatly increased day-to-day security vs a slower recovery from a hopefully rare event like this,” Mr Janardhan concluded.
Facebook’s outage – which affected all its services, including WhatsApp and Instagram – led to a personal loss of around $7bn for chief executive Mark Zuckerberg as the company’s stock value dropped. Mr Zuckerberg has apologised to users for any inconvenience the break in service caused.