Facebook outage caused by error during maintenance work, company says

The social network went offline for more than five hours on Monday
Martyn Landi, 6 October 2021

The Facebook outage which took the social network, as well as Instagram and WhatsApp, offline for more than five hours was caused by an error during a routine maintenance job, the company has said.

Billions of the platforms’ users had been left unable to get online on Monday by the fault, which the company said was “an outage caused not by malicious activity, but an error of our own making”.

Santosh Janardhan, Facebook’s vice president of infrastructure, said that during what was “routine maintenance work” on the firm’s backbone network “a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centres globally”.

Writing in a blog post, he said: “Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.

“This change caused a complete disconnection of our server connections between our data centres and the internet. And that total loss of connection caused a second issue that made things worse.”

Mr Janardhan said it also took time to fix because of the way Facebook’s servers are designed, in order to offer better physical security.

“They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them,” he said.

He confirmed that Facebook then had to bring the servers back online slowly, to avoid any further issues.

“We knew that flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic,” he said.

“Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one.

“After every issue, small and large, we do an extensive review process to understand how we can make our systems more resilient. That process is already under way.”

As well as sparking debate about the public use of social media, the outage also saw EU competition commissioner Margrethe Vestager repeat calls for greater competition in the tech sector – saying the incident highlighted the negative impact of big tech firms controlling large swathes of the online world.

“We need alternatives and choices in the tech market, and must not rely on a few big players, whoever they are,” she wrote on Twitter.
