
Facebook is gone: How did we die that year?

I rarely use Facebook during work hours. Since I started working from home, I usually close Facebook completely, because I often have to share my screen in online meetings. Yesterday morning at nine o'clock, a colleague suddenly pinged me and said, "Facebook is gone." I tried several times myself, and it was true: Facebook was gone.

My browser even kindly asked whether I had misspelled "facebook.com." We are website operators ourselves, so it is inevitable that something like this makes us a little curious and even a little excited. We all want to know how other people died; it is part of the learning process. We have died ourselves and made it onto CNN; we watched Amazon die, and we watched Facebook die once before, in 2019. From each different death, the experts can usually tell roughly where the problem lies.

How We Died That Year

Sometimes I wish there were a big international conference called "How We Died That Year." Websites crash, and everyone keeps it a secret. Successes are shared everywhere, but failures are never discussed. As a result, others keep repeating the same mistakes that killed us, while those who have not died yet grow complacent… not knowing that if they do not learn, their turn will come soon.

The world is strange. People can share code, but no one shares their failures. You have to experience failure firsthand to learn and improve. The maturity and stability of a website are learned through blood, sweat, and tears. This is not taught in school, not found in books, and few experts can teach you how to run a website the size of Facebook.

Below is a screenshot of Facebook’s outage that I captured yesterday:

[Screenshot: the browser error page shown for facebook.com during the outage]

Usually, when a website is down, the error message does not look like this. The usual error tells you that the address was found but the house cannot be entered, or that you can get into the house but not into a particular room. That is mostly a server problem; the address itself is fine. Facebook's problem was that the whole world was being told, "Sorry, we couldn't find this address." You typed facebook.com, it went around in circles, and then it told you the place does not exist.

This is a big deal. Facebook disappeared from the world in an instant.

Below is a screenshot of Amazon's outage. I do not need to explain it; the difference between the two screenshots is obvious. Amazon's address is correct, but the room cannot be entered. They can at least use that page to tell you what happened, apologize, and use a little humor to ease your frustration. In Facebook's case, the address did not even exist.

[Screenshot: Amazon's outage error page]
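
For anyone who wants to see that difference in something more concrete than a screenshot, here is a minimal Python sketch of my own (not anything Facebook or Amazon actually runs) that separates "the address does not exist" from "the address exists, but the server cannot serve you":

```python
# Toy diagnostic: distinguish "address not found" (what the world saw for
# facebook.com) from "address found, but the server errors out" (the Amazon case).
# Standard library only; the hostnames are just examples.
import socket
import urllib.error
import urllib.request


def diagnose(hostname: str) -> str:
    try:
        socket.getaddrinfo(hostname, 443)      # can the name be resolved at all?
    except socket.gaierror:
        return "Address not found: the name does not resolve"

    try:
        urllib.request.urlopen(f"https://{hostname}/", timeout=5)
        return "Address found, server answered normally"
    except urllib.error.HTTPError as err:      # resolved and connected, but the server refused
        return f"Address found, but the server returned an error: HTTP {err.code}"
    except (urllib.error.URLError, socket.timeout):
        return "Address found, but the server could not be reached"


if __name__ == "__main__":
    for site in ("facebook.com", "example.com"):
        print(site, "->", diagnose(site))
```

During yesterday's outage, facebook.com fell into the very first branch; Amazon's outage fell into the later ones.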

How could such a big website just disappear like that? How could an address visited tens of millions of times a day suddenly vanish?

Facebook Disappeared

Facebook disappeared from the earth for a total of six hours, and the disappearance was complete: subsidiaries and internal networks included. Every major website has experienced downtime, but never one this clean and this long.

First of all, we like to blame "crashes" on hardware, but the vast majority of website downtime has nothing to do with hardware. Hardware certainly dies, and it dies often, but websites have countless backup machines that can take over automatically, so hardware problems cause only local, temporary damage. Modern software design and data flows already account for this. On a well-designed website, a hardware crash is just a burp; it happens every hour on a large site, and most users never notice. A website the size of Facebook may run over a million machines, and we can do the math: if each machine fails once a year, that is a million failures spread over roughly 8,760 hours, or more than a hundred servers failing every hour, with the million-strong army taking turns. If that cannot be handled, there is no point operating the website at all. So the vast majority of website problems have nothing to do with hardware crashes.
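
The back-of-the-envelope arithmetic fits in a few lines; the one-failure-per-machine-per-year rate is simply the working assumption from the paragraph above:

```python
# Back-of-the-envelope: hardware failures per hour on a million-machine fleet,
# assuming each machine fails roughly once a year (the figure used above).
MACHINES = 1_000_000
FAILURES_PER_MACHINE_PER_YEAR = 1
HOURS_PER_YEAR = 365 * 24            # 8,760

failures_per_hour = MACHINES * FAILURES_PER_MACHINE_PER_YEAR / HOURS_PER_YEAR
print(f"Expected hardware failures per hour: {failures_per_hour:.0f}")
# -> roughly 114 per hour, i.e. a hardware "burp" every half a minute or so,
#    which is why failover has to be automatic and invisible to users.
```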

If it’s not a hardware problem, where does the problem lie? It lies with you and me.

Website problems are like plane crashes: about 70% of them come down to human factors. I remember watching a Netflix series called "Why Planes Crash," and most crashes start with small, non-fatal causes such as weather or mechanical trouble, and end with fatal human errors. Many disasters are even caused by several human errors in a row.

God usually doesn’t give four chances.

All planes in the world are the same, so why do some airlines always have problems? Because of "human factors." But when you say human factors, it is really a management problem, and when you say management problem, it is really a corporate culture problem. You cannot pin a failure this big on the person who made the mistake. What needs fixing is the culture and the top management.

This time it is easy to guess that Facebook's complete, clean disappearance came from a problem with its entry switches. In a website this large there are layers upon layers of networks, and at the outermost point, where it meets the rest of the world, there must be a set of switches announcing to everyone, "Facebook is here." Think of it as a lighthouse. In an instant that lighthouse went out, the whole world no longer knew where Facebook was, the dominoes fell, and a few minutes later Facebook had vanished from the face of the earth. As for why the lighthouse went out, nobody knows the real story. The news only mentioned that it happened after an internal network update; my suspicion is that a network engineer changed important files on a live switch and introduced an error. There are usually files on these switches that should never be touched, and once those files go wrong, the whole site can go down. This happens to almost every website; what matters is which layer of switch is affected. The closer to the entrance, the bigger the impact. This time the affected switches were apparently the ones connecting the entrance to the outside world, so the impact was total.
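
To make "the closer to the entrance, the bigger the impact" concrete, here is a toy model (the layer names are my own invention, not Facebook's real architecture). A request only succeeds if every layer from the edge inward is healthy, so losing the outermost layer takes out everything at once, while losing an inner service only hurts whatever sits behind it:

```python
# Toy model of layered infrastructure: a request must pass every layer from the
# edge inward. The layer names are invented for illustration.
LAYERS = ["edge-switch", "load-balancer", "web-tier", "app-tier", "database"]


def reachable(down: set[str]) -> list[str]:
    """Return the layers the outside world can still reach, given the failed ones."""
    alive = []
    for layer in LAYERS:                 # walk from the entrance inward
        if layer in down:
            break                        # everything behind a dead layer is cut off
        alive.append(layer)
    return alive


# An inner failure is painful but partial; an edge failure is total darkness.
print(reachable({"database"}))       # ['edge-switch', 'load-balancer', 'web-tier', 'app-tier']
print(reachable({"edge-switch"}))    # []  <- the lighthouse is out
```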

[Photo: a network monitoring center]

A network monitoring center after hours. Image source: Lu Yu

The best way to keep a website stable is to never change anything, because as soon as there are changes, there will be errors. If you disassemble a car engine and reassemble it without replacing a single part, I can guarantee the engine will not run the same as before. Of course, no website can avoid change; a site of this size may make up to a million changes a year. So the key question is: how do you make a million changes a year without breaking anything? That is a professional problem, and it takes a very strict management team to execute.

If you think of website operations as an onion, the network engineers are its core, and their pay is usually the highest. Pushing outward layer by layer, the next layer might be the data platform engineers, then the database engineers, and so on out to the outermost layer. The further out you go, the more often the technology changes, but the smaller the impact of each change. The core infrastructure, such as the network and data platforms, rarely changes its hardware, software, or settings, but when it does change, the impact is enormous.

So in website operations, a lot depends on which layer of the onion you choose. If you are the heart of the onion, do not touch what must not be touched unless you truly have to; on ordinary days life is pretty good. When you do have to touch it, be prepared for the consequences. I often sympathize with the network engineers at the bottom of the food chain. They earn a lot and the market wants them, but theirs is a job that absolutely cannot make mistakes. Few of my network engineer colleagues have grown old with the company; they either get poached or get fired for a mistake.

Engineers who keep a large website running must pass strict "change process" training and a certification exam, and the certification is renewed every year. Newcomers cannot touch a server until they are certified. Every change must be classified by risk and follow established procedures step by step, with each step approved by the monitoring personnel sitting alongside. High-risk changes must avoid peak hours, be signed off by someone at VP level, and come with a rollback plan. Handling website changes takes military-style management and discipline.

I was surprised that Facebook chose the golden hour of 9:00 AM US time to change its entry switches. Convert Facebook's daily traffic (which I estimate at around 10 to 15 terabytes) into peak traffic and you get something like 2 to 3 billion per second. With traffic like that, a change this significant should never have been approved without a very good reason. Changing important switch files may not be reversible, and while I am not an expert in this area, that means two or even three people should confirm there is no problem before anyone clicks "confirm." Even if you have done it a thousand times, you still have to be careful; the devil is in the details. Almost every major crash in history has been caused by a switch failure.
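
None of this is exotic; most large operations teams encode rules like these somewhere in their tooling. Below is a minimal sketch of such a gate. The risk levels, peak-hour window, and approval roles are my assumptions, not Facebook's actual policy:

```python
# Minimal change-approval gate sketching the rules described above:
# risk classification, peak-hour freeze, VP sign-off, and a mandatory rollback plan.
# All thresholds and role names are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ChangeRequest:
    description: str
    risk: str                                             # "low", "medium", or "high"
    rollback_plan: str = ""
    approvals: list[str] = field(default_factory=list)    # roles that have signed off


PEAK_HOURS = range(8, 18)        # assumed "golden time"; freeze high-risk work here


def may_proceed(change: ChangeRequest, now: datetime) -> tuple[bool, str]:
    if change.risk == "high":
        if now.hour in PEAK_HOURS:
            return False, "High-risk change blocked during peak hours"
        if not change.rollback_plan:
            return False, "High-risk change requires a rollback plan"
        if "VP" not in change.approvals:
            return False, "High-risk change requires VP sign-off"
        if len(change.approvals) < 2:
            return False, "High-risk change requires a second pair of eyes"
    return True, "Approved to proceed"


edge_change = ChangeRequest(
    description="Update routing config on the entry switches",
    risk="high",
    approvals=["on-call monitor"],
)
print(may_proceed(edge_change, datetime(2021, 10, 4, 9, 0)))
# -> (False, 'High-risk change blocked during peak hours')
```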


If they did not follow this kind of procedure yesterday, then the engineer who clicked "confirm" is merely the scapegoat; this is clearly a management problem. The media have also argued that the overall network architecture is flawed. Large websites keep their external and internal networks completely separate; otherwise, as Facebook discovered, no one can come to the rescue when something goes wrong. It is like putting all your eggs in one basket and then guaranteeing that the basket will never fall.

In any case, a single change should never be able to turn off the entire lighthouse, and once the lighthouse is out, there is no easy way to relight it. The point is not the content of the change, but the change process.

This major crash took down the company's internal network as well, so even the network monitoring center and the data centers could not connect. All of the monitoring systems went dark, and no one knew what had happened. Worse still, employees on campus could not even get into the buildings. In the end, a group of network engineers had to be sent to the data center to stand in front of the faulty switches and fix the error with their hands on the machines. With no network connection, even the most skilled engineers in the world could not help remotely.

This is why it took six hours to restore the system.

Collateral Damage

According to some media estimates, yesterday's outage cost Facebook at least tens of millions of dollars. And the loss is not just Facebook's, nor just that of users like us who finally got a day away from the platform. Many small businesses and websites run their business on Facebook: fan pages serve as storefronts for countless shops, and many merchants take orders over WhatsApp. The global impact of this disaster is incalculable.

In addition, countless websites around the world offer the option to "log in with Facebook." With Facebook down, those users could not log in to sites that had nothing to do with the incident. Our own login volume dropped by about 5% yesterday, a relatively minor hit because the vast majority of our users have their own accounts. Smaller websites that rely heavily on Facebook login were hurt far more.
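
For sites that offer "log in with Facebook," the practical lesson is to fail gracefully when the identity provider disappears. A rough sketch of that idea follows; the endpoint URL and function names are placeholders, not the real Graph API client:

```python
# Sketch: degrade gracefully when a third-party login provider is unreachable.
# The endpoint and helper names are placeholders for whatever OAuth client you use.
import urllib.error
import urllib.request

OAUTH_TOKEN_ENDPOINT = "https://oauth.example-provider.com/token"   # placeholder URL


def exchange_code_for_token(code: str) -> str | None:
    """Try the provider's token exchange; return None if the provider is down."""
    try:
        req = urllib.request.Request(OAUTH_TOKEN_ENDPOINT, data=f"code={code}".encode())
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.read().decode()
    except (urllib.error.URLError, TimeoutError):
        return None          # the provider is unreachable; it is not the user's fault


def login_with_provider(code: str) -> str:
    token = exchange_code_for_token(code)
    if token is None:
        # Do not show a generic failure page: say what is going on and offer the
        # site's own account login as a fallback.
        return "Facebook login is temporarily unavailable. Please sign in with your site account."
    return "Logged in via provider"
```

Sites whose only login path is the third-party button have no such fallback, which is why the smaller ones were hit hardest.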

Finally, there is the ugliest consequence of the disaster: scammers immediately took advantage of the situation, sending emails pretending to be from Facebook, claiming the site was back up and asking users to log in again.

Facebook itself was in complete disarray. They needed to explain the outage to the world, but they did not even have a platform of their own to speak from. In the end they were forced to apologize to the world on their competitor, Twitter, and explain that the site had run into problems. Twitter took the opportunity to have some fun at Facebook's expense, greeting the world with "Welcome to Twitter." With 3 billion users, even if only 1% wandered over, it would be a huge business opportunity.

Twitter is one of the few websites that profited greatly from Facebook's disaster.

Facebook has a lot to reflect on after this unprecedented outage, and many people should be held accountable. In culture and leadership style, they have always been a giant with a teenager's mentality. Their growth has far outpaced their maturity, and many things that should have been done were never done.

This kind of critical switch change is something operators all over the world face and execute every day, yet it only failed in their hands, and failed this completely. They should not just find a few people to take the blame and close the case. When something like this happens, senior management cannot escape responsibility; firing a careless engineer solves nothing.

Beyond Facebook, the rest of the world has something to reflect on as well. We all know how tightly the internet binds the world together; we should also know how fragile the thread holding it all is. With the press of a button, one engineer can make everything disappear. A giant like Facebook can vanish in an instant, and the entire online world could follow.

Over the past decade, we have tied everything to that thread… our lives, our work, our families, our friends, our careers, and even our lunches and dinners… everything is tied to that thread. Once the thread is broken, everything tied to it may disappear.

The online world is so fragile that one wrong button press can reset everything we have.
