On the evening of Friday 16th August 2013, Google went down for 5 minutes. As a result, global internet traffic plummeted by 40% during that time. Since then, any issues that the search platform has had have been localised to individual countries or regions.
Today, most household digital names are considered too big to fail. However, over the past few months we have seen a number of outages from most of the major online players, which understandably caused immense panic and frustration. These outages clearly highlight how heavily both businesses and people rely on these services, and the risks that we face when they go down.
On 2nd June, there were Gmail, YouTube and Snapchat outages due to issues with Google’s own Cloud service. These affected the East Coast of the US and parts of Europe. Down Detector, a website that detects when services such as internet providers, mobile providers and online services are down, reported YouTube outages in several countries. Vimeo and Shopify were also somewhat affected.
Gmail and Google Drive suffered a worldwide outage on 7th May. Although some users were able to access their emails, they received error messages when attempting to send them. More recently, Google Calendar suffered an outage for nearly three hours on 18th June, with users encountering the dreaded “404 error” message upon attempting access. To put this into perspective, Gmail has more than 1 billion monthly users, any of whom may have been left without access to their calendar information.
On 13th March and 3rd July, Facebook went down, with Instagram and WhatsApp also affected, preventing people from uploading any pictures or videos. Issues also occurred when users needed to log in to other apps that required Facebook connections.
Cloudflare, a content delivery network (CDN), went down on 2nd July, taking a number of websites with it, ironically including Down Detector. Cloudflare is the world’s most popular CDN service, with 34.55% of the market; Amazon CloudFront is second with 28.84%. More than 16 million sites are protected by Cloudflare, including the likes of BuzzFeed, Sling TV, Pinterest and Dropbox, and all of these services cease to work when it is down. According to multiple reports, a routing issue prevented people from accessing Discord, Google and Amazon (and utilities like Verizon and Spectrum), effectively taking down major services at the start of the working week.
Netflix went down on 20th June for an hour, preventing customers from accessing the service.
On Thursday 28th March, all HubSpot services joined the club and suffered an outage. This included the websites that they host, their CRMs, and anyone working from their portals. That is 56,500 clients with no website or access to their web data and contact information.
Feeling a bit left out, Apple’s iCloud service also went offline for three hours on Thursday 4th July, affecting the App Store, Apple Music and Apple TV.
Amazon claim that they have never had a complete data centre go down. However, an AWS customer suffered an attack from a disgruntled, sacked IT consultant who hit delete on a whole range of business-critical data. He went through his former employer’s AWS accounts with an old login, nuking 23 servers and triggering a wave of redundancies. The culprit did end up serving a jail sentence.
Whether the cause is a cloud service failure, a routing issue, or an internal or external malicious attack, no service is entirely immune. If you and your business rely on any of these services, it might be worth considering a back-up plan to ensure that you have access to critical data when things go wrong or, heaven forbid, your data is irretrievable. What would you do if you lost all your Google Calendar, iCloud, or Google Drive information permanently?
There is arguably a case that compensation may be owed in some instances. For example, if you pay £600 a month for a service and it goes down for a day, are you owed a percentage of your fee?
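To make that question concrete, here is a minimal sketch of a simple pro-rata calculation. It assumes a 30-day billing month and a refund strictly proportional to downtime; the function name and parameters are hypothetical, and a real contract may calculate credits quite differently, or offer none at all.

```python
from decimal import Decimal, ROUND_HALF_UP


def pro_rata_credit(monthly_fee, downtime_hours, days_in_month=30):
    """Return the share of a monthly fee that corresponds to the downtime.

    Assumption: the credit is strictly proportional to the hours lost.
    This is an illustration, not how any particular provider works.
    """
    total_hours = Decimal(days_in_month) * 24
    # Multiply before dividing so that exact results stay exact.
    credit = Decimal(str(monthly_fee)) * Decimal(str(downtime_hours)) / total_hours
    # Round to the nearest penny.
    return credit.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    # The example from the text: a £600 monthly fee and one day (24 hours) down.
    print(pro_rata_credit(600, 24))
```

On these assumptions, one day of downtime in a 30-day month is about 3.3% of the billing period, which works out to £20 of a £600 fee.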
Management and responses
All of these issues were handled in similar ways. For example, Google provided information on their various service dashboards, as well as releasing statements on Twitter. It is essential to remember that if a company does not respond quickly, either on their own status pages or on Twitter, the Twittersphere will do it for them. Regardless of its reliability, Twitter is where people now turn for real-time information if they cannot find it immediately and directly from the source. Providers typically acknowledge the issue officially, state that they are rectifying it (sometimes with an estimate of how long the fix will take), announce once it is resolved, and then apologise.
Google have acknowledged issues with formal statements such as, “The affected users are able to access Google Drive, but are seeing error messages, high latency, and/or other unexpected behavior.”
Matthew Prince, Cloudflare CEO, tweeted: “Aware of major @Cloudflare issues impacting us network-wide. Team is working on getting to the bottom of what’s going on. Will continue to update.”
Facebook tweeted: “We’re aware that some people are having trouble uploading or sending images, videos and other files on our apps. We’re sorry for the trouble and are working to get things back to normal as quickly as possible. #facebookdown.”
After the Gmail and Google Drive outage in May, Google wrote on their App Status Dashboard: “The problem with Gmail should be resolved. We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better.”
No matter how services decide to respond or react to outages, the follow-up cannot be forgotten. The service needs to allay fears and be as transparent as possible. What was the cause and how was it corrected? How will it be avoided in future?
It’s important for us all to be prepared, not only with contingency plans that manage the issue but also with social strategies that manage communication with our clients. How would you respond to your users in the case of an unexpected break in service? Whatever words you choose, honesty and integrity will always be appreciated by your clients.