Amazon issues apology for Christmas Eve cloud service disruption
On Christmas Eve, Netflix users in the US were left without streaming for around seven hours, as an outage affected video streaming on certain devices, including games consoles. The root cause was traced back to the actions of one of a very small number of developers with access to the production environment at Amazon, which counts Netflix among its largest cloud customers.
The outage was caused by the inadvertent deletion of data during a maintenance process. The deleted data recorded the state of Amazon's Elastic Load Balancing (ELB) systems, which distribute network traffic across multiple servers to keep services running smoothly. Only a fraction of the load balancers were affected, but those included some of the ELB instances that Netflix relies on.
A handful of Netflix's ELB instances lost their ability to pass requests to the servers behind them, causing a ripple effect that disrupted the streaming services. It took 24 hours for Amazon to confirm that the service was fully restored.
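To illustrate why lost state data stops a load balancer from forwarding traffic, here is a minimal Python sketch. It is a toy round-robin balancer, not a description of how ELB works internally; the class name `ToyLoadBalancer`, the `lose_state` method, and the back-end names are all hypothetical and chosen only for illustration.

```python
class ToyLoadBalancer:
    """Hypothetical round-robin balancer; not Amazon's ELB implementation."""

    def __init__(self, backends):
        # The "state" is simply the list of servers the balancer knows about.
        self.backends = list(backends)
        self._next = 0

    def route(self, request):
        # Without any recorded back ends there is nowhere to send the request.
        if not self.backends:
            raise RuntimeError("no backend state: cannot route request")
        backend = self.backends[self._next % len(self.backends)]
        self._next += 1
        return backend

    def lose_state(self):
        # Simulates the state data being deleted (an assumption for this sketch).
        self.backends.clear()


lb = ToyLoadBalancer(["server-a", "server-b", "server-c"])
first_four = [lb.route(n) for n in range(4)]  # cycles through the servers in order

lb.lose_state()
try:
    lb.route("another request")
    routed_after_loss = True
except RuntimeError:
    routed_after_loss = False
```

In this sketch the servers themselves are still healthy after `lose_state()` is called; the balancer simply no longer knows where to send requests, which mirrors the failure described above.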
Netflix, known for its transparency about the technology it uses, has shared a number of open source tools for managing cloud services. One such tool is Chaos Monkey, a cloud testing system that helps companies build resiliency in the cloud. However, the incident shows that even with such tools in place, a failure in the underlying cloud platform can still take a service offline.
Adrian Cockcroft, Netflix's director of cloud architecture, stated that it's still early days for cloud innovation, and more needs to be done to build resiliency in the cloud. In response to the incident, Amazon has tightened up its change management process and altered its data recovery process to prevent a repeat.
Amazon apologized for the outage, which took Netflix offline on Christmas Eve. Despite the inconvenience caused, Netflix users can rest assured that steps are being taken to prevent such incidents from happening again. Netflix uses hundreds of ELB instances, each supporting a distinct service or a different version of a service, and the company is committed to keeping these services running smoothly for its users.