AWS S3 Outage, Frameworks during Disruption, Reasons for the S3 Disruption

The Mishap
Due to the S3 outage, a number of websites began to experience issues with their
services, and many others went down completely. It turned out to be a
severe outage at Amazon Web Services, the company’s sizable cloud-computing
business, which hosts vast swaths of cyberspace.
The Simple Storage Service (S3) in the US-East (Northern Virginia) region was
disrupted for approximately 5 hours. Even the AWS Service Health Dashboard
displayed misleading status indicators, because it relies on S3 to store its
health marker graphics. The result was a massive disruptive impact on the companies
running their production workloads on AWS.

 

What is S3 and how does it work?

Amazon Simple Storage Service (Amazon S3) is object storage with a simple
web service interface to store and retrieve any amount of data from anywhere
on the web. It is designed to deliver 99.999999999% durability, and scale past
trillions of objects worldwide.
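To make the eleven-nines durability figure concrete, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes the figure is an annual, per-object probability and that object losses are independent, and the object count of 10,000,000 is an illustrative number, not something taken from the incident.

```python
from fractions import Fraction

# Assumption: 99.999999999% is an annual, per-object durability figure
# and object losses are independent of each other.
DURABILITY = Fraction("0.99999999999")


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year for a given object count."""
    return object_count * (1 - DURABILITY)


def years_per_single_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


if __name__ == "__main__":
    objects = 10_000_000  # illustrative figure
    print("expected losses per year:", float(expected_annual_losses(objects)))
    print("years per lost object:", years_per_single_loss(objects))
```

Under these assumptions, storing 10,000,000 objects works out to an expected loss of one object roughly every 10,000 years. Note that durability says nothing about availability, which is what suffered during this outage.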

Customers use Amazon S3 as primary storage, as a bulk repository for user-generated
content, and as a tier in an active archive. Amazon S3 is key-based
object storage, which means that every object is stored under a unique object key
that is later used to retrieve the data. Amazon S3 replicates the data across
multiple devices within the Region, although at the time of the outage it followed
an eventual consistency model for overwrites and deletes. This means that one may
not be able to read the latest version of the data immediately after an S3 object
is updated, because the update takes time to replicate between the AZs and AWS
gives no indication of when that replication has completed.
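The stale-read behaviour of an eventually consistent store can be illustrated with a toy model. This is a sketch only and does not reflect how S3 is implemented internally: the class name, the explicit `replicate()` step and the replica index passed to `get()` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


class EventuallyConsistentStore:
    """Toy key-value store with one primary copy and lagging replicas."""

    def __init__(self, replica_count: int = 2) -> None:
        self.primary: Dict[str, bytes] = {}
        self.replicas: List[Dict[str, bytes]] = [{} for _ in range(replica_count)]
        self.pending: List[Tuple[str, bytes]] = []  # writes not yet replicated

    def put(self, key: str, value: bytes) -> None:
        # The write is acknowledged once the primary has it;
        # replicas only catch up when replicate() runs.
        self.primary[key] = value
        self.pending.append((key, value))

    def get(self, key: str, replica: int) -> Optional[bytes]:
        # A read may be served by any replica, including a stale one.
        return self.replicas[replica].get(key)

    def replicate(self) -> None:
        # Apply all pending writes to every replica, in write order.
        for key, value in self.pending:
            for copy in self.replicas:
                copy[key] = value
        self.pending.clear()


if __name__ == "__main__":
    store = EventuallyConsistentStore()
    store.put("report.csv", b"v1")
    store.replicate()
    store.put("report.csv", b"v2")        # overwrite, not yet replicated
    print(store.get("report.csv", 0))     # still the old version
    store.replicate()
    print(store.get("report.csv", 0))     # now the new version
```

The reader has no way to tell from the response alone whether the first read was stale, which is the practical difficulty with this consistency model.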

Acknowledging the mishap

In a statement, Amazon said: “Unfortunately, one of the inputs to the command
was entered incorrectly and a larger set of servers was removed than
intended.” An engineer servicing Amazon’s S3 system, following an established
playbook, executed a command with a mistyped input. Rather than
taking a handful of servers offline for servicing, the command took a much larger
set offline, including servers that supported two other S3 subsystems.

One of these subsystems was the index subsystem, which manages
the metadata and location information of all S3 objects in the
region and is needed to serve all GET, LIST, PUT, and DELETE requests.
The second was the placement subsystem, which is responsible for
allocating new storage and relies on the index subsystem to function properly.
Removing a significant portion of the capacity forced each of these systems into
a complete restart, during which S3 was unable to process service requests.

All that you can do!

Since the disruption was traced back to a single typo, there is a clear need to
build a framework now that keeps applications available during future outages.

Solution 1:
Configure Amazon S3 cross-region replication. This provides automatic, bucket-level,
asynchronous replication of objects to a different AWS Region. Configuring
AWS Lambda, Amazon SNS and Amazon Route 53 along with Amazon S3 then automates
the failover: when high API error rates show up on the AWS Service Health Dashboard,
an SNS notification triggers a Lambda function that swaps the Route 53 entry so
that it points to the replica.
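The Lambda side of this setup could look roughly like the sketch below. The hosted zone ID, record name and replica endpoint are hypothetical placeholders, and the optional `route53` parameter is an assumption added so that the function can be exercised without AWS credentials. What the DNS record should actually point at (for example a CloudFront distribution in front of the replica bucket) depends on your setup and is outside this sketch.

```python
# Hypothetical values for illustration; replace with your own.
HOSTED_ZONE_ID = "Z0000000EXAMPLE"
RECORD_NAME = "assets.example.com."
REPLICA_ENDPOINT = "replica.example.com."
TTL_SECONDS = 60  # low TTL so clients pick up the swap quickly


def build_change_batch(record_name, target, ttl=TTL_SECONDS):
    """Build the Route 53 ChangeBatch that repoints a CNAME at `target`."""
    return {
        "Comment": "Fail over to the replica bucket",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }],
    }


def lambda_handler(event, context, route53=None):
    """Entry point for the SNS-triggered Lambda function."""
    if route53 is None:
        import boto3  # imported lazily so the sketch loads without boto3
        route53 = boto3.client("route53")

    # SNS delivers the notification text in Records[].Sns.Message.
    messages = [r["Sns"]["Message"] for r in event.get("Records", [])]
    if not messages:
        return {"swapped": False}

    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch=build_change_batch(RECORD_NAME, REPLICA_ENDPOINT),
    )
    return {"swapped": True, "target": REPLICA_ENDPOINT}
```

The sketch swaps on any SNS message it receives. In practice you would probably inspect the message and decide whether the error rate really justifies a failover, and you would need a separate path for switching back.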

Pros:
This is an automated process and no manual intervention is required.

Cons:
Route 53 has to be configured with a low Time to Live (TTL).
The data served can be stale due to asynchronous replication.
Relying on the AWS Service Health Dashboard can be difficult, as it depends on S3 as well.
The latency of data transfer is high.

Solution 2:
Configure Amazon S3 cross-region replication and use the secondary bucket
URL as a fallback whenever API calls to the primary bucket fail.

Pros:
The configuration of Route 53 is not required.
There is no need to configure Amazon SNS and AWS Lambda.

Cons:
The latency of data transfer is high.
Automating the S3 URL swapping at the code level can be intricate for
developers.
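The code-level fallback in Solution 2 can be sketched as below. This is a minimal illustration, not a hardened client: `get_with_fallback` and `s3_fetcher` are hypothetical helper names, and the boto3-based fetcher assumes boto3 is installed and credentials are configured.

```python
from typing import Callable, Sequence

# A fetcher takes an object key and returns the object's bytes,
# raising an exception on failure.
Fetcher = Callable[[str], bytes]


def get_with_fallback(key: str, fetchers: Sequence[Fetcher]) -> bytes:
    """Try each fetcher in order and return the first successful result."""
    last_error = None
    for fetch in fetchers:
        try:
            return fetch(key)
        except Exception as exc:  # real code should catch the client's specific errors
            last_error = exc
    raise RuntimeError(f"all buckets failed for key {key!r}") from last_error


def s3_fetcher(bucket: str, region: str) -> Fetcher:
    """Build a fetcher for one bucket using boto3 (assumed to be installed)."""
    import boto3  # imported lazily so the fallback logic has no hard dependency

    client = boto3.client("s3", region_name=region)

    def fetch(key: str) -> bytes:
        return client.get_object(Bucket=bucket, Key=key)["Body"].read()

    return fetch
```

A caller would pass the primary bucket's fetcher first and the replica's second, for example `get_with_fallback(key, [s3_fetcher("my-primary-bucket", "us-east-1"), s3_fetcher("my-replica-bucket", "us-west-2")])`, where both bucket names are placeholders. Because replication is asynchronous, the replica may return an older version of the object.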

Solution 3:
Write the metadata of each S3 object to DynamoDB whenever a PUT operation
is performed on the S3 bucket, so that every write goes to S3 as well as to
DynamoDB. All S3 read/list operations need to be rewritten
to query DynamoDB first, so that the applications rely only on the metadata stored in
DynamoDB to locate objects. In the case of a failure, it is then easy to update
the metadata in DynamoDB and point it to the bucket which has the replicated data.
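The metadata indirection in Solution 3 can be sketched with an in-memory stand-in for the DynamoDB table. Everything here is illustrative: `ObjectCatalog` and its method names are hypothetical, and a real implementation would replace the dictionary with DynamoDB calls (roughly `put_item`, `get_item` and `update_item` on a boto3 table) and would also store whatever other metadata the application needs.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ObjectCatalog:
    """In-memory stand-in for a DynamoDB table mapping object key -> bucket."""

    locations: Dict[str, str] = field(default_factory=dict)

    def record_put(self, key: str, bucket: str) -> None:
        # Called alongside every S3 PUT so the catalog always knows
        # where the object lives.
        self.locations[key] = bucket

    def bucket_for(self, key: str) -> str:
        # Reads consult the catalog first, then fetch from the returned bucket.
        return self.locations[key]

    def fail_over(self, failed_bucket: str, replica_bucket: str) -> int:
        """Repoint every object in the failed bucket at the replica bucket."""
        moved = 0
        for key, bucket in self.locations.items():
            if bucket == failed_bucket:
                self.locations[key] = replica_bucket
                moved += 1
        return moved


if __name__ == "__main__":
    catalog = ObjectCatalog()
    catalog.record_put("img/logo.png", "primary-bucket")
    catalog.fail_over("primary-bucket", "replica-bucket")
    print(catalog.bucket_for("img/logo.png"))
```

The application code never hard-codes a bucket name, so a failover only touches the catalog. The cost is that every read now involves an extra lookup, and the catalog itself becomes a dependency that must stay available.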

Pros:
The URL update is only done in DynamoDB.

Cons:
This can be challenging to implement.
The latency of data transfer would still be an issue if CloudFront is not used.

Conclusions
Reduce the blast radius by using multiple AWS accounts per Region
and service, which limits the impact of a critical event such as an AWS
Region or Availability Zone becoming unavailable.

Provisioning for future growth requires continuous iteration and adaptation
of the design. It is also necessary to design a framework that caters for elasticity.

Multi-region design is important, and easier than multi-cloud.

No technology is ever 100% fail-proof, and hence strong operational
performance is mandatory.

Refer to the main document at the link below:

http://blog.blazeclan.com/outage-amazon-web-services-aws-architected-framework/
