How Should We Think About The AWS Outage?

amazon-s3-outage

Tuesday, February 28th, was a bad day for AWS and for AWS users who relied on the US-East-1 Region in Northern Virginia to run their business and/or to serve their customers. I won’t rehash what happened, but readers can get details on the outage in this TechTarget article. You can also find the postmortem from Amazon here. And yes, I know Amazon did not officially classify it as an outage, but it was effectively one for many users. As expected, both apologists and detractors have taken to social media and the Internet to defend or to bury Amazon Web Services. I’ve read everything from “If a user’s application failed, it’s all their fault” to “AWS is so unreliable even Amazon doesn’t use it” and everything in between. One of the most balanced reflections was actually written by my Rackspace colleague, Kevin Jackson.

I want to take a few moments to share some thoughts on the S3 outage/slow-down and what it means for users. Then I’ll walk through some tips for architecting against Region-level failures.

Some random but hopefully relevant thoughts on the outage:

  • Infrastructure is hard and infrastructure at scale is orders of magnitude harder. Managing many thousands of servers is not the same as managing hundreds of servers, but with a more powerful script.
  • Durability is not the same as availability.  I saw many people mixing up the two yesterday.
    • S3 durability speaks to the ability of the service to protect your data and to ensure it is not damaged or lost.
    • S3 availability speaks to the ability to actually access the service in order to get to your data.
    • As far as we know, no data was lost which means AWS can continue to tout S3’s eleven 9s of durability. What did take a hit was their claim of four 9s of availability.
  • As expected, many vendors are coming out of the woodwork to talk about how much safer it would be to use their on-premises or private cloud software or hardware.
    • It is true that the blast radius of an AWS outage is much greater than for a private environment given the number of companies that run their businesses on AWS.
    • But anyone who runs on-premises infrastructure of any scale and tells you they’ve not had any significant outage is probably lying.
    • If a vendor promises you that their on-premises solution is bullet proof and always available, they are not telling you the truth.
    • On-premises solutions do fail but you don’t hear about it because the knowledge is not exposed to the public. Vendors have a vested interest in hiding any flaws in their solution and customers have a vested interest in not taking the chance that they may be criticized for their choice of solutions.
  • Private clouds have their place, but no private cloud solution provides, or will ever provide, the same level of services and innovation that public clouds do. This means that while you may be able to achieve higher availability by staying on premises, the tradeoff is that you will not be able to take advantage of the innovations of the Public Cloud. That tradeoff manifests itself as lost opportunity.
  • Companies like Amazon and Netflix were not impacted by the issues in the US-East-1 Region because their applications were architected to survive even the failure of a Region.
  • While there is a cost to every solution, the building blocks exist to create a highly available environment that can survive the failure of a Region.
  • Too many users still do not leverage multiple Availability Zones, let alone multiple Regions for their applications.

Here are some tips on building highly available multi-region applications and on setting up Disaster Recovery in the Cloud. It’s not meant to be comprehensive or detailed but to get folks thinking differently. You may want to read my primer on Regions and Availability Zones first if you need a refresher.

  • Sound principles for building resilient systems don’t just go away because you are in a public cloud. You have to adapt those principles to work with a new architecture and with a new set of tools and services.
  • Understand the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each of your applications so you can design the right solution for each use case.
  • Leverage multiple Regions for your applications whenever possible. Below is an architecture diagram, courtesy of Pinterest, that illustrates what a multi-region deployment might look like.

pinterest-ha

  • There’s not a one size fits all solution for utilizing multiple AWS Regions. There are different approaches you can take depending on RTO, RPO and the amount of cost you are willing and able to incur and the tradeoffs you are willing to make. Some of these approaches include:
    • Recovering to another Region from backups – Back up your environment to S3, including EBS snapshots, RDS snapshots, AMIs and regular file backups. Since S3 only replicates data, by default, to Availability Zones within a single Region, you’ll need to enable cross-region replication to your DR Region (a minimal boto3 sketch of doing this follows this list). You’ll incur the cost of transferring and storing data in a second Region but won’t incur compute, EBS or database costs until you need to go live in your DR Region. The trade-off is the time required to launch your applications.
    • Warm standby in another Region – Replicate data to a second Region where you’ll run a scaled down version of your production environment. The scaled down environment is always live and sized to run the minimal capacity needed to resume business. Use Route 53 to switch over to your DR Region as needed. Scale up the environment to full capacity as needed. You get faster recovery but incur higher costs.
    • Hot standby in another Region – Replicate data to a second Region where you run a full version of your production environment. The environment is always live and invoking full DR involves switching traffic over using Route 53. You get even faster recovery but incur even higher costs.
    • Multi-Region active/active solution – Data is synchronized between both Regions and both Regions are used to service requests. This is the most complex to set up and the highest cost incurred. However, little or no downtime is suffered even when an entire Region fails. While the approaches above are really DR solutions, this is about building a true highly available solution.
  • One of the keys to a successful multi-region setup and DR process is to automate as much as possible. This includes backups, replication and launching your applications. Leverage tools such as CloudFormation to capture the state of your environment and to automate the launching of resources. This is particularly important if you plan to recover from backups and to create your DR environment from scratch when you invoke DR.
  • Test, test and test again to ensure that you are able to successfully recover from an Availability Zone or Region failure. Test not only your tools but your processes.
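To make the first approach above more concrete, here is a minimal sketch, using Python and boto3, of enabling S3 cross-region replication from a production bucket to a DR bucket. The bucket names, Region names and IAM role ARN are hypothetical placeholders, and the replication role must already grant S3 permission to read the source bucket and write to the destination.

```python
import boto3

# Source bucket lives in the production Region, destination in the DR Region.
# All names and the role ARN below are placeholders -- substitute your own.
src = boto3.client("s3", region_name="us-east-1")
dst = boto3.client("s3", region_name="us-west-2")

# Versioning must be enabled on both buckets before replication can be configured.
src.put_bucket_versioning(Bucket="my-prod-bucket",
                          VersioningConfiguration={"Status": "Enabled"})
dst.put_bucket_versioning(Bucket="my-dr-bucket",
                          VersioningConfiguration={"Status": "Enabled"})

# Replicate every new object in the source bucket to the DR bucket.
src.put_bucket_replication(
    Bucket="my-prod-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",  # hypothetical role
        "Rules": [
            {
                "ID": "replicate-to-dr",
                "Prefix": "",            # empty prefix = replicate all objects
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::my-dr-bucket"},
            }
        ],
    },
)
```

The same idea carries over to the warm, hot and active/active approaches; what changes is mainly how much standby infrastructure you keep running in the second Region and how quickly Route 53 shifts traffic to it.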

Obviously, much more can be said here and I plan to provide more details in the future about how to architect a multi-region application and to set up DR in the Cloud. Meanwhile, I hope that I’ve provided some food for thought on how we should think about Public Cloud and AWS outages.

AWS 201: What is a Default VPC?

default-vpc-diagram

To set the stage for explaining Amazon Web Services Virtual Private Clouds, I previously walked through AWS Regions and Availability Zones in another blog post. With that as the foundation, we can start taking a look at the concept of a Virtual Private Cloud and how it enables advanced networking capabilities for your AWS resources.

Virtual Private Cloud, aka VPC, is a logically isolated virtual network, spanning an entire AWS Region, where your EC2 instances are launched. A VPC is primarily concerned with enabling the following capabilities:

  • Isolating your AWS resources from other accounts
  • Routing network traffic to and from your instances
  • Protecting your instances from network intrusion

There are 6 core components which are fundamental to being able to launch AWS resources, such as EC2 instances, into a VPC. These 6 components are:

  • VPC CIDR Block
  • Subnet
  • Gateways
  • Route Table
  • Network Access Control Lists (ACLs)
  • Security Group

Every AWS account created after 2013-12-04 supports VPCs and these accounts are assigned a default VPC in every Region. These default VPCs are designed to make it easy for AWS users to get started with setting up networking for their EC2 instances. In the whiteboard video below, I will explain how the basic components of a VPC fit together and walk through how they are configured in a default VPC. For the rest of this post, I will walk you through what a default VPC looks like in the AWS Console and highlight important default settings.

The place to begin looking in your AWS Console is with the VPC Dashboard where you can get an overall view of what components are available as part of your VPC footprint. Note that the basic components I mentioned earlier will already be provisioned as part of the default VPC that is assigned to every AWS account for every region.

screen-shot-2017-02-17-at-1-42-14-pm

VPC CIDR Block

Select “Your VPCs” in the left sidebar and the dashboard will display all your VPCs in a particular Region, including the default VPC. A Region can only have one default VPC. Although you can have up to 5 VPCs in a Region, only the initial VPC that AWS creates for you can be the default VPC.

screen-shot-2017-02-17-at-1-49-02-pm

Every VPC is associated with an IP address range that is part of a Classless Inter-Domain Routing (CIDR) block which will be used to allocate private IP addresses to EC2 instances. AWS recommends that VPCs use private ranges as defined in RFC 1918. Every default VPC is associated with an IPv4 CIDR block with a 172.31.0.0/16 address range. This gives you 65,536 possible IP addresses, minus some AWS reserved addresses.
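If you want to see these values for yourself, here is a small boto3 sketch that lists the VPCs in a Region and flags the default VPC along with its CIDR block. It assumes your credentials are already configured, and the Region name is just an example.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example Region

# List every VPC in the Region and mark which one is the default.
for vpc in ec2.describe_vpcs()["Vpcs"]:
    label = "default" if vpc["IsDefault"] else "custom"
    print(f'{vpc["VpcId"]} ({label}): {vpc["CidrBlock"]}')
# The default VPC should report a CIDR block of 172.31.0.0/16.
```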

Subnet

Next, if you go to the “Subnets” screen, you will see that multiple default subnets have already been assigned to your default VPC, one subnet for each Availability Zone in a Region.

screen-shot-2017-02-17-at-10-10-31-pm

A subnet is always associated with a single Availability Zone and cannot span multiple AZs. However, an AZ can host multiple subnets. Each subnet in a VPC is associated with an IPv4 CIDR block that is a subset of the /16 CIDR block of its VPC. In a default VPC, each default subnet is associated with a /20 CIDR block, which provides 4,096 possible IP addresses, minus the 5 addresses that AWS always reserves in each subnet. Note that two subnets cannot have overlapping address ranges.

When you launch an EC2 instance into a default VPC without specifying a specific subnet, it is automatically launched in one of the default subnets. Every instance in a default subnet receives a private IP address from the pool of addresses associated with that subnet and also a private DNS hostname. In a default subnet, an instance will also receive a public IP address from the pool of addresses owned by AWS along with a public DNS hostname, which will facilitate Internet access for your instances.
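To confirm the one-default-subnet-per-AZ layout described above, you can list the subnets in your default VPC with a short boto3 sketch like the one below (the Region name is again just an example).

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Print each default subnet with its AZ, CIDR block and remaining free addresses.
for subnet in ec2.describe_subnets()["Subnets"]:
    if subnet["DefaultForAz"]:
        print(f'{subnet["SubnetId"]}  {subnet["AvailabilityZone"]}  '
              f'{subnet["CidrBlock"]}  free IPs: {subnet["AvailableIpAddressCount"]}')
```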

ip-addressing

Gateways

Frequently, your EC2 instances will require connectivity outside of AWS, to the Internet or to a user’s corporate network, via the use of gateways. For communication with the Internet, a VPC must be attached to an internet gateway. An internet gateway is a fully managed AWS service that performs bi-directional source and destination Network Address Translation (NAT) for your EC2 instances. Optionally, a VPC may use a virtual private gateway to grant instances VPN access to a user’s corporate network.

A subnet that provides its instances a route to an internet gateway is considered a public subnet. A private subnet may be in a VPC with an attached internet gateway but will not have a route to that gateway. In a default VPC, all default subnets are public subnets and will have a route to a default internet gateway.
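A default VPC already has an internet gateway attached, but for a custom VPC you would wire this up yourself. A rough boto3 sketch is shown below; the VPC ID is a hypothetical placeholder.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create an internet gateway and attach it to a (hypothetical) custom VPC.
igw = ec2.create_internet_gateway()["InternetGateway"]
ec2.attach_internet_gateway(
    InternetGatewayId=igw["InternetGatewayId"],
    VpcId="vpc-0123456789abcdef0",  # placeholder custom VPC ID
)
```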

screen-shot-2017-02-18-at-10-18-49-pm

Route Table

I’ve mentioned routing several times while talking about the internet gateway. Every VPC is attached to an implicit router. This is a router that is not visible to the user and is fully managed and scaled by AWS. What is visible is the route table that is associated with each subnet and is used by the VPC router to determine the allowed routes for outbound network traffic leaving a subnet.

Note from the screenshot below that every route table contains a default local route to facilitate communication between instances in the same VPC, even across subnets. In the case of the main route table that is associated with a default subnet, there will also be a route out to the Internet via the default internet gateway for the VPC.

screen-shot-2017-02-18-at-9-33-01-pm

Also note that every subnet must be associated with a route table. If the association is not explicitly defined, then a subnet will be implicitly associated with the main route table.
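In a default VPC the main route table already contains the local route and the 0.0.0.0/0 route to the internet gateway. For a custom VPC you would add that route and associate subnets yourself, roughly as in the boto3 sketch below; all of the IDs are hypothetical placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Send all non-local traffic from this route table to the internet gateway.
ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",      # placeholder route table ID
    DestinationCidrBlock="0.0.0.0/0",
    GatewayId="igw-0123456789abcdef0",         # placeholder internet gateway ID
)

# Explicitly associate a subnet with the route table, making it a public subnet.
ec2.associate_route_table(
    RouteTableId="rtb-0123456789abcdef0",
    SubnetId="subnet-0123456789abcdef0",       # placeholder subnet ID
)
```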

screen-shot-2017-02-18-at-10-01-36-pm

Network ACLs

One concern you may rightly have is network security, particularly if all default subnets in a default VPC are public and open to Internet traffic. AWS provides security mechanisms for your instances in the form of network ACLs and security groups. These 2 mechanisms can work together to provide layered protection of your EC2 instances.

A network Access Control List (ACL) acts as a firewall that controls network traffic in and out of a subnet. You create network ACL rules for allowing or denying network traffic for specific protocols, through specific ports and for specific IP address ranges. A network ACL is stateless and has separate inbound and outbound rules. This means both inbound and outbound rules have to be created to allow certain network traffic to enter the subnet and for responses to go back out. For example, if you create an inbound rule allowing SSH traffic into the subnet, you must also create an outbound rule to allow the SSH return traffic out of the subnet.

A rule number is assigned to each rule and all rules are evaluated starting with the lowest numbered rule. When traffic hits the firewall, it is evaluated against the rules in ascending order. As soon as a rule is evaluated that matches the traffic being considered, it is applied regardless of what is indicated in a subsequent rule.
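As a concrete illustration of that statelessness, the boto3 sketch below adds a pair of rules to a network ACL: an inbound rule allowing SSH and an outbound rule allowing the return traffic on ephemeral ports. The ACL ID and source range are hypothetical placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
acl_id = "acl-0123456789abcdef0"  # placeholder network ACL ID

# Inbound rule 100: allow SSH (TCP/22) into the subnet from an example range.
ec2.create_network_acl_entry(
    NetworkAclId=acl_id, RuleNumber=100, Protocol="6", RuleAction="allow",
    Egress=False, CidrBlock="203.0.113.0/24", PortRange={"From": 22, "To": 22},
)

# Outbound rule 100: allow the SSH return traffic back out on ephemeral ports,
# which is required because network ACLs are stateless.
ec2.create_network_acl_entry(
    NetworkAclId=acl_id, RuleNumber=100, Protocol="6", RuleAction="allow",
    Egress=True, CidrBlock="203.0.113.0/24", PortRange={"From": 1024, "To": 65535},
)
```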

screen-shot-2017-02-19-at-8-20-47-pm

 

screen-shot-2017-02-19-at-8-35-18-pm

As indicated above, the default Network ACL in a default VPC is configured with lower-numbered rules for both inbound and outbound traffic which combine to explicitly allow bi-directional communication for any protocol, through any port and to and from any source or destination.

You can associate a Network ACL with multiple subnets but any single subnet can only be associated with one Network ACL. If you don’t specifically associate a Network ACL with a subnet, the subnet is automatically associated with the default Network ACL. This is the case with your default VPCs which have all subnets associated with the default Network ACL.

Security Groups

A security group is considered the first line of defense; it is a firewall that is applied at the instance level. This means that only instances explicitly associated with a security group will be subject to its rules, while all instances in a subnet are impacted by the network ACL applied to that subnet.

Similar to network ACLs, you create inbound and outbound traffic rules based on protocol, port and source or destination IP. However, there are some differences as well.

  • You can specify rules to allow network traffic but cannot create rules to deny specific types of traffic. In essence, all traffic is denied except for traffic you explicitly allow.
  • Security groups are stateful so if you create a rule to allow a certain type of traffic in, then outbound traffic in response is also allowed even if there is no explicit outbound rule to allow such traffic.

Every instance must be associated with a security group, and if a security group is not specified at launch time, then that instance will be associated with the default security group for its VPC.

Screen Shot 2017-02-19 at 9.04.20 PM.png

You can see from the screenshot above that a default security group will have a rule that only allows inbound traffic from other instances that are associated with the same default security group. No other inbound traffic is allowed.

Screen Shot 2017-02-19 at 9.07.49 PM.png

Looking at the outbound rules above, all network traffic out is allowed by the default security group. This includes traffic out to the Internet since a default VPC will have a route to a default internet gateway.
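To make the contrast with stateless network ACLs concrete, the boto3 sketch below creates a security group that allows inbound SSH from a single range; because security groups are stateful, the matching return traffic is allowed automatically and no extra rule is needed for it. The VPC ID and CIDR are hypothetical placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a security group in a (hypothetical) VPC.
sg = ec2.create_security_group(
    GroupName="ssh-example",
    Description="Allow SSH from an example office range",
    VpcId="vpc-0123456789abcdef0",  # placeholder VPC ID
)

# Allow inbound SSH; the response traffic is permitted automatically
# because security groups are stateful.
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.0/24"}],
    }],
)
```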

As you can imagine, while a default VPC may be suitable for small, non-critical, single-tier applications, it is not ideal for a robust production environment. That’s why it is recommended that after getting their feet wet, users modify the default VPC configuration or create custom VPCs for production use. In the next blog post, we will do just that and break down a common use case with a custom VPC that has a public subnet tier and a private subnet tier. Stay tuned.

AWS 101: Global Infrastructure – Regions and Availability Zones

In their most recent earnings call, Amazon reported that their Amazon Web Services division has reached a $14.2 billion run rate. As impressive as that is, AWS and the entire cloud market still only represents a small slice of the total IT budget worldwide. In fact, while IDC projects the Public Cloud to be a ~$200 billion market by 2020, it also projects the total IT budget in 2020 to be $2.7 trillion. Public cloud adoption is accelerating but the market opportunity is even greater than most people think since we are nowhere near market saturation. The reality is that for most companies, AWS and other public clouds are largely untapped resources and many users are only beginning to familiarize themselves with what the Public Cloud has to offer.

To help those who are new to AWS and desire to learn more, I will be writing some 101 and 201 blog series that walk readers through some core concepts and services. Along with many of these posts, I will also be posting some whiteboard and demo videos.

The first such series will be on the important topic of Virtual Private Cloud or VPC. We will be walking through the components of a VPC, including what comes with a default VPC. That will be followed by a deep dive on how to build a custom VPC and the impact it will have on network and application designs. But before jumping into a discussion on Virtual Private Clouds, it is important we understand the concept of Regions and Availability Zones since they are foundational to building on top of AWS.

The AWS Global Infrastructure is currently comprised of 16 Regions worldwide and 42 Availability Zones with 2 additional Regions scheduled to be online in 2017.

global_infrastructure_12-15-2016

Amazon Web Services Region

A Region is a geographical location with a collection of Availability Zones mapped to physical data centers in that Region. Every Region is physically isolated from and independent of every other Region in terms of location, power, water supply, etc. This level of isolation is critical for workloads with compliance and data sovereignty requirements where guarantees must be made that user data does not leave a particular geographic region. The presence of Regions worldwide is also important for workloads that are latency sensitive and need to be located near users in a particular geographic area.

Inside each Region, you will find 2 or more Availability Zones, with each AZ hosted in data centers separate from those of any other AZ. More later on why having at least 2 Availability Zones in a Region is important. The largest AWS Region, us-east-1, has 5 Availability Zones. The current standard for new AWS Regions moving forward is to have 3 or more Availability Zones whenever possible. When you create certain resources in a Region, you will be asked to choose an Availability Zone in which to host that resource.
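You can see the Availability Zones your account has been assigned in any Region with a short boto3 call like the sketch below; the Region name is just an example.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example Region

# List the Availability Zones visible to this account in the chosen Region.
for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
    print(zone["ZoneName"], zone["State"])
```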

aws_regions

While each Region is isolated physically from every other Region, they can communicate with each other over the Amazon Global Network. This global network is a redundant 100 GbE private network that traverses the globe, running through and between each AWS Region. With regional connectivity, users can build applications that are global in reach and can survive the failure of even an entire Region. AWS can also leverage connectivity between Regions to replicate data, such as S3 object storage data or Elastic Block Storage snapshots, at the user’s discretion. An example of this can be seen in the diagram below, which was shared by Pinterest.

pinterest-ha

I talk more about Availability Zones below, but what is worth noting here is that by replicating data across Regions and using Amazon’s Route 53 managed DNS, it is possible to survive or recover from the failure of an entire Region, since the failure of any Region should have minimal to no impact on any other Region.
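As one example of user-initiated cross-region replication, the boto3 sketch below copies an EBS snapshot from us-east-1 into another Region; the snapshot ID is a hypothetical placeholder.

```python
import boto3

# The destination Region is the Region of the client making the call.
ec2_west = boto3.client("ec2", region_name="us-west-2")

copy = ec2_west.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId="snap-0123456789abcdef0",   # placeholder snapshot ID
    Description="DR copy of production data volume",
)
print("New snapshot in us-west-2:", copy["SnapshotId"])
```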

Amazon Web Services Availability Zone

An Availability Zone is a logical data center in a Region that is available for use by any AWS customer. Each AZ in a Region has redundant and separate power, networking and connectivity to reduce the likelihood of two AZs failing simultaneously. A common misconception is that a single AZ = a single data center. In fact, each AZ is backed by 1 or more physical data centers, with the largest AZ being backed by 5 data centers. While a single AZ can span multiple data centers, no two Availability Zones share a data center. Abstracting things further, to distribute resources evenly across the Availability Zones in a given Region, Amazon independently maps AZs to identifiers for each account. This means the us-east-1a Availability Zone for one account may not be backed by the same data centers or physical hardware as us-east-1a for another account.

In each Availability Zone, participating data centers are connected to each other over redundant low-latency private network links. Likewise, all Availability Zones in a region communicate with each other over redundant private network links. These Intra-AZ and Inter-AZ links are heavily used for data replication by a number of AWS services including storage and managed databases.

So why are Availability Zones such an important and foundational concept in Amazon Web Services? The diagram below illustrates a Region with 2 Availability Zones where only one of the AZs is being utilized. The architecture mirrors what a typical three-tier application running in a user’s single on-premises data center might look like. While there are redundant servers running in each tier, the data center itself is a single point of failure.

single-az

In contrast to this architecture, the diagram below illustrates the recommended practice of spanning an application across multiple Availability Zones. By placing cloud instances/virtual servers for each tier in each AZ, users are able to eliminate an AZ as a single point of failure. Amazon Elastic Load Balancers (ELB) situated at different application tiers ensure that even if an entire AZ goes offline, traffic will be directed to the appropriate AZ. It’s worth pointing out that the ELBs “live” outside the AZs and are therefore not impacted by the failure of any particular AZ. ELB is one of many AWS services that have a regional scope and can span Availability Zones in a given Region. Other services, like Route 53, are global in scope, as shown below, and provide services to multiple Regions.

multi-az

This ability to leverage multiple Availability Zones is foundational for building a highly-available, fault-tolerant application using Amazon Web Services.

Next Up: Virtual Private Clouds

With a basic understanding of Regions and Availability Zones under our belts, we can move on to Virtual Private Clouds (VPCs) – what they are and how they relate to Regions and Availability Zones. In the next post, I will break down the components of a VPC and explain the concept of a default VPC, which is always assigned to each AWS account. Then we will walk through how to build custom VPCs and how a properly configured VPC design can lay the groundwork for building a highly secure production environment. So, stay tuned.

Why AWS Loves and Hates Data Gravity

screen-shot-2017-01-26-at-10-30-47-am

I received the e-mail above from Amazon Web Services after recently signing up for another test account. The e-mail had me thinking about the impact of data gravity on AWS, both positive and negative. For those who are new to the term, data gravity is a concept first coined by Dave McCrory, current CTO of Basho. It refers to the idea that “As Data accumulates (builds mass) there is a greater likelihood that additional Services and Applications will be attracted to this data.” McCrory attributes this data gravity phenomenon to “Latency and Throughput, which act as the accelerators in continuing a stronger and stronger reliance or pull on each other.” This is so because the closer services and applications are to their data, i.e. in the same physical facility, the lower the latency and the higher the throughput. This in turn enables more useful and reliable services and applications.

bi-data-gravity

A second characteristic of data gravity is that as more data accumulates, the more difficult it becomes to move that data. That’s the reason services and applications tend to coalesce around data. The further you try to move data and the more data you try to move, the harder it is to do because latency increases and throughput decreases. This is known as the “speed of light problem.” Practically, this means that at a certain capacity, it becomes extremely difficult or too costly to try to move data to another facility, such as a cloud provider.

Data gravity, therefore, represents both a challenge and an opportunity for Amazon Web Services. Given that the vast majority of data today lives outside of AWS data centers and has been accumulating for some time in locations such as customer data centers, data gravity becomes a major challenge for AWS adoption by established enterprises. This, of course, is something AWS must overcome to continue their growth beyond startups and niche workloads in enterprises. If AWS is able to remove the barriers to migrating data into their facilities, they can then turn data gravity into an advantage and an opportunity.

The opportunity that data gravity affords AWS is to continue and to extend their dominance as a cloud provider. As users store more data within AWS services such as S3 and EBS, data gravity kicks in and users find it often easier and more efficient to use additional AWS services to leverage that data more fully. This creates, for Amazon, a “virtuous cycle” where data gravity opens up opportunities for more AWS services to be used, which generates more data, that then opens up more services to be consumed.

Data gravity, and the need to both overcome and utilize it, is the reason so many AWS services are focused on data and how it can be more easily moved to AWS or more fully leveraged to produce additional value for customers. Take a look below at some of the many services that are particularly designed to attenuate or to accentuate data gravity.

  • Athena – Query service for analyzing S3 data
  • Aurora – MySQL-compatible, highly performant relational database
  • CloudFront – Global content delivery network to accelerate content delivery to users
  • Data Pipeline – Orchestration service for reliably processing and moving data between compute and storage services
  • Database Migration Service – Migrates on-premises relational databases to Amazon RDS
  • DynamoDB – Managed NoSQL database service
  • Elastic Block Storage – Persistent block storage volumes attached to EC2 instances
  • Elastic File System – Scalable file storage that can be mounted by EC2 instances
  • Elastic Map Reduce – Managed Hadoop framework for processing large-scale data
  • Glacier – Low-cost storage for data archival and long-term backups
  • Glue – Managed ETL service for moving data between data stores
  • Kinesis – Service for loading and analyzing streaming data
  • Quicksight – Managed business analytics service
  • RDS – Managed relational database service
  • Redshift – Petabyte-scale managed data warehouse service
  • S3 – Scalable and durable object storage for storing unstructured files
  • Snowball – Petabyte-scale service using appliances to transfer data to and from AWS
  • Snowmobile – Exabyte-scale service using shipping containers to transfer data to and from AWS
  • Storage Gateway – Virtual appliance providing hybrid storage between AWS and on-premises environments

So what are some takeaways as we consider AWS and its love/hate relationship with data gravity? Here are a few to consider:

  • If you are an enterprise that wants to migrate to AWS but is being held back by data gravity in your data center, expect that AWS will innovate beyond services like the Snowball and the Snowmobile to make migration of large data sets easier.
  • If you are a user who is “all-in” on AWS and has either created and/or migrated all or most of your data to AWS, the good news is that you will continue to see an ever growing number of services that will allow you to gain more value from that data.
  • If you are a user who is concerned about vendor/cloud provider lock-in, you need to carefully consider the benefits and consequences of creating and/or moving large amounts of data to AWS or of using higher-level services such as RDS and Amazon Redshift. (As an aside, the subject of lock-in is probably worth a dedicated blog post since I believe it is often misunderstood. In brief, each user should consider whether the benefits of being locked in outweigh the perceived liability, e.g., what if lock-in potentially costs me $1 million but generates $2 million in revenue over the same time period? Opportunity cost is difficult to calculate and is generally ignored in the ROI models I see.)
  • Finally, if you are an AWS partner or an individual who wants to work at AWS or for an AWS partner, put some focus on (in addition to security and Lambda) storage, analytics, database and data migration services, since they are all strategic to how Amazon deals with the positive and negative impacts of data gravity. This was in evidence at the most recent re:Invent conference, where much of the focus was placed on storage and database services such as EFS, Snowmobile and Aurora.

Given its importance and critical impact, AWS observers should keep a careful eye on what Amazon will continue to do to both overcome and to leverage data gravity. It may very well dictate the future success of Amazon Web Services.

 

Welcome to The Learning AWS Blog!

confession

I have a confession to make…

I was late to the party when it came to understanding the impact of the public cloud. I was tangentially aware of Amazon, the online book seller I used, getting into the virtual machine “rental” business in 2007. But as a technologist in the northeast, I heard very little about Amazon Web Services in my daily dealings with enterprise customers. To me, AWS was attempting to be a hosting provider targeting start-ups and small businesses looking to save money on their IT spend.

It wasn’t until 2011 that I started having substantial conversations with enterprise customers about the possibility of moving some workloads to AWS and to public clouds. Around this time I also started hearing traditional enterprise IT vendors talk about AWS, not other traditional vendors, as potentially becoming their biggest competitor. By 2012, I had finally grasped the power and the potential of the public cloud and the “havoc” AWS was wreaking in the IT industry. By early 2013, I was writing about why most users should adopt a “public cloud first” strategy and about the unlikelihood that anyone could challenge Amazon in the public cloud space.

That was also when I started to take a serious look at an open source cloud platform project called OpenStack. At the time, it looked to be the top contender as a private and public cloud alternative to AWS. That led me to join Rackspace and their OpenStack team in mid-2013.

Since then, AWS has continued to grow, along with other public clouds like Microsoft Azure and Google Cloud Platform. OpenStack has had missteps along with successes and is trying to find its place in the IT infrastructure space. I’ve written about that as well, wondering if OpenStack is “stuck between a rock and a hard place.” My employer and a co-founder of OpenStack, Rackspace, has pivoted as well to support both AWS and Azure alongside OpenStack.

Coming into 2017, some of my thoughts about the public cloud, AWS, private cloud and OpenStack have crystallized:

  • Even more than I did back in 2013, I believe that adopting a “public cloud first” strategy should be done by every company of every size.
  • This doesn’t mean that I think companies should move all their workloads to the public cloud. What it does mean is, as I said back in 2013, companies should look to move workloads to the public cloud as their default option and treat on-premises workloads as the exception.
  • Eventually, the majority of workloads will move off-premises as more businesses recognize that maintaining data centers and on-premises workloads is undifferentiated heavy lifting that is more of a burden than an asset.
  • Private clouds will have a place for businesses with workloads that must be kept on premises for regulatory or other business reasons. Those reasons will, however, decrease over time.
  • Private clouds will become a platform primarily for telcos and large enterprises and their platform of choice here will be OpenStack.
  • Everybody else will adopt a strategy of moving what they can to the public cloud and keeping the rest running on bare-metal or containers or running on VMware vSphere.
  • Managed private clouds, like the Rackspace OpenStack Private Cloud offering, rather than distributions, will deliver the best ROI for those who choose the private cloud route because they eliminate most of the undifferentiated heavy lifting.
  • Something that could potentially change the equation for on-premises workloads is if the OpenStack project chooses to pivot and implement VMware vSphere-like functionality. This would provide what most enterprises actually want from OpenStack – “free” open source vSphere that allows them to migrate their legacy workloads off VMware vSphere.
  • If OpenStack decides to go hard after the enterprise, they should drop or refocus the Big Tent. OpenStack has 0% chance of catching the public cloud anyway and it would be better to focus on creating and refining enterprise capabilities and to do so before the public cloud vendors beat them to it.
  • Multi-cloud is real but is not for everyone and not for every use case. It makes sense if you are a mature enterprise with different workloads that might fit better on one cloud over another. For startups, the best option is to invest in one public cloud and innovate rapidly on that cloud.
  • Amazon Web Services will continue to dominate the public cloud market for the foreseeable future even though Azure and Google will make some headway.

That last thought brings us to the reason why I am starting this new blog site – The Learning AWS Blog. I believe we are still in the early days of public cloud adoption and most users are just starting to learn what platforms like AWS can do for them. My goal with this new blog is to provide a destination for those who are new to AWS and seeking to learn. Since I am one of those who still has much to learn about AWS, and the best way I know to learn a technology is to try to explain what I have learned to others, this blog will serve both purposes. I will continue to maintain my Cloud Architect Musings blog for other technologies such as OpenStack, containers, etc.

Over the coming weeks and months, I will be putting up blog posts, whiteboard videos and demo videos about AWS services. I will look to cover every aspect of Amazon Web Services, from the basics of Availability Zones and Virtual Private Clouds, to automating infrastructure and application deployments using CloudFormation and Elastic Beanstalk, to designing scalable and highly available applications in the Cloud. I will try to provide the most accurate information possible but will always welcome corrections and feedback.

In the meantime, I’ve posted some recent blog posts from my Cloud Architect Musings blog that recap announcements from the AWS re:Invent 2016 conference back in November. I hope you will find those useful along with all that I have in store for this blog site in 2017. Stay tuned and thank you for reading and viewing.

AWS re:Invent 2016 Second Keynote: We Are All Transformers

In addition to this post, please also click here to read my AWS re:Invent Tuesday Night Live with James Hamilton recap and here to read my AWS re:Invent 2016 first keynote recap from Wednesday.

After a whirlwind of product announcements from CEO Andy Jassy the previous day, it was time for Werner Vogels, CTO of Amazon Web Services, to take the stage. You can view the keynote in its entirety below. You can also read on to get a digest of Vogels’ keynote along with links to get more information about the announced new services.

Sporting a Transformers t-shirt, Vogels talked about AWS’s role in helping to bring about IT transformation. He very specifically addressed users, particularly developers, about their role as transformers in the places where they worked. AWS can do this, explained Vogels, because they have strived from the very beginning to be the most customer centric IT company on Earth.

Screen Shot 2016-12-01 at 11.35.05 AM.png

To meet their goal of making their customers transformers in their businesses, Vogels talked about three ways that AWS can help create transformers.

Screen Shot 2016-12-01 at 11.51.50 AM.png

In the area of development, Vogels emphasized the importance of code development and testing because that’s where users can experiment and where businesses can be agile.

Screen Shot 2016-12-01 at 11.54.25 AM.png

To help users transform the way they do development and testing, Vogels focused on AWS products, old and new, that help bring about operational excellence, particularly in the areas of preparedness, operations and responsiveness.

Screen Shot 2016-12-01 at 11.59.59 AM.png

In the area of preparing, Vogels talked about the importance of automating as many tasks as possible in order to build reliable, secure and efficient development, test and production environments. A key service to enable automation on AWS is CloudFormation and although no new announcements were made in this area, Vogels took some time to review the new features that have been added to CloudFormation in 2016.

Screen Shot 2016-12-01 at 12.03.25 PM.png

Many customers make use of Chef cookbooks to prepare and to configure their AWS environments. This is in large part because of AWS OpsWorks, which is a configuration management service based on Chef Solo. Taking this to the next step, Vogels announced a new AWS OpsWorks for Chef Automate service. This new service provides users with a fully managed Chef server, removing one more operational burden they previously had to contend with. You can read more about AWS OpsWorks for Chef Automate here.

screen-shot-2016-12-01-at-12-04-45-pm

Moving on to systems management, Vogels announced Amazon EC2 Systems Manager, which is a collection of AWS tools to help with mundane administration tasks such as packaging, installation, patching, inventory, etc. You can read more about AWS EC2 Systems Manager here.

Screen Shot 2016-12-01 at 12.05.37 PM.png

Transitioning to operating as the next area of operational excellence transformation, Vogels made the argument that code development and continuous integration/continuous deployment are a crucial part of operations. After reviewing the existing services that AWS has to assist users with making the code development process more agile, Vogels announced AWS CodeBuild to go with the existing CodeCommit, CodeDeploy and CodePipeline services.

screen-shot-2016-12-06-at-4-49-51-pm

AWS CodeBuild is a fully managed service that automates building environments using the latest checked-in code and running unit tests against that code. This service streamlines the development process for users and reduces the risk of errors. You can read more about AWS CodeBuild here.

screen-shot-2016-12-01-at-12-08-29-pm

Another key aspect of operating is monitoring. As he had done previously, Vogels reviewed the existing services that help users gain visibility into their environments.

Screen Shot 2016-12-01 at 12.11.13 PM.png

Taking the next step to help users gain deeper insights into how their applications are running, Vogels harkened back to Jassy’s keynote theme of superpowers to introduce AWS X-Ray. Acknowledging the difficulty of debugging distributed systems, AWS released X-Ray to give users the ability to trace requests across their entire application and to map out the relationships between the various services in the system. This insight makes it easier for developers to troubleshoot and to improve their applications. You can read more about AWS X-Ray here.

screen-shot-2016-12-01-at-12-13-40-pm

The final area of operational excellence Vogels covered was responding. How can users respond to errors and alarms in an automated fashion that also escalates issues in a timely manner when necessary?

screen-shot-2016-12-01-at-12-16-00-pm

One answer from AWS is the new AWS Personal Health Dashboard. Based on the existing AWS Service Health Dashboard, this new service provides users with a personalized view of the system health of AWS. The new dashboard will show the performance and availability of the services being accessed by a user. Users will also receive alerts triggered by degradation in the services they are leveraging, and they can write Lambda functions to respond to those events. You can read more about the AWS Personal Health Dashboard here.

techjournalist_2016-dec-01-2

AWS and their customers also have to respond to security issues. Distributed Denial of Service attacks have been the top threat for web applications with many different types of attacks at different layers of the networking stack. Historically, most of these DDoS attacks have tended towards Volumetric and State Exhaustion attacks.

Screen Shot 2016-12-06 at 5.47.50 PM.png

To address these attacks, Vogels announced AWS Shield. This is a managed service that works in conjunction with other AWS services like Elastic Load Balancing and Route 53 to protect user web applications. AWS Shield comes in two flavors – AWS Shield Standard and AWS Shield Advanced. Standard is available to all AWS customers at no extra cost and protects users from 96% of the most common attacks.

screen-shot-2016-12-06-at-5-50-38-pm

AWS Shield Advanced provides additional DDoS mitigation capability for volumetric attacks, intelligent attack detection, and mitigation for attacks at the application & network layers. Users get 24×7 access to the AWS DDoS Response Team (DRT) for custom mitigation during attacks, advanced real-time metrics and reports, and DDoS cost protection to guard against bill spikes in the aftermath of a DDoS attack. You can read more about AWS Shield Standard and AWS Shield Advanced here.

screen-shot-2016-12-06-at-5-54-27-pm

Transitioning away from transforming operational excellence, Vogels moved to transformation through using data as a competitive differentiator. Because of the cloud, Vogels asserted, everyone has access to services such as data warehousing and business intelligence. What will differentiate companies from each other will be the quality of the data they have and the quality of the analytics they perform on that data.

The first new service announcement that Vogels made in this area was AWS Pinpoint. Pinpoint is a service that helps users run targeted campaigns to improve user engagement. It uses analytics to help define customer target segments, send targeted notifications to those segments and track how well a particular campaign performed. You can read more about AWS Pinpoint here.

awsreinvent_2016-dec-01-7

Moving on, Vogels argued that 80% of analytics work is not actually analytics but hard work to prepare and to operate an environment where you can actually do useful queries of your data. AWS is on a mission to flip this so 80% of analytics work done by users will actually be analytics.

screen-shot-2016-12-01-at-12-53-39-pm

Vogels argued that AWS already has a number of services to address most of the work that falls into that 80% bucket. To address even more of that 80%, Vogels introduced a new service called AWS Glue. Glue is a data catalog and ETL service that simplifies movement of data between different AWS data stores. It also allows users to automate tasks like data discovery, conversion, mapping and job scheduling. You can read more about AWS Glue here.

Screen Shot 2016-12-07 at 9.29.07 AM.png

By adding AWS Glue, Vogels argued that AWS now has all the pieces required to build the industry’s best modern data architecture.

Screen Shot 2016-12-01 at 1.05.23 PM.png

Another need for users in this space, said Vogels, is large-scale batch processing, which normally requires a great deal of heavy lifting to set up and use. To help here, Vogels announced AWS Batch. Batch is a managed service that lets users do batch processing without having to provision, manage, monitor, or maintain clusters. You can read more about AWS Batch here.

techjournalist_2016-dec-01-3

The last area of transformation Vogels addressed took him back to the roots of AWS – Compute. Except of course, “compute” at AWS has grown beyond Elastic Compute and virtual machines. Vogels reminded the audience that AWS compute has now grown to also include containers with Elastic Container Service and Serverless/Function as a Service with Lambda.

kongyang_2016-dec-01

Since all the new announcements about compute were made by Jassy in the previous keynote, Vogels focused on the containers and Lambda parts of the compute spectrum. For users of ECS, Vogels previewed a new task placement engine which will give users finer-grained control over scheduling policies.

Screen Shot 2016-12-07 at 2.04.53 PM.png

Beyond this, Vogels acknowledged that customers have requested the flexibility to build their own custom container schedulers to work with ECS or to integrate with existing schedulers such as Docker Swarm, Kubernetes or Mesos. To enable this, Vogels announced that AWS is open sourcing Blox, a collection of open source projects for building container management and orchestration services for ECS.

Screen Shot 2016-12-01 at 1.25.28 PM.png

The first two components of Blox are a cluster state service that handles event streams coming from ECS and a daemon scheduler that helps launch daemons on container instances. You can read more about Blox here.

Screen Shot 2016-12-01 at 1.25.38 PM.png

Moving on to the last compute area, Vogels talked about serverless/Lambda. Lambda already supported a number of languages, and AWS added to that list with support for C#.

awsreinvent_2016-dec-01-14

Vogels then mentioned that one of the most frequent requests they receive from users is the ability to execute tasks at the edge of the AWS content delivery network instead of having to go back to a source further away and incurring unwanted extra latency. To address this request, Vogels announced AWS Lambda@Edge. This new service can inspect HTTP requests and execute Lambda functions at CloudFront edge locations when appropriate. You can read more about AWS Lambda@Edge here.

Screen Shot 2016-12-01 at 1.40.13 PM.png

Finally, to coordinate multiple Lambda functions in a simple and reliable manner, Vogels announced AWS Step Functions. This service gives users the ability to visually create a state machine which specifies and executes all the steps of a Lambda application. A state machine defines a set of steps that performs work, makes decisions, and controls progress across Lambda functions. You can read more about AWS Step Functions here.

Screen Shot 2016-12-01 at 1.42.28 PM.png

Wrapping up his keynote, Vogels summarized all the product announcements that had been made during his and Jassy’s keynotes.

Screen Shot 2016-12-07 at 4.23.56 PM.png

With that, Vogels ended his keynote with a charge to the audience to use all the tools they have been given to go and transform their businesses.

AWS re:Invent 2016 First Keynote: Andy Jassy Is Your Shazam

In addition to this post, please also click here to read my AWS re:Invent Tuesday Night Live with James Hamilton recap and here to read my AWS re:Invent 2016 second keynote recap from Thursday.

shazamdc6

I grew up watching a TV show called Shazam! which was based on a comic I also read by the same name. The main protagonist was a superhero called Captain Marvel, who was given his superpowers by a wizard named Shazam. Captain Marvel used the power of Shazam to fight evil and to help save the human race.

At the first keynote for AWS re:Invent 2016, Andy Jassy, CEO of Amazon Web Services, played the part of the wizard who could give everyone cloudy superpowers as he wrapped the keynote around the theme of superpowers. You can view the keynote in its entirety below. You can also read on to get a digest of Jassy’s keynote along with links to get more information about the announced new services.

To set the table, Jassy started the keynote with a business update before giving what everyone in attendance and tuning in was waiting for – a litany of new AWS features and capabilities.

Screen Shot 2016-11-30 at 11.02.45 AM.png

Amazon Web Services continues to grow at an astounding rate with no let up in sight. It is by far the fastest growing billion dollar enterprise IT company in the world, suggesting that it is a safe choice for enterprises.

Screen Shot 2016-11-30 at 11.05.59 AM.png

And the growth is not just coming from startups anymore but includes a growing stable of enterprise customers.

Screen Shot 2016-11-30 at 11.03.22 AM.png

While the keynote included something for everyone, Jassy clearly had new enterprise customers in mind as he walked through the value proposition for AWS, explained basic AWS services, unveiled new services and directed his ire at Larry Ellison and Oracle. And to frame the rest of his keynote, Jassy assumed his Shazam wizard persona and explained what AWS can do for customers to give them cloudy superpowers.

werner_2016-nov-30-7

The first superpower theme to be highlighted was supersonic speed and how AWS enables customers to move more quickly. This not only refers to customers being able to launch thousands of cloud instances in minutes but the ability to go from conception to realization of an idea by taking advantage of all the many services that AWS has to offer.

Screen Shot 2016-11-30 at 11.10.22 AM.png

While AWS already boasts more services than any other cloud provider, Jassy pointed out that their pace of innovation has been increasing to the rate of 1,000+ new features or significant capabilities rolled out in 2016. That equates to an average of 3 new capabilities added per day.

Screen Shot 2016-11-30 at 11.12.53 AM.png

Continuing the focus on supersonic speed, Jassy followed with announcements about new EC2 instance types to add to the already burgeoning compute catalog. In particular, updates to four instance type families, to meet varying compute use cases, were announced.

Screen Shot 2016-12-03 at 5.13.04 PM.png

Two new extra-large instance types were added to the T2 family, which respectively double and quadruple the resources of the large instance type. T2 instances are suited for general purpose workloads that require occasional bursting, and the extra-large instances give users more bang for their buck while providing even more burst capacity. You can read about the new T2 instance types here.

Screen Shot 2016-12-03 at 5.04.17 PM.png

For memory intensive workloads, a new R4 instance type was announced which effectively doubled the capabilities of the previous R3 instance type. This memory-optimized instance type is suitable for any workload that benefits most from in-memory processing. You can read more about the new R4 instance type here.

Screen Shot 2016-12-03 at 5.08.00 PM.png

A new I3 instance type was introduced that is optimized for I/O intensive workloads. This new instance type will use SSDs to increase IOPS capabilities by orders of magnitude over the current I2 instance type. The I3 will be ideally suited for transaction-oriented workloads such as databases and analytics. You can read more about the new I3 instance type here.

Screen Shot 2016-11-30 at 11.29.30 AM.png

Next up was the new C5 compute-optimized instance type using the new Intel Skylake CPU. The C5 will be suitable for CPU-intensive workloads such as machine learning and financial operations requiring fast floating-point calculations. You can read more about the new C5 instance type here.

Screen Shot 2016-11-30 at 11.30.31 AM.png

Another area where speed is important is computational workloads that require a Graphics Processing Unit (GPU) to offload processing from the CPU. Jassy announced that AWS is working on a feature called Elastic GPUs for EC2. This will allow GPUs to be attached to any instance type as workload demands require, similar in concept to Elastic Block Storage. You can read more about Elastic GPUs here.

Screen Shot 2016-11-30 at 11.32.33 AM.png

The last new instance type to be announced was the F1 instance type utilizing customizable FPGAs which will give developers the flexibility to program these instances to meet specific workload demands in a way that could not be done with standard CPUs. You can read more about the new F1 instance type here.

Screen Shot 2016-11-30 at 11.37.06 AM.png

Accelerating how fast users can move goes beyond new hardware and new instance types. There is also the need to simplify complex tasks whenever possible. Cloud providers like Digital Ocean have carved out a strong niche market by specializing in offering no-frills Virtual Private Servers (VPS). A VPS is a low-cost hosted virtual server that is designed to be easy for users to set up and suitable for running applications that do not have high performance requirements.

AWS is taking VPS providers like Digital Ocean head on with their new Amazon Lightsail service. For as little as $5 a month, users can launch new instances in their VPC and do so by walking through minimal configuration steps.

Screen Shot 2016-11-30 at 11.34.41 AM.png

Behind the scenes, Lightsail will create a VPS preconfigured with SSD-based storage, DNS management, and a static IP address. As underscored below, all the steps in the box are performed on behalf of the user. You can read more about Amazon Lightsail here.

Screen Shot 2016-11-30 at 11.35.30 AM.png

Moving on to the next superpower that AWS can give users, Jassy talked about x-ray vision and how it can benefit cloud users. The first benefit was mainly a not-so-subtle dig at Larry Ellison, Oracle and other legacy vendors.

Screen Shot 2016-12-03 at 11.20.12 PM.png

Jassy’s argument was that on AWS, users can run their own tests and benchmarks on true production-like environments instead of accepting the word of untrustworthy vendors. It was one of many negative attacks on Oracle during Jassy’s keynote.

Getting back on point, Jassy talked about the benefit for users of being able to perform business analytics on the data they’ve uploaded to AWS as part of the x-ray vision power that AWS gives to them. Jassy then highlighted the breadth of the existing AWS services for doing analytics to help users better understand their customers.

Screen Shot 2016-11-30 at 11.41.43 AM.png

Enhancing this portfolio, Jassy unveiled a new service called Amazon Athena. Athena is a new query service for analyzing stored S3 data using standard SQL. In essence, users can treat their S3 buckets as a data lake and perform queries against unstructured data to unearth actionable intelligence. You can read more about Amazon Athena here.

Screen Shot 2016-11-30 at 11.45.22 AM.png

Another benefit of “x-ray vision” which Jassy presented was the ability for users to see meaning inside their data through artificial intelligence. Jassy pointed out that Amazon, the parent company, has been leveraging artificial intelligence and deep learning for their own businesses.

Screen Shot 2016-12-04 at 9.51.07 PM.png

Naturally, AWS is leveraging the learnings and tools of Amazon to create a suite of new services focused on artificial intelligence called Amazon AI.

Screen Shot 2016-11-30 at 11.52.12 AM.png

The first service in the suite is Amazon Rekognition for image recognition and analysis. This service is powered by deep learning technology developed inside Amazon that is already being used to analyze billions of images daily. Users can leverage Rekognition to create applications for use cases such as visual surveillance or user authentication. You can read more about Amazon Rekognition here.

Screen Shot 2016-11-30 at 11.52.40 AM.png

Moving from image to voice AI, Jassy next introduced Amazon Polly, a service for converting text to speech. Polly initially supports 24 different languages and can speak in 47 different voices. Also powered by deep learning technology created by Amazon, Polly can correctly render text that may have ambiguous meanings by understanding the context of the text. Users can leverage Polly to create applications that require all types of computer-generated speech. You can read more about Amazon Polly here.

Screen Shot 2016-11-30 at 11.54.25 AM.png

Rounding out the new AI suite, Jassy introduced Amazon Lex for natural language understanding and voice recognition. Based on the same deep learning technology behind Alexa, which powers the Amazon Echo, users can build Lex-based applications such as chatbots or anything that supports conversational engagement between humans and software. You can read more about Amazon Lex here.

screen-shot-2016-11-30-at-11-55-59-am

Another superpower trumpeted by Jassy was that of flight, which he used as a metaphor for having the freedom to build fast, to understand data better and, most importantly, to escape from hostile database vendors. To incentivize users to leave their traditional database vendors, AWS had previously introduced the Database Migration Service and the Amazon Aurora MySQL-compatible database service. As it turned out, enterprises liked Aurora but also wanted support for PostgreSQL. So Jassy took this opportunity to announce a new Amazon Aurora PostgreSQL-compatible database service.

Screen Shot 2016-11-30 at 12.48.04 PM.png

This new service uses a modified version of the PostgreSQL database that is more scalable and has 2x the performance of the open source version of PostgreSQL but maintains 100% API compatibility. You can read more about PostgreSQL for Aurora here.

Screen Shot 2016-11-30 at 12.48.21 PM.png

The last superpower discussed by Jassy was shape-shifting, which was another metaphor, this time for AWS’ ability to integrate with on-premises infrastructures. To kick off this section of the keynote, Jassy revisited an announcement that had been made previously of a joint service called VMware Cloud on AWS. This service is simply a managed offering, running on AWS, that supports VMware technologies such as vSphere, vSAN and NSX. You can read more about VMware Cloud on AWS here.

Screen Shot 2016-12-05 at 12.00.00 AM.png

Then in perhaps a somewhat tortured attempt to keep to the current theme, Jassy tried to expand the meaning of on-premises infrastructure beyond servers in the data center to sensors and IoT devices.

Screen Shot 2016-12-05 at 12.11.32 AM.png

Making the transition to talking about IoT services, Jassy discussed the challenges of running devices at the edge of the network in order to collect and process data from these sensors and a growing number of IoT devices.

awsreinvent_2016-nov-30-10

To help address these challenges, Jassy announced their new AWS Greengrass service, which embeds AWS services like Lambda in field devices. Manufacturers can OEM Greengrass for their devices, and users can leverage Greengrass to collect data in the field, process the data locally and forward it to the cloud for long-term storage and further processing. You can read more about AWS Greengrass here.

awsreinvent_2016-nov-30-11

Of course, any discussion about on-premises infrastructure by AWS ultimately leads back to their desire to move all on-premises workloads to what they consider the only true cloud – AWS. So perhaps it’s no surprise that Jassy would wrap up his keynote with two solutions for expediting the migration of data to AWS.

During the last re:Invent in 2015, AWS announced the Snowball which is a 50 TB appliance for import/export of data to and from AWS. As these Snowball appliances have been put to use, customers have expressed a desire for additional capabilities such as local processing of data on the appliance. To facilitate these new capabilities, Jassy announced the new Amazon Snowball Edge.

Screen Shot 2016-11-30 at 1.14.04 PM.png

The Snowball Edge adds more connectivity, doubles the storage capacity, enables clustering of two appliances, adds new storage endpoints that can be accessed from existing S3 and NFS clients and adds Lambda-powered local processing. You can read more about the AWS Snowball Edge here.

Screen Shot 2016-11-30 at 1.14.31 PM.png

Going back to the enterprise and rounding out the keynote, Jassy asked the question, “What about for Exabytes (of data)?” The answer, Jassy proposed, is a bigger box. Then in a demonstration of showmanship worthy of any legacy vendor, out came the new Amazon Snowmobile.

Screen Shot 2016-12-05 at 12.54.36 AM.png

The proposition of the Snowmobile is very simple. Enterprises will be able to move 100 PBs of data at a time so that an exabyte-scale data transfer that would take ~26 years to do over a 10 Gbps dedicated connection can be completed in ~6 months using Snowmobiles. You can read more about the AWS Snowmobile here.

Screen Shot 2016-12-05 at 12.55.41 AM.png

The spectacle of the Snowmobile being driven on stage proved to be an appropriate capper to the morning keynote with Andy Jassy’s turn as the superpower-giving wizard, Shazam.