In an earlier blog post, Building Scalable Web Systems, I discussed at a very high level some of the premises and foundations of architecting scalable systems. What the post did not deal with is insurance and downtime. What is the point of scalability if you have downtime, and what is the business continuity plan that maximizes available resources? The post also did not deal with success. What happens, and what tolerance do the business and the market have, in the case of massive and rapid adoption? How do you deal with it?
Enter cloud computing and Amazon’s EC2. For those not familiar, EC2 is a cloud environment that provides virtualized hosting services. They provide the hardware infrastructure, the pipes, storage and other services. You provide the application. The promise is that you can scale your hardware horizontally without having to deal with the hardware itself or its management and upkeep.
The first question is whether I believe it is 100% ready for prime time. You can argue that loads of companies are using it successfully, thus it is ready; I have talked to some of them and gotten mixed reviews. You can argue that some of the unconfirmed rumors are to be believed because there are indications of truth behind them, thus it is not ready; indeed, I have talked to some people who were not all that happy with EC2. So on and so forth.
The second question is whether it matters or not if it is 100% ready for prime time. And on the heels of this question: can it be used as a business continuity tool? I will answer both below.
The obvious third is cost. Through all my calculations (and other people’s), EC2 can be more expensive than running your own systems – of course at some external data center. But some of the advantages come in the form of quick adaptability, separation of concerns, system automation and self-healing procedures. I will go into more detail on this later as well.
Let’s start with the first question: in my opinion, EC2 is not 100% ready for prime time. It is a subjective opinion based on my findings and my level of comfort. Part of the decision is based on cost, but mostly on technical merit:
- Full virtualization is not where it needs to be, although there are ways to set up virtualization in the right configuration to make it not only more stable but also better performing. Not knowing EXACTLY how EC2’s virtualization layer works (and I am assuming virtualization) creates a big question mark over how things will truly stand up under friction. For example, it is hard to optimize a virtual machine to run DB servers that deal with millions of queries a day. Hardware optimization is important with relational DBs.
- Virtual NICs have shortcomings: they collapse under high traffic. The way to overcome this “limitation” is to bind each virtual NIC to a physical NIC. However, this defeats the purpose of virtualization and limits the theoretically unlimited number of VMs you can have running on a single server (only as many as you have physical NICs minus one; you need one NIC for the host operating system).
- Let’s not forget performance. Even though you can create a limitless number of VMs, the performance of each VM degrades with the provisioning of each new VM on a single server. What I do not know, however, is whether there is an optimal number of VMs. In other words, is there a hard limit below which each VM’s performance characteristics do not change regardless of the number of active VMs? Not too long ago I ran a virtualized farm. Unfortunately, the application I inherited was so horrible that its problems overshadowed any we had with the environment, so I cannot even begin to answer that question. Needless to say, both the application and the environment were replaced.
- And it is not just the DBs that need “specially” optimized hardware; application servers do as well. Maybe not as specialized, but a slow processor creates drag. And adding many VMs to spread the load creates more management and more moving parts, adding to the risk factor and the list of what can go wrong.
Continuing with the questions … YES!!! It does 100% matter that they are not ready for prime time. But really, what we need to ask is to what degree it matters. How far is EC2 from being 100% ready? I do not know, but they look darn close. By adding granularity to the question we come up with multiple degrees of “how much it matters”: 100%? 90%? And so on. In the case of EC2, I think it matters less than 20%. They seem that close to being ready – by my definition.
We can define cloud computing in many ways; however, let’s define it by a behavior: it needs to work like the electric company. Using Bob’s analogy, we do not really know how many generators the electric company has. We just know that when we want/need more juice, we plug into the wall and we get more juice. The more juice we use, the more we pay. EC2 seems to work the same way: if you need more capacity, you provision a new “machine” and off you go – well, sort of 😉 This creates the idea that if you need more juice, you plug into the wall and pay at the end for what you consume. Cost aside, it looks like an attractive proposition. But more importantly, think in terms of what it can do for you: almost instant scalability, when you need it and how you need it.
A little digression …
I do not worry anymore about scalable systems. I know how to build them; I have come up with a methodology and an architecture philosophy, and I have repeated their implementation with great success. However, while my architectures scale horizontally without much inconvenience, the problem of scalability has become an issue of “need” predictability and time for procurement. Now in English: how much traffic will I get, and how long does it take to get the hardware and deploy it? – I consider real estate and power procurement part of deploying the hardware.
Over the course of my experience I have found that I need 3 running months of data to predict needs 3 months ahead. I have reduced the problem of CAPEX planning to getting the initial installation right. This initial installation needs to have “enough” capacity to support the first 3 months of needs. But … what will the capacity needs be in those first three months? On a web based system, it is somewhat unpredictable. Sure, we could plan marketing campaigns designed to “limit” traffic. However, why would you limit and control traffic – there are a great many arguments in this area – if you have the potential of being ultra successful?
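To make the “3 running months to predict 3 months ahead” idea concrete, here is a minimal sketch of the kind of extrapolation involved. The traffic numbers and the simple average-growth model are illustrative assumptions, not measurements from any real system:

```python
# Project peak capacity needs 3 months out from 3 running months of
# observed peak traffic. Figures below are made up for illustration.

def projected_peak(observed_peaks, months_ahead=3):
    """Extrapolate peak requests/sec using average month-over-month growth."""
    # Growth factor between each pair of consecutive months.
    factors = [b / a for a, b in zip(observed_peaks, observed_peaks[1:])]
    growth = sum(factors) / len(factors)
    return observed_peaks[-1] * growth ** months_ahead

# Three running months of observed peak traffic (requests/sec).
peaks = [200, 260, 340]
print(round(projected_peak(peaks)))  # → 754
```

The point is not the model – any curve fit will do – but that the prediction window is only as good as the observation window, which is exactly why the first three months are the hard part.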
There is also the argument of cash flow and spending the right amount of cash on your infrastructure. Funding is a resource and needs to be maximized. Any hardware bought today that is not used and needed – software as well, but to a lesser extent – depreciates, and for less cash you can buy something better in the future when the resource is truly needed. Therefore, the initial deployment of hardware becomes critical not only from a capacity point of view but also from a “capital resource” point of view. This is not to suggest, however, that you should not deploy ahead of capacity needs. In other words, stay ahead of the curve: deploy 2 to 3 months earlier than needed. What I am suggesting is that you do not need to deploy hardware for needs more than 3 months out.
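A back-of-the-envelope illustration of why idle hardware burns capital. The 30%-per-year price/performance improvement and the dollar figures are assumptions picked for the example, not numbers from the post:

```python
# Illustrative only: what equivalent capacity costs if the purchase is
# deferred until it is actually needed. Assumes hardware price/performance
# improves ~30%/year; both that rate and the prices are assumptions.

def deferred_price(price_today, months_deferred, annual_improvement=0.30):
    """Price of equivalent capacity bought later instead of today."""
    monthly_factor = (1 - annual_improvement) ** (1 / 12)
    return price_today * monthly_factor ** months_deferred

# A $50,000 batch of servers bought 6 months before it is needed
# versus the same capacity bought on time:
now = 50_000.0
later = deferred_price(now, 6)
print(f"buy now: ${now:,.0f}  buy when needed: ${later:,.0f}")
```

Under these assumptions, buying six months early means paying roughly 20% more for the same capacity – which is the “capital resource” argument in numbers.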
Back to EC2 …
EC2 not being 100% ready creates a problem, compounded by the fact that it seems to work and seems only a short way from being the real deal. I resolved the problem by thinking, with Bob’s help, of EC2 as an insurance policy and a business continuity plan: I will build my staging environment on EC2 – even multiple staging environments.
Let’s define a staging environment as a facsimile of the production environment but scaled down. The facsimile, if at all possible, must contain ALL components.
Here is how to set up an insurance policy and business continuity plan using “the cloud”.
First, let’s look at process and environments. I advocate and implement total separation of environments as part of my software development methodologies. Developers work on their workstations, and QA engineering occurs in isolated environments that represent production as accurately as possible. Staging is the environment where UAT (User Acceptance Testing) occurs and where the build is certified and readied for release. Once it is certified, it is released to production. Staging must be not merely as accurate as possible, but precisely a facsimile of production. By hosting the staging environment on EC2 – or any such cloud environment, for that matter – you can have that precise facsimile at a small cost.
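One way to keep staging an honest facsimile is to derive it from the same component list as production and scale only the counts, never dropping a component. A minimal sketch – the component names and counts are invented for illustration:

```python
# Every component present in production appears in staging; only counts
# shrink. The topology below is hypothetical.
PRODUCTION = {"load_balancer": 2, "app_server": 12, "db_master": 1,
              "db_replica": 4, "cache": 6}

def scaled_staging(production, ratio=0.25, floor=1):
    """Scale production counts down, but never remove a component entirely."""
    return {name: max(floor, round(count * ratio))
            for name, count in production.items()}

print(scaled_staging(PRODUCTION))
```

Because staging is generated from the production definition rather than maintained by hand, it cannot silently drift away from being a facsimile.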
Let’s consider the case of wild success, the fact that it is hard to predict, and the capacity needed to sustain it. In this argument I will equate “success” with a “disaster” and look at how we not only recover from it but also ensure continuity:
If traffic spikes past available capacity, the user experience does not merely degrade; it disappears altogether. In this case a “disaster” has virtually happened, since the service becomes unavailable. In this particular disaster, having the right amount of hardware would have prevented it; as we discussed above, however, this is not always easy to determine. Just as in any disaster, the speed of recovery is vital to the continuation and success of the company. If staging is indeed a 100% scaled-down facsimile of production, then on an environment like EC2 scaling up to provide “capacity” should be a matter of minutes to hours, not days – basically, enough tolerance for the business not to experience a catastrophic downtime. Temporarily moving the production environment from self-managed to EC2 gives the company the necessary time to build out, and potentially better plan, capacity at its own facility. Once the “disaster” passes, production can be moved back off EC2.
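How many instances would “scaling up staging” actually take during such a spike? The arithmetic is simple; the per-instance throughput and headroom figures here are assumptions for illustration:

```python
import math

def instances_needed(peak_rps, rps_per_instance, headroom=0.25):
    """Instances required to absorb a traffic spike, with spare headroom."""
    return math.ceil(peak_rps * (1 + headroom) / rps_per_instance)

# A spike to 3,000 req/s, assuming one instance comfortably serves
# 150 req/s (a made-up figure; measure your own application).
print(instances_needed(3000, 150))  # → 25
```

The per-instance number is the one worth knowing cold before the spike: it is the single measurement that turns panic into a provisioning count.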
In order for this temporary migration to happen seamlessly and effectively, a high degree of automation needs to be incorporated into the overall infrastructure from day one. While the last updated staging environment (there can be multiple) will have the latest code and basic configuration, its data will not be current or accurate. Data migration needs to happen on a regular basis, and all staging environments should have, based on the installed release, the latest data set. Not only must the data updates happen automatically, but the whole discipline, from “disaster” detection to recovery, must be automated as much as possible. Once an issue is detected, a single script needs to be run to get the new production environment ready for operations, including the needed changes to DNS, load balancing and firewalls. Furthermore, provisioning and de-provisioning VMs should also happen as automatically as possible based on capacity needs.
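The “single script” might look something like the sketch below. The step names and ordering are my own illustration – a real version would call your DNS provider, load balancer and firewall tooling rather than stubs – but the shape of it, a fixed and rehearsed sequence with the DNS cutover last, is the point:

```python
# Sketch of the single failover script. Each step is a stub standing in
# for real tooling (DNS API, LB config, firewall rules); the names are
# hypothetical. What matters is the fixed, automated order.

def run_failover(log):
    steps = [
        ("verify_staging_data", "confirm the last data sync applied cleanly"),
        ("resize_staging", "provision extra VMs up to production capacity"),
        ("open_firewall", "allow public traffic into the cloud environment"),
        ("update_load_balancer", "point the pool at the cloud instances"),
        ("cut_dns", "TTLs should already be low; switch records last"),
    ]
    for name, description in steps:
        log.append(name)  # a real script would execute the step, not log it
    return log

print(run_failover([]))
```

Running it dry on every release (against staging itself) is what turns the script from documentation into an actual continuity plan.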
The last part of this EC2 consideration is cost. It is more expensive than it looks. Once you start racking up the VMs on a per-hour basis, racking up traffic at a premium cost and racking up storage, the $0.10 to $0.40 price ranges start to add up. This is cost that you incur every month and that you cannot “lease”. So, does it add up to more than what it would cost you to build it and manage it yourself? No, but the costs are comparable, at least in my calculations. Therefore, running on EC2 for 1 to 3 months, even though it duplicates the expense for that timeframe, does not, in theory, break the bank, and it provides insurance, albeit at a premium cost.
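To see how the per-hour pennies add up, here is a rough monthly bill for a small staging environment. The instance rates echo the $0.10–$0.40/hour range above; the bandwidth and storage rates are illustrative assumptions, not quoted prices:

```python
# Rough monthly bill for an always-on cloud environment. Instance rates
# reflect the $0.10-$0.40/hour range; transfer and storage rates are
# illustrative assumptions.

HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(instance_rates, gb_transfer, gb_storage,
                 transfer_rate=0.17, storage_rate=0.15):
    compute = sum(instance_rates) * HOURS_PER_MONTH
    return compute + gb_transfer * transfer_rate + gb_storage * storage_rate

# Five small instances at $0.10/hr plus one large at $0.40/hr, 24x7,
# with 300 GB of transfer and 100 GB of storage for the month.
bill = monthly_cost([0.10] * 5 + [0.40], gb_transfer=300, gb_storage=100)
print(f"${bill:,.2f}")
```

Even this modest setup lands in the hundreds of dollars per month, every month – comparable to amortized self-managed hardware, which is exactly the “premium insurance” trade-off described above.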
I have some strong opinions on how technology should be implemented. I do not care to know the secret sauce, but I do want to know in more detail than just general terms how things work – especially if I am going to bet my company on a platform. The unknowns, the uncertainties from the lack of SLAs and the assumptions around virtualization make me a tense CTO. The result: not 100% ready and trustworthy enough to build a company on. I admit, however, that what they have accomplished is very impressive, it makes sense, and of the other commercially viable cloud environments (I am not including Google, Yahoo! and MS), EC2 is the only one that, again in my opinion, is worth considering and ultimately using – whether for production or, as in my case, as an insurance policy to support unpredicted growth and create a conscientious business continuity plan. With time and maturity, EC2 will be a strong solution.