January 2014

Incredible Heights – Achieving Massive Scale on Cloud Infrastructure

By Avi Dowlatsingh, Chris Stiefeling and Tyler Kroetsch

One of our favorite Dilbert cartoons illustrates Dogbert spouting wisdom along the lines of “Blah, blah, blah, Cloud, blah, blah, Platform” followed by the Pointy Haired Boss stating that he is a “technologist and philosopher all in one!” (see: http://dilbert.com/strips/comic/2011-01-07/). Our intention in this article is to cut through some of the “blah, blah, blah” and illustrate some real world achievements which we have managed using Cloud infrastructure. In particular, the examples below leverage the Amazon Elastic Compute Cloud infrastructure to demonstrate the general power of Cloud computing. Note that there are many different cloud providers and service structures.

For those unfamiliar with the Amazon Cloud, it essentially provides the ability to create computer networks on demand using Amazon’s infrastructure. Computers can be created or shut down as needed and Amazon provides a variety of additional services such as persistent storage and archiving, workflow management, computer management, user management, etc. Billing is based on usage, so you only pay for computers and resources while they are in use. Idle resources can be shut down to avoid incurring additional costs. Amazon offers a variety of computer types ranging from basic single core computers to high–end workstations suitable for scientific computing. All of the examples in this article were produced using 16–core Amazon instances.

Without further preamble we will jump straight into the chart for which this article is named:

Figure 1: Variable annuity projection throughput across hardware configurations

Pretty amazing isn’t it?

OK, perhaps some explanation is in order. What the above chart shows is the rate at which Oliver Wyman’s ATLAS platform can perform variable annuity projections across different hardware configurations. We define a projection path as the modeling of a single variable annuity contract through a given economic scenario.

The first bar represents an in-house computing cluster consisting of 96 cores. The typical cost (or chargeback) for a cluster of this size could be as much as $150,000 per year. For variable annuity modeling, this would also be a relatively small amount of computing horsepower.

The subsequent bars represent throughput achieved on the Amazon Cloud using different numbers of 16–core compute instances (virtual computers running on Amazon).

To put these numbers into perspective, we’ll translate the above chart into run time. In our example, we processed slightly more than 3 billion projection paths (which can be arrived at by summing the stochastic iterations over all contracts and all sensitivities being processed). As we add compute nodes we see a nearly linear reduction in run time.
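To see how the path count accumulates, consider a back-of-the-envelope calculation. The contract, sensitivity and scenario counts below are purely hypothetical (the actual run mix is not shown here); they simply illustrate how quickly the total grows:

```python
# Hypothetical run mix -- numbers chosen only to illustrate how projection
# paths accumulate; the real total is the sum over all contracts and
# sensitivities of the stochastic scenarios run for each.
contracts = 150_000      # in-force variable annuity contracts (assumed)
sensitivities = 20       # sensitivities / shocked runs per contract (assumed)
scenarios = 1_000        # stochastic economic scenarios per run (assumed)

projection_paths = contracts * sensitivities * scenarios
print(f"{projection_paths:,} projection paths")   # 3,000,000,000
```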

Figure 2: Run time for the same workload across hardware configurations

Further reductions in run time are certainly possible—however, it’s a good idea to leave yourself some time to refill your coffee cup and pick up a snack.

Observation 1: If your application scales well you can select from different levels of performance on the Cloud. With typical in–house clusters the maximum capacity is defined by the size of the cluster.

Usage Models and Costs
Amazon offers a variety of services, computer configurations and usage models. To keep things simple we will compare two common approaches:

  • The “24x7 model”: an Amazon instance replaces a local computer which runs continuously. Purchased by the hour, a 16-core instance on a Windows platform would cost roughly $26,000 to run per year (~ $3.00 x 24 hours x 365 days). Amazon also has options to pre-purchase capacity at discounted rates (for example, the instance above may be had for roughly $10,000 with a one-year commitment). Overall this range probably compares reasonably well with what any other third party data center would charge. Our guess is that it is also in the ballpark of what an internal IT chargeback would look like.
  • The “On Demand” model: the Amazon instance is started when required and shut down when it is no longer needed. This is the fully variable case where billing is based on the time the computer is actually in use, so a 16-core instance running for 10 hours will cost roughly $30 (~ $3.00 x 10 hours). (A short cost sketch follows below.)

In the actuarial space, running models is generally not a continuous process: some periods may not require any runs at all while other periods (month- and quarter-end) may have very high computing demands. In this type of environment a fixed-size cluster will never be a good match; in periods of low demand there will be idle resources, whereas in high-demand periods there may be a shortage of resources.
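The per-instance figures quoted in the bullets above reduce to simple arithmetic. Here is a short sketch using the approximate $3.00 per hour rate; actual Amazon prices vary by instance type and region:

```python
# Approximate cost of the two usage models, at ~$3.00/hour for a 16-core
# Windows instance (actual rates vary by instance type and region).
hourly_rate = 3.00

# "24x7" model: the instance runs continuously all year.
always_on_annual = hourly_rate * 24 * 365      # ~ $26,280 per year

# "On Demand" model: the instance runs only while work is being processed.
on_demand_10_hours = hourly_rate * 10          # ~ $30

print(f"24x7 annual cost:     ${always_on_annual:,.0f}")
print(f"On-demand, 10 hours:  ${on_demand_10_hours:,.0f}")
```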

Consider the following scenario—each month we need to do a series of runs which require a total of 800 CPU hours. Assuming an in–house 16–core compute node costs $1,500 per month (which would be relatively cheap based on our experience), we arrive at something like the following:

Cluster Type   Compute Nodes   Total Run Time   Annual Cost
Local                     20         40 hours      $360,000
Local                     40         20 hours      $720,000
Cloud                     20         40 hours       $28,800
Cloud                     50         16 hours       $28,800
Cloud                    100          8 hours       $28,800
Cloud                    200          4 hours       $28,800

Notice anything remarkable about the cost column in the Cloud based runs? Apparently you can have your cake and eat it too—faster run time with no increase in price.

The "discount" in this case occurs because an in–house cluster needs to be sized to deal with peak demands, but we need to pay the full cost even when the cluster is idle. If the computing requirements were higher, say 4,000 hours, then the cloud cost would increase by a factor of five, but it also becomes increasingly likely that the local cluster will no longer be large enough to handle the peak demand periods.

Observation 2: Execution costs on Amazon are based on CPU hours whereas the cost of a fixed compute cluster is typically based on the number of compute nodes. Put another way, in the Cloud the cost of 10 computers for one hour is the same as the cost of one computer for 10 hours.

Spot the Savings
Amazon offers another nifty feature which can be used to save even more on run-time costs: a spot market where it is possible to purchase excess capacity at variable market prices. When there is significant excess capacity, we have observed compute nodes being discounted by as much as 85 percent. Spot capacity is purchased by indicating the maximum price which you are willing to pay for the instance. As long as the spot price remains below that maximum you will be charged the prevailing spot price for the instance. If capacity decreases, the spot price will increase, and if the resulting spot price exceeds the maximum you are willing to pay, your compute node may be terminated. As a result, when utilizing spot instances, your application must be hardened to withstand the possible loss of compute nodes.
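The billing mechanics can be illustrated with a toy example. The hourly prices below are invented purely for illustration; the point is that you pay the prevailing spot price rather than your bid, and the work is interrupted once the market moves above your maximum.

```python
# Toy illustration of spot billing: pay the prevailing spot price each hour
# while it stays at or below your maximum bid; the instance is interrupted
# once the market price exceeds the bid. Prices below are invented.
def spot_run(hourly_spot_prices, max_bid):
    total_cost, hours_completed = 0.0, 0
    for price in hourly_spot_prices:
        if price > max_bid:
            return total_cost, hours_completed, "interrupted"
        total_cost += price            # charged the spot price, not the bid
        hours_completed += 1
    return total_cost, hours_completed, "completed"

prices = [0.45, 0.48, 0.52, 0.60, 1.10, 0.55]    # assumed hourly spot prices
print(spot_run(prices, max_bid=0.90))            # (2.05, 4, 'interrupted')
```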

We have found the spot market to be tremendously cost effective—low priority runs can be executed quite cheaply by leveraging cheaper spot market capacity. High priority runs can be run much faster by augmenting standard compute nodes with additional spot market nodes. Once again you can have your cake and eat it too—faster runs and cheaper costs!

Given that you can generally run faster and cheaper on the cloud, why isn’t everyone running models on the cloud? Unfortunately there are three big frictions around cloud computing: security concerns, legal/regulatory issues and application design issues.

Security Concerns
Controversial Statement #1: Day-to-day use of email can be far more insecure than properly established Cloud processing. There, we said it. The next time somebody raises concerns about Cloud security feel free to burst out laughing and direct them to this article (note: not the actual recommended action).

Email is usually transmitted over shared infrastructure, may contain sensitive or even confidential information and is often unsecured and unencrypted (to facilitate both searching as well as allowing different email servers to communicate with each other). Yet we all send and receive hundreds of emails per day without a second thought (aside from thinking “I get way too much email”).

Of course, pointing out the security vulnerabilities of email is probably not the right approach to use when making a Cloud business case. While Cloud infrastructure is shared (meaning many users operate on the same infrastructure), individual accounts are private. This is not dissimilar to a corporate environment where many users share the overall resources but only certain users have access to any given resource. Unlike a corporation, however, Cloud resources are available to many unrelated individuals. As a result, Cloud providers have invested a great deal of effort in providing best-in-class security and methodologies to protect data and to isolate accounts from one another. Aside from demonstrating a long laundry list of third party security certifications, Amazon utilizes various security and encryption approaches (https, IPSec, firewalls, encryption) as well as running monitoring programs to detect security threats (see http://aws.amazon.com/security/ for further details).

However, whether something is secure enough is probably the wrong question to be asking; we should really ask ourselves "what are the implications if the data is stolen?" This framing leads us to think first about how to minimize the potential damage associated with the loss of information. That way, even if security were breached, we could feel comfortable that we had done everything possible to protect any information residing on the Cloud. As the old adage goes, "locks are for honest thieves."

Securing Your Own Data
Running models and projections may require and produce data which is confidential, but it is unlikely to deal with data which is sensitive or personally identifiable (such as names, social insurance numbers or credit card information). This is helpful as it limits the value of the information if it were to be stolen.

If we utilize the Cloud primarily for processing as opposed to storage of data we can also take some simple steps to minimize the value of information utilized on the cloud:

  • Limit data transferred to only what is needed to do the required processing;
  • Minimize the lifespan of data. Remove data from the Cloud once it has been processed and is no longer required on the Cloud; and
  • Encrypt data prior to storage on Cloud resources; it can then be decrypted immediately before it is processed (a minimal sketch follows this list).
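As an example of the last point, the sketch below encrypts a file locally before it ever reaches cloud storage and decrypts it only when processing is about to begin. It is a minimal sketch using the third-party Python cryptography package; the file names are placeholders.

```python
# Minimal sketch: encrypt locally before upload, decrypt only just before
# processing. Requires the third-party "cryptography" package; file names
# are placeholders.
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # keep the key on-premises, not in the cloud
cipher = Fernet(key)

with open("inforce_extract.csv", "rb") as f:       # hypothetical input file
    ciphertext = cipher.encrypt(f.read())

with open("inforce_extract.enc", "wb") as f:       # this is what gets uploaded
    f.write(ciphertext)

# ... later, immediately before the data is needed for processing ...
plaintext = cipher.decrypt(ciphertext)
```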

Playing the Shell Game
We are strong advocates of launching computer instances when a run is required and shutting them down when they are no longer needed. Not only is this cost effective, but it also makes your computers a moving target. What presents more of an attack surface—something which exists for a few hours at a time or something which exists 24/7?
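A minimal sketch of this launch-run-terminate pattern is shown below, using boto3 (the current AWS SDK for Python, which post-dates this article). The AMI id is a placeholder and error handling is omitted.

```python
# Sketch of the "launch, run, shut down" pattern with the boto3 AWS SDK.
# The AMI id is a placeholder; error handling is omitted for brevity.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",        # placeholder image with the model runtime installed
    InstanceType="c3.4xlarge",     # a 16-core instance type
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

# ... dispatch the run and collect the results ...

# Terminate as soon as the work is finished so nothing lingers (or bills) overnight.
ec2.terminate_instances(InstanceIds=[instance_id])
```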

One of the more effective techniques we employ is to omit information which would be required to interpret or make use of the data residing on the Cloud. We happily leave behind pieces of information which are not essential to the model as well as pieces of the application which are not required for model execution. We would consider a model to be pretty secure if the corresponding model editor application is nowhere to be found in the execution environment. You can imagine the analogy of reading a financial statement where the labels have been removed: even the most astute individual would be at a loss to interpret an obscure, unlabeled group of numbers.

Beyond these simple strategies, Cloud providers also offer an array of resources and tools which can further enhance security and controls. Users can be granted or denied access to resources, and limits can be applied around what can and cannot be done by a given individual or role. A thorough review of these resources should be a requirement of any project contemplating Cloud usage.
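As one concrete illustration (a sketch only, using boto3 and AWS Identity and Access Management), a run-time account can be restricted to a single storage bucket; the user name, bucket and policy below are hypothetical.

```python
# Sketch: a service account that can read and write one storage bucket and
# nothing else. User name, bucket and policy are illustrative placeholders.
import json
import boto3

iam = boto3.client("iam")
iam.create_user(UserName="atlas-runner")       # hypothetical service account

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::example-model-bucket/*",   # placeholder bucket
    }],
}

iam.put_user_policy(
    UserName="atlas-runner",
    PolicyName="model-data-only",
    PolicyDocument=json.dumps(policy),
)
```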

Legality
Legality is a different question from security. Just because something is secure doesn’t make it legal and vice-versa.

Addressing the question of legality will generally involve four areas:

  • Legality of data transfer—can the information be legally transferred and stored on infrastructure hosted by the Cloud provider?
    • One would generally assume that this is analogous to any decision to utilize a third party data center. Since this is a very common practice it is difficult to envision this as a barrier.
  • Software licensing—will any third party software applications need to be run on the Cloud environment?
    • Some software packages may have pre-built licensing options (e.g., Windows + SQL Server, etc.) while others will not. In general, we would start with the assumption that any licenses you already have are not transferable to a cloud environment and then work with the vendor towards a licensing arrangement. Software vendors may also have cloud-specific functionality which can make cloud usage even more cost effective.
  • Regulatory constraints—are there any regulatory standards which would prevent usage of a third party data center? For example, are there constraints on transferring data to an out-of-country data center?
  • Corporate policy—while not exactly law, it is generally necessary to abide by these rules. Depending on the organization there may or may not be guidelines governing the usage of Cloud and/or other third party infrastructure providers.

The degree of friction will vary from corporation to corporation—however, numerous regulated industries have successfully transitioned portions of their operations over to Amazon (including U.S. Government entities amongst others). In the end, we can only recommend building a business case and taking it through the proper legal and compliance route. More and more companies are developing policies around Cloud usage which will hopefully make the transition easier as we move forward.

Application Design
The Cloud provider along with the application itself will drive the approach used to move processing to the Cloud.

One approach is simply to replicate, on the Cloud, the environment in which the application currently runs. This approach will generally work, but is unlikely to yield substantial benefits such as increased capability, scale or other high-value improvements.

The second approach is to fully or partially integrate the application into functionality provided by the Cloud environment. This includes functionality to manage work flow and data, distribute tasks and dynamically scale processing capabilities up and down. While this requires some planning and development, this approach will also yield the largest benefits in terms of increasing capacity while decreasing costs.
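The dynamic-scaling piece of that integration can be as simple as sizing the worker fleet to the amount of queued work. The sketch below is purely illustrative; the function and parameter names are hypothetical rather than part of any real API.

```python
# Illustrative only: choose how many workers to run based on the queued work.
# All names and numbers are hypothetical.
def desired_workers(pending_tasks, tasks_per_worker_hour, target_hours, max_workers):
    # Ceiling division: enough workers to clear the queue within target_hours.
    needed = -(-pending_tasks // (tasks_per_worker_hour * target_hours))
    return min(needed, max_workers)

# e.g. 12,000 queued projection tasks, 200 tasks per worker-hour,
# finish within 4 hours, capped at 50 workers -> 15 workers
print(desired_workers(12_000, 200, 4, max_workers=50))
```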

Concluding Remarks
Advances in Cloud technology have occurred at a faster pace than most companies are willing (or able) to adopt them. Companies seeking to develop a competitive advantage via new business models or advanced computing capabilities have led the movement. Slower adoption rates occur where the Cloud is strictly seen as a cost savings measure—in these cases the various frictions cited above tend to slow or prevent the adoption from occurring.

Surprisingly, one of the most frequently cited problems of Cloud Computing is that it is public infrastructure. This seems like an odd reaction since we all use public infrastructure on a daily basis (to make phone calls, travel to work, send email, access health care, do our banking, etc.). While we take some precautions, there is some level of risk associated with all of these activities. However, it is difficult to imagine that we would forgo any of these activities as the benefits outweigh the risks by many orders of magnitude. Given that Cloud Computing is only a few years old we can only assume that it is just a matter of time before the risk vs. benefit relationship fully tilts and it becomes as ubiquitous as the telephone.

About the Authors
Avi, Chris and Tyler are part of the Atlas High Performance Computing team at Oliver Wyman. The Atlas software suite is a high performance platform which is best known for its Variable Annuity modeling capabilities.

For more information about the Atlas team please visit http://www.oliverwyman.com/atlas.htm.