By Chris Stiefeling
Anyone who has been in the workforce for more than a few years realizes that the average smartphone now packs more processing power than the computers available when they started their career. Most of the early increases in processing capacity came from our ability to place more transistors in the same amount of space, which yielded increasingly faster processors. However, the ability to further shrink die sizes on computer chips is constrained by the physical characteristics of the materials involved. Since it has become more difficult to increase processing speed, chip manufacturers have been increasing processing capacity by increasing the number of processor units, or cores. The side effect of this evolution is that programs must now be written to process in parallel in order to achieve performance gains.
GPU cards are a leading-edge example of this trend, with modern cards often packing several thousand processing units into a single card. For financial institutions, GPUs may offer faster processing times and a higher level of efficiency. However, the optimal solution will naturally depend on the value of the performance gains relative to the effort, cost and ability to implement alternate solutions.
In this article we explore some of the nuances of GPU cards and attempt to offer a balanced viewpoint when considering the merits of CPU versus GPU execution. We also offer some representative CPU and GPU benchmarks of insurance projections using Oliver Wyman’s ATLAS software platform.
What Is A GPU?
A GPU is effectively an enhanced video card. The card plugs into the PCIe bus of the computer and most GPU cards have video out ports. The enhancement comes from the fact that the card also has a large amount of computing capability and memory. This allows for specially compiled programs or functions to be executed on the GPU card. In this manner the GPU card can be thought of as a co-processor running alongside the main CPU.
Gaming and video processing have long been the main drivers of GPU development. After all, drawing realistic zombie blood splatter to the screen takes a lot of processing effort! However, other math-heavy industries such as medical imaging, biotech and finance have recognized the processing potential offered by these cards. In a non-graphics context, GPUs are sometimes referred to as GPGPUs (General-Purpose Graphics Processing Units), but the distinction is irrelevant from a software perspective.
The primary vendors in the GPU space are NVIDIA and AMD (formerly ATI). NVIDIA has taken a sizable lead, with relatively broad adoption of its CUDA libraries. Intel has also entered the ring recently with a co-processor card called the Intel Xeon Phi. The Xeon Phi is similar in terms of physical card structure (it also plugs into the PCIe bus) but operates quite differently from a GPU. As a result, we'll leave the discussion of the Intel card for another day.
GPUs are structured differently from CPUs in that they are designed to process many small programs or functions simultaneously. This shouldn’t be surprising when we consider that they were primarily designed to draw millions of pixels to the screen with no visible lag. In terms of structure the GPU card can be thought of as an array of lower power CPU cores.
When comparing the CPU and GPU the analogy I like to use is a dog sled race. The CPU team shows up with 16 Huskies. The GPU team shows up with 1000 Chihuahuas. Clearly the GPU team also had to rethink their sled design.
This leaves us with three important takeaways:
- Actions taken by the GPU card must be initiated by a CPU side program.
- GPU cores and CPU cores are generally not comparable. A GPU card may have thousands of GPU cores but these are quite different from CPU cores.
- In order to utilize the GPU your program(s) must be (re)written to do so.
Interaction Between The CPU And The GPU
As noted earlier, any work which the GPU does must be initiated by a CPU side program. Thus, in order to utilize the GPU, the CPU side program must be written and compiled specifically to do so. The GPU is generally not able to access CPU resources (CPU memory, I/O devices, etc.), meaning that any data to be processed by the GPU must be copied from CPU memory to GPU memory. The PCIe bus serves as the communication bridge for these copies.
Execution of a GPU enabled program looks something like illustration 1:
1. Program execution starts on the CPU;
2. The CPU side program performs various initialization tasks (authentication, loading input data, …);
3. The CPU side program copies relevant data from CPU memory to GPU memory;
4. The CPU calls specially compiled functions to execute on the GPU. The GPU executes the functions in parallel (possibly thousands of times);
5. The CPU copies output data from GPU memory back to CPU memory;
6. Steps 3 to 5 are repeated as necessary; and
7. The resulting data can then be processed by the CPU for output purposes.
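In CUDA terms, this flow reduces to a short host program. The following is a minimal sketch rather than production code; the kernel name projectCashflows and its placeholder calculation are hypothetical:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each GPU thread processes one item (placeholder math).
__global__ void projectCashflows(const double* in, double* out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) out[t] = in[t] * 1.05;
}

int main() {
    const int n = 1 << 20;
    std::vector<double> hostIn(n, 100.0), hostOut(n);

    // Step 3: copy input data from CPU memory to GPU memory across the PCIe bus.
    double *devIn = nullptr, *devOut = nullptr;
    cudaMalloc(&devIn,  n * sizeof(double));
    cudaMalloc(&devOut, n * sizeof(double));
    cudaMemcpy(devIn, hostIn.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    // Step 4: launch the specially compiled function; the GPU runs it across
    // many threads in parallel (here, one thread per item).
    projectCashflows<<<(n + 255) / 256, 256>>>(devIn, devOut, n);

    // Step 5: copy the output data from GPU memory back to CPU memory.
    cudaMemcpy(hostOut.data(), devOut, n * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(devIn);
    cudaFree(devOut);
    return 0;  // Step 7: results in hostOut are now available to the CPU side.
}
```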
Inside the GPU
GPU cards are optimized to execute the same set of program instructions over a large number of participants, referred to as processing threads. On NVIDIA cards, program code is executed by threads in groups of 32. Whenever possible, program code is executed in lock-step for all 32 participants. Conceptually this behaves something like illustration 2.
The GPU is most efficient when all threads execute the exact same instructions (note that the data will potentially be different across threads; I have denoted this with the index [t] in illustration 2). Cases which cause different sets of instructions to be executed across threads (for example, by taking different branches of if-else statements) can degrade performance, as the GPU can no longer execute a common set of instructions across the threads in the group.
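As a minimal sketch of how divergence arises (the names here are illustrative), consider a kernel where threads in the same group of 32 take different branches based on their own data; the hardware must then run each branch in turn, with part of the group idle:

```cuda
// Each thread computes its index [t] and branches on its own data element.
__global__ void divergentKernel(const double* x, double* y, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    if (x[t] > 0.0) {
        // Threads whose data satisfies the condition execute this path first...
        y[t] = sqrt(x[t]);
    } else {
        // ...while the rest idle, then execute this path in a second pass.
        y[t] = -x[t] * x[t];
    }
}
```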
On modern chips, calculations can be performed much faster than memory fetches; a fetch can be 100 to 200 times slower than the calculation itself. CPUs combat this with large data caches, aggressively staging data in fast memory close to the chip. GPUs, however, employ a different strategy: they hide memory access time by swapping across thread groups. As soon as one group is stuck waiting for data, the card will attempt to switch to another group which is ready to process. As long as some groups are ready to execute instructions, the GPU can continue processing. (See illustration 3)
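As rough, illustrative arithmetic: if a fetch takes roughly 200 times as long as a calculation, a group that touches memory every few instructions is idle most of the time unless dozens of other groups are resident and ready to run. At 32 threads per group, covering a single multiprocessor already demands on the order of a thousand threads, and covering the full card many times that.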
The design in illustration 3 has several significant implications:
- The GPU potentially needs thousands of threads in play at any given moment to operate efficiently; and
- Program efficiency is inversely related to the amount of data accessed.
We can bring this all together to identify the perfect program candidates for GPU execution:
- Very large number of iterations;
- Identical (or very similar) program code across iterations;
- A small number of input, output and intermediate variables; and
- High number of floating point calculations relative to the number of data elements.
In finance, a good example of an optimal candidate for GPU execution would be the valuation of a large portfolio of options (either by closed form or by simulation). This fits the above criteria very well: we have a large number of items/iterations to process, the valuation approach will be identical in all cases and we have a small number of variables which need to be tracked relative to the number of floating point operations.
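As a hedged sketch of what such a kernel might look like (illustrative only, not ATLAS code), a closed form Black-Scholes valuation maps naturally onto one option per thread:

```cuda
#include <math.h>

// Standard normal CDF via the complementary error function.
__device__ double normCdf(double x) {
    return 0.5 * erfc(-x * 0.7071067811865476);
}

// One option per thread: identical instructions, only the data differs.
__global__ void priceCalls(const double* spot, const double* strike,
                           const double* vol, const double* expiry,
                           double rate, double* value, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    double sd = vol[t] * sqrt(expiry[t]);
    double d1 = (log(spot[t] / strike[t])
                 + (rate + 0.5 * vol[t] * vol[t]) * expiry[t]) / sd;
    double d2 = d1 - sd;
    value[t] = spot[t] * normCdf(d1)
             - strike[t] * exp(-rate * expiry[t]) * normCdf(d2);
}
```

Each thread reads a handful of doubles, performs dozens of floating point operations and writes a single result, which is exactly the calculation-to-data profile described above.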
Be Wary Of Extrapolating Performance Claims!
GPU programs are often quoted as being hundreds of times faster than their CPU counterparts. Unfortunately, I feel many of these benchmarks carry a considerable bias. A few things to consider before you jump in with both feet:
- Many GPU benchmarks compare against a single threaded CPU program. So in our dog sled race analogy we get to subtract 15 of the Huskies before the race even starts! Now, in reality, many of the programs in existence today are single threaded and as a result grossly inefficient; however, attributing the efficiency gains to the GPU seems inappropriate. After all, if you can multi-thread your code to run on the GPU then you can also multi-thread it on the CPU side.
- There is a selection bias at play—as noted earlier certain types of problems execute very efficiently on the GPU. In addition, these programs have often undergone significant code optimization for execution on a GPU. GPU performance benchmarks will tend to be slanted towards this optimal cohort as they show the most significant gains.
Part of the reason that GPUs tend to show good performance is that they force you to redesign (or alternatively re-implement) the solution in a manner which allows the GPU to execute it efficiently. In practice, performance optimization of CPU code can yield huge improvements by itself. In many cases significant performance bottlenecks are not even located within floating point intensive code sections.
As a simple example, consider a program which takes 10 hours to execute. It was written many years ago, so it is single threaded and was never profiled or optimized. Assume that the program spends one hour on serial overhead activities (reading/writing files, parsing/formatting data, etc.) and nine hours doing number crunching.
Let’s assume that the number crunching can be executed in parallel. Let’s also assume that the GPU card can execute the number crunching code 160 times faster than the single threaded CPU calculation. However, to make a fair comparison we should also allow the CPU to process in parallel. This would then lead us to the information in Table 1.
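Assuming the nine hours of number crunching spreads evenly across 16 CPU cores (our 16 Huskies), the arithmetic behind Table 1 works out roughly as follows:
- Original single threaded CPU: 60 minutes of overhead + 540 minutes of crunching = 600 minutes;
- Multi-threaded CPU: 60 + 540/16 ≈ 94 minutes; and
- GPU: 60 + 540/160 ≈ 63 minutes.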
So we have now arrived at a place where our expected GPU factor of 160 has been diluted down to a factor of less than 10 when compared to the original CPU run time. Furthermore, the GPU advantage has been reduced to a mere factor of 1.5 when compared against the parallel version of the CPU code.
Assume we do another round of optimization and are able to substantially reduce the overhead to five minutes. The new run times look like Table 2.
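Under the same assumptions, the Table 2 arithmetic becomes:
- Multi-threaded CPU: 5 + 34 ≈ 39 minutes; and
- GPU: 5 + 3.4 ≈ 9 minutes.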
In this case, both code segments benefit significantly from the reduced overhead; however, the GPU is now able to leverage its number crunching gains for a much higher factor.
In the end the GPU card still wins by a factor of 4x over the optimized CPU code—nothing to sneeze at but a really long way from the ‘expected’ speedup of 160x. Furthermore, without optimizing the CPU side ‘overhead’ portion of the program the factor was limited to around 1.5 times the corresponding multi-threaded CPU code.
The point here is to highlight that there are three choices in this scenario:
- Live with the existing 10-hour run time;
- Optimize the CPU code to achieve a 39-minute run time; or
- Optimize the CPU code and move the number crunching code to the GPU to achieve a 9-minute run time.
The best solution will of course depend on the value of the performance gains compared to the effort, cost and ability to implement and support the different solutions. In addition, we should consider the potential risk of introducing defects as well as the maintenance costs associated with the program.
Two Realistic Benchmarks
We'll look at two benchmark samples from Oliver Wyman's ATLAS software platform. Both are Monte Carlo insurance projections, one simple and one complex. I have dubbed these benchmarks realistic since I think they are representative of the types of projections which insurers perform on a regular basis. Note, however, that they do not fall perfectly into the GPU-optimal category of problems, as their computational density (calculations relative to memory accesses) is lower than the ideal case.
In the simple case we are making random draws to determine when specific contract events occur and generating cash flows accordingly. A modest amount of input data is required for each contract and, more importantly, the state information required for each simulation is limited to a handful of variables. The portfolio in question requires roughly 900 million simulation paths to be executed.
The complex example is a variable annuity projection. In this case we are dealing with multiple market indices, multiple investment funds, numerous guarantee types and features, and multiple fee structures, along with dynamic customer behavior models. Each contract carries a significant amount of input data as well as needing to track many state variables within each projection path. This example evaluates roughly 120 million projection paths.
The variable annuity example also exhibits another issue on the GPU: memory pressure. In the CPU world we generally only need to worry about a few projections at a time, meaning the amount of space required for state or output information is relatively small. However, for the GPU to operate efficiently we need to queue potentially thousands of projections into a single batch. For simple instruments (options, bonds, etc.) this doesn't pose any difficulty. For complex products like variable annuities or universal life we can quickly exhaust the memory on the GPU card if we try to push everything through in one pass. This combination of needing enough work to keep the card utilized, but not so much that we exhaust its memory, is a significant design consideration when targeting the GPU.
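A minimal sketch of the batching pattern follows; the path count and per-path state footprint are entirely hypothetical, not ATLAS's actual design:

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>

int main() {
    // Hypothetical figures for illustration only.
    const size_t numPaths = 120000000;     // total projection paths to run
    const size_t stateBytesPerPath = 4096; // assumed per-path state footprint

    size_t freeBytes = 0, totalBytes = 0;
    if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess) return 1;

    // Leave some headroom, then size each batch so its state fits on the card
    // while still queuing enough paths to keep the GPU utilized.
    size_t batch = (freeBytes * 8 / 10) / stateBytesPerPath;
    batch = std::min(batch, numPaths);
    if (batch == 0) return 1;  // card too small for even one path

    for (size_t start = 0; start < numPaths; start += batch) {
        size_t count = std::min(batch, numPaths - start);
        // 1. copy this batch's inputs to GPU memory;
        // 2. launch the projection kernels over 'count' paths; and
        // 3. copy this batch's results back to CPU memory.
        printf("batch starting at %zu: %zu paths\n", start, count);
    }
    return 0;
}
```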
As can be seen in Tables 3 and 4, memory usage also tends to drive GPU performance. In short, almost every non-trivial program will become memory constrained at some point on the GPU. Simpler or computationally dense programs will be able to use more of the card’s floating point capacity whereas complex programs which touch a lot of memory will be able to use significantly less.
Both the CPU and GPU versions of the program are double precision and have been multi-threaded and optimized, so the comparison can be deemed fair. In the simple case the GPU is able to offer larger performance gains. In the complex case memory accesses become a significant burden for the GPU and the performance advantage narrows somewhat.
Should We Consider Moving To GPUs?
The answer here is a solid “maybe.” Some serious navel gazing may help to surface an answer.
Question 1 – What is the value in achieving improved performance?
Moving to GPUs generally makes sense in three scenarios:
- A real reduction in the number of compute nodes required to perform the processing, resulting in infrastructure savings. GPUs effectively allow us to pack more processing horsepower into each compute node. In many organizations cost is tied more to the number of computers than the type of computer, so running fewer compute nodes may yield real savings where charges or charge-backs are based on node counts.
- Real-time or near real-time processing requirements. Trading, hedging and other continuous time activities may require very quick turn-around. In this case GPUs will bring a solid advantage to the table. This is especially true for calculations which fall into the GPU sweet spot described above.
- Faster analytics, where a tangible gain in value can be realized by having the right information sooner. Examples include being able to perform more accurate experience/attribution reporting or being able to test or simulate the impact of applying different strategies going forward in time.
Question 2 – Do we have the right resources and infrastructure to support GPUs, or do we need to buy them?
GPU cards require additional power, space and cooling. This generally translates to specific hardware requirements which may present an organizational challenge (these generally fall into the black hole known as the non-standard hardware request). Alternatively, hardware infrastructure can be outsourced if the vendor can provide GPU enabled computers.
Developing your own GPU code means securing programmers who can program in C/C++ and understand GPU development, compilation and debugging. Buying GPU enabled programs will often mean that a new program must be brought in-house and existing programs converted to the new platform. GPU programs may or may not be able to run on CPU-only hardware. This may place constraints on where the program can be executed.
Question 3 – Do we have a good understanding of where our current bottlenecks exist?
In my view this is a key question. GPUs can provide a huge performance improvement for computationally dense problems (high number of independent, identical actions and a low number of inputs and outputs relative to the number of calculations). However, they will not solve any of the following:
- Inefficient/manual processes—unfortunately, we’ve all seen them—the processes which take anywhere from 4 to 14 days to run through. Manual compilation of data, spreadsheet/database updating and execution, processing manuals which say things like “open workbook x and copy data in from workbook y provided by finance.” Making the processing engine faster doesn’t help if the rest of the assembly line is broken.
- Slow data retrieval/parsing routines—early in my career we were struggling to achieve a specific performance objective on a program we had written (let’s call it 100 contracts per second). Upon profiling the application we discovered that the code retrieving data from the SQL Server was only able to provide a handful of records per second. We had essentially been assigned an impossible task and ended up having to send the data retrieval routines back to the drawing board. A lot of suffering could have been prevented had we understood that the database code was the slow spot from the outset.
Moving to a GPU will force you to resolve these issues, since the value of the GPU is zero otherwise. Hence the GPU often serves as a catalyst for resolving underlying performance issues which frequently sit on the CPU side of the program. It is a bit like buying a Ferrari and then attempting to drive it through a farmer's field: this is generally not the ideal time to learn that you also need to invest in building a nice smooth road.
Conclusion
Unlike the past few decades, performance improvements in computing will come from having more, rather than faster, processors. GPUs currently represent the pinnacle of parallel processing technologies.
On the plus side there is the opportunity to significantly increase our processing speed and capabilities. This trend will continue for the foreseeable future but only programs designed for parallel processing will see material gains.
Since the majority of older programs execute serially, realizing performance gains on these platforms will require either a program rewrite or a move to a platform which supports large scale parallel processing. While the upside may be substantial, there will be significant frictions around budgets, skills, conversion costs, risks and platform support. Every company and every program has its own quirks, so there is no one-size-fits-all recommendation.
Chris Stiefeling, FSA, leads the Atlas High Performance Computing team at Oliver Wyman. He can be contacted at www.oliverwyman.com/atlas.htm.