
GTC 2010 Trip Report

I spent the week of September 20th through 23rd at NVIDIA’s GPU Technology Conference, and I’m now reporting back on some of the events that particularly interested me. I went to a whole range of talks, varying wildly in subject and quality, so I will gracefully skip over some of them and focus on the events that relate to my research or that particularly struck my fancy.

As is usually the case with these conferences, I find that the most valuable experience is meeting the community of researchers from around the world. Talks, papers and the like can be read on your own, but the conversation and networking at these conferences make the visit absolutely worthwhile. GTC was no different, and I connected with several people I aim to stay in touch with as research colleagues. This also seems to be an apt place to start my trip report! So, I met the following people during the week:

  • Tobias Brandvik (Cambridge), doing stencil abstractions for fluid flow solvers on multi-GPU clusters
  • Bryan Catanzaro (UC Berkeley), building Copperhead, a data-parallel Python derivative that completely abstracts away the GPU
  • Vasily Volkov (UC Berkeley), working on low-level performance tuning of GPUs and breaking down the CUDA abstractions
  • Jason Cohen (NVIDIA), part of the Parallel Nsight debugger team (Rick Shane from Adobe introduced us)
  • Nathan Bell (NVIDIA), the other half of the Thrust developer team (Jared Hoberock introduced us)

I will now attempt to distill the most interesting talks I attended down to their core observations, along with any notes that especially stood out to me.

Opening Keynote

Jen-Hsun Huang, NVIDIA CEO

Jen-Hsun was very much pushing the importance of parallel programming, reiterating the arguments about the power wall the industry has hit, and emphasizing that NVIDIA has been building highly parallel hardware for years now. The Senior Vice President of Content and Technology, Tony Tamasi, showed off several demos of Fermi’s tessellation capability (an endless city, procedurally generated and tessellated on the GPU to give the equivalent of a 100 billion triangle scene). He moved on to the physics simulation capabilities of these GPUs by showing a re-imagining of Ron Fedkiw’s lighthouse scene running in real time. A multi-grid height field combined with particles gives real-time water, while flotsam is simulated as rigid bodies. All three simulations are coupled and run in real time. Although it still looks simulated, it’s definitely ahead of today’s games.

The big statistic here for developers is the rate of CUDA adoption. NVIDIA very much pushes the idea that they have 100 million GPUs out in the field, all of which can run CUDA programs. The reality of the situation is, naturally, not nearly this good, but it’s a nice statistic to have. The Folding@Home and SETI@Home people report statistics heavily skewed towards users running massively parallel processors, so there’s surely some truth to these numbers.

NVIDIA announced CUDA-x86, a new compiler from PGI that compiles CUDA code to x86, letting developers write programs that run on multicore processors as well as throughput-oriented GPUs. In my mind this is just a nice-to-have, since none of the serious optimizations you do for the GPU (think coalesced memory accesses, or specific thread groupings to exploit vector lanes and vector lane synchronization) will carry over to x86, and they might even hurt performance (GPU-focused optimizations causing cache misses). Still, the write-once-run-anywhere dream is clearly very important, which is great for the research I’m working on.
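
To make “coalesced memory accesses” concrete, here is a small illustration of my own (not from the keynote, and assuming a CUDA-capable GPU with PyCUDA installed): the first kernel has adjacent threads read adjacent elements, which the GPU memory system merges into a few wide transactions, while the second walks each thread down its own contiguous chunk – a layout a cache-based CPU would usually be happier with.

    # Illustrative only: coalesced vs. per-thread-chunked reads, the kind of
    # GPU-specific idiom that does not translate to x86.
    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    // Adjacent threads touch adjacent elements: accesses coalesce.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Each thread owns a contiguous chunk: cache-friendly on a CPU, but
    // adjacent GPU threads are now 'chunk' elements apart in memory.
    __global__ void copy_chunked(const float *in, float *out, int n, int chunk)
    {
        int start = (blockIdx.x * blockDim.x + threadIdx.x) * chunk;
        for (int j = 0; j < chunk && start + j < n; ++j)
            out[start + j] = in[start + j];
    }
    """)

    n = 1 << 20
    a = np.random.rand(n).astype(np.float32)
    out = np.empty_like(a)
    copy_coalesced = mod.get_function("copy_coalesced")
    copy_coalesced(cuda.In(a), cuda.Out(out), np.int32(n),
                   block=(256, 1, 1), grid=((n + 255) // 256, 1))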

Several other impressive demos were also shown: Dr. Black’s beating-heart surgery, which tracks the heart in real time so a robotic scalpel can make incisions; Adobe’s David Salesin showing off refocusing with plenoptic lenses (work originally done by Ren Ng at Stanford); and the iRay photorealistic raytracer running on 64 Fermis, rendering to your web browser at interactive rates. Clearly graphics has lots of evolution left as it enters the world of massively distributed computing.

Lastly, NVIDIA announced the codenames of their next two chips – Kepler and Maxwell – which will aim for 3 times and 10 times the performance per watt of today’s Fermi, respectively.

A Fast, Scalable High-Order Unstructured Compressible Flow Solver

David Williams & Patrice Castonguay (Stanford)

I was curious to find out how this group built their flow solver to run on a GPU cluster. This is an example of what we’d like Liszt (our research language) to be able to do, so seeing a hand-written version was instructive. They followed the MPI approach that is generally used – partition your unstructured mesh, place one partition on each machine, and create ghost cells for the data you want to share across partition boundaries. They implemented their algorithm using a gather approach: the GPU performs two stages of work, a first stage to calculate cell-based values, and a second stage to reduce those values to edge-based values. The synchronization between the two stages includes an MPI all-to-all step to resolve ghost cell values.
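
As a rough sketch of that structure (entirely my own illustration, not their code – a toy 1-D “mesh” with numpy and mpi4py standing in for their unstructured mesh, CUDA kernels and hand-tuned MPI): each rank does its per-cell work, trades boundary values with its neighbours, then reduces cell values onto edges.

    # Toy sketch of the two-stage gather: per-cell work, ghost exchange, per-edge reduce.
    # Run with e.g. `mpirun -np 4 python gather_sketch.py` (file name is hypothetical).
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.rank, comm.size

    # Each rank owns n_cells cells, padded with one ghost cell on each side.
    n_cells = 8
    u = np.arange(n_cells, dtype=np.float64) + rank * n_cells
    padded = np.empty(n_cells + 2)

    # Stage 1: per-cell computation (a real solver runs a CUDA kernel here).
    padded[1:-1] = u * u

    # Ghost exchange: the talk described an MPI all-to-all to resolve ghost
    # cell values; here each rank trades one boundary cell with each neighbour.
    sendbuf = [None] * size
    if rank > 0:
        sendbuf[rank - 1] = padded[1]
    if rank < size - 1:
        sendbuf[rank + 1] = padded[-2]
    recvbuf = comm.alltoall(sendbuf)
    padded[0] = recvbuf[rank - 1] if rank > 0 else padded[1]
    padded[-1] = recvbuf[rank + 1] if rank < size - 1 else padded[-2]

    # Stage 2: reduce cell values to edge values (each edge reads its two cells).
    edge_vals = 0.5 * (padded[:-1] + padded[1:])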

Since they wrote a specific instance of a RANS algorithm, they did not do any scheduling work or fine-grained synchronization; their two-stage gather was enough to run the algorithm. They were getting good linear speedups, achieving a sustained 1.3 teraflops on a cluster of 16 GPUs with a mesh of 320,000 cells.

New Programming Tools for GPU Computing

Sequoia, Copperhead, GMAC, Thrust

Unfortunately, a panel discussion with 3-minute introductions for each project is never enough to really understand any of them. The most striking part of this panel was the obvious programming-language direction researchers have taken. Except for Thrust (though even it can be considered an embedded domain-specific language), all the work has a programming language spin on it. The major concern of the audience was clearly the support issues and feature scarcity of new programming languages, which the projects addressed differently – Sequoia tries to be a runtime more than a full language, Copperhead attempts to be deeply coupled to Python, Thrust passes itself off as a library, and GMAC aims to be language-agnostic, creating a universal address space between accelerators (GPUs) and processors that any language can take advantage of.

PyCUDA (2041)

Andreas Klöckner

Andreas’ PyCUDA talk was mostly an introduction to PyCUDA, with a brief overview of how it works and the motivation behind it. I found this talk especially interesting, since he took an approach very similar to the way web frameworks use templates to generate web pages. Kernels in PyCUDA are strings of text, with embedded Python variables that are replaced when you ask his framework to compile the kernel. He built this JITting engine as an extension of Python, allowing you to construct kernels at runtime and hand them to the nvcc compiler, which compiles the CUDA code on the fly. I liked the fairly low-level control he lets you achieve from inside Python, but PyCUDA does not attempt to abstract away CUDA or the GPU. It is, rather, very similar in spirit to the Boost Python bindings – allowing you to build software in Python, rewrite the slow parts in C (or CUDA), and call these low-level functions directly from Python. PyCUDA has the added benefit that you do not even need to leave the Python interpreter. His whole approach was fascinating, especially since this is what I would have done had I been faced with a similar problem, given my web framework experience. Andreas likens this to the LISP-style metaprogramming that’s been around since the 60s – manipulating kernels as strings, “pasting” in values on the fly.
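
A minimal sketch of that workflow (my own toy example, not one from the talk, and assuming a CUDA-capable GPU with PyCUDA installed): the kernel below is an ordinary Python string with a value pasted in before PyCUDA hands the source to nvcc.

    # Runtime kernel generation with PyCUDA: the block size is a Python value
    # substituted into the CUDA source text before it is compiled.
    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    BLOCK = 256  # value "pasted" into the kernel text at runtime

    source = """
    __global__ void scale(float *x, float a, int n)
    {
        int i = blockIdx.x * %(block)d + threadIdx.x;
        if (i < n) x[i] *= a;
    }
    """ % {"block": BLOCK}

    scale = SourceModule(source).get_function("scale")

    x = np.random.rand(1 << 20).astype(np.float32)
    scale(cuda.InOut(x), np.float32(2.0), np.int32(x.size),
          block=(BLOCK, 1, 1), grid=((x.size + BLOCK - 1) // BLOCK, 1))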

PyCUDA in general is built to interface tightly with numpy and scipy, two Python packages that supply MATLAB-like functionality to Python users. PyCUDA does not attempt to address the type inference problem of moving from a dynamically typed language to a statically typed one: it depends on the user to write kernels with the correct types, and on numpy to supply the runtime types of the multi-dimensional arrays that PyCUDA works with. Copperhead, Bryan Catanzaro’s data-parallel version of Python, abstracts away the GPU entirely, so it does have to deal with type inference, and he built a Hindley-Milner-style type inference system into Python to handle it. Copperhead is built on top of PyCUDA, so he uses PyCUDA’s JITting capabilities to get to the GPU – a great decision in my mind, since someone else is now responsible for the low-level details of catching errors and generating kernels.
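
The numpy coupling is easy to see with PyCUDA’s gpuarray module (a small illustration of my own, not from the talk): the dtype of the numpy array is the runtime type information PyCUDA relies on, and matching it to what a kernel expects is left to the user.

    # The runtime type comes from numpy: the GPU array inherits the ndarray's
    # dtype, and it is up to the user to keep it consistent with the kernel.
    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray

    a = gpuarray.to_gpu(np.random.rand(1000).astype(np.float32))
    b = gpuarray.to_gpu(np.random.rand(1000).astype(np.float32))

    c = (2.0 * a + b).get()   # elementwise work on the GPU, result back as a numpy array
    print(c.dtype)            # float32, carried over from the inputs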

Better Performance at Lower Occupancy (2238, Wed 15:00)

Vasily Volkov

(Slides here) Vasily has published several papers on understanding GPU hardware and tuning codes for it, and his talk pushed back on the focus on massive multi-threading in GPU apps, showing that instruction-level parallelism is still a very important tool on the GPU. In the process of demonstrating this, he disproved several of NVIDIA’s claims in their Programming Guide. This talk was very interesting to me, since it addressed many of the low-level architectural questions that Kayvon, Solomon, Zach, Jorge and I have discussed in detail. The word “occupancy” in this talk refers to the number of threads spawned as a percentage of the total number supported on the multiprocessor.

The general recommendation for hiding latencies is to use more threads per block and more threads per multiprocessor. Vasily demonstrates that faster codes tend to run at lower occupancies, citing as examples the differences between successive CUBLAS and CUFFT versions – every performance improvement came with a reduction in threads per block. Vasily shows in the talk how to hide both arithmetic latency and memory latency using fewer threads, and how to get a net performance increase in the process. He also attacks the fallacy that shared memory is as fast as the register file, pointing out the bandwidth difference between the two.

The heart of his talk is the fact that streaming multiprocessors are still pipelined machines, regardless of their multi-threaded, wide-vector-lane nature. By writing your code as sets of independent operations, keeping data dependencies to a minimum and structuring code to keep the pipeline filled, you can get massive performance regardless of the machine’s occupancy. He shows the roofline model for a simple SAXPY code, and how he can influence the memory-bound part of the model by doing multiple independent SAXPY operations in a single thread (since all but one of the input values are the same from one SAXPY operation to the next). He goes on to show that he can get 87% of the peak bandwidth available to an SM at only 8% occupancy (while cudaMemcpy achieves only 71% of peak). Lastly he makes the point that the banked nature of shared memory makes it impossible for shared-memory codes to achieve full bandwidth utilization. This leads to the recommendation to use as few threads as possible, each using as many registers as possible.
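
To make that concrete, here is a sketch of my own (not Vasily’s code, and assuming a CUDA-capable GPU with PyCUDA installed): a single thread loads x[i] and y[i] once and produces two independent SAXPY results from them, so the extra arithmetic is essentially free on a memory-bound kernel and the independent operations keep the pipeline busy.

    # Illustrative only: two independent SAXPY results per thread from one pair
    # of loads. The two updates share inputs but not outputs, so the hardware
    # can overlap their arithmetic while the memory traffic stays the same.
    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void saxpy_ilp2(const float *x, const float *y,
                               float *z0, float *z1, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float xi = x[i], yi = y[i];   // loaded once...
            z0[i] = 2.0f * xi + yi;       // ...fed to two independent operations
            z1[i] = 3.0f * xi + yi;
        }
    }
    """)

    n = 1 << 20
    x = np.random.rand(n).astype(np.float32)
    y = np.random.rand(n).astype(np.float32)
    z0 = np.empty_like(x)
    z1 = np.empty_like(x)

    saxpy_ilp2 = mod.get_function("saxpy_ilp2")
    saxpy_ilp2(cuda.In(x), cuda.In(y), cuda.Out(z0), cuda.Out(z1), np.int32(n),
               block=(128, 1, 1), grid=((n + 127) // 128, 1))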

The attention to detail in this talk, as Vasily played with the limits of the GPU, allowed him to break down several of the “nice” abstractions CUDA provides.

Large-Scale Gas Turbine Simulations on GPU Clusters (2118, Wed 16:00)

Tobias Brandvik

Tobias and his group at Cambridge have addressed a problem very similar to the one we are tackling with Liszt. They want to write a mesh-based simulation system that runs on today’s high-performance machines without having to rewrite the code for each architecture. Specifically, they are building a production-quality solver for use in the aerospace industry. Their approach has many overlaps with Liszt, targeting GPU clusters while avoiding the need to rewrite all their code for every variation of today’s heterogeneous machines. In contrast to our approach, they work at a fairly low level, since they only attempt to rewrite the roughly 10% of their code base that consists of stencil-based calculations. A stencil is defined as the mesh accesses and data reads/writes a kernel will perform, which allows them to generate specialized CUDA source code for each stencil in their application. This 10% of the code is responsible for roughly 90% of the run time, and can be abstracted as math kernels running specific stencils across a mesh. The cluster aspect of their code is still hand-written MPI code, but rather than writing GPU- or SMP-specific code that runs on each MPI node, they use these stencils and math kernels.
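
To illustrate the general idea (a toy sketch of my own, not their system – the real TurboStream generator handles far richer stencils and meshes): describe a stencil as a set of neighbour offsets with coefficients, and stamp out specialized CUDA source from it.

    # Toy stencil-to-CUDA source generation: the stencil is a list of neighbour
    # offsets plus coefficients; the generator emits a specialized kernel for it.
    STENCIL_TEMPLATE = """
    __global__ void %(name)s(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < %(halo)d || i >= n - %(halo)d) return;   // skip halo points
        out[i] = %(expr)s;
    }
    """

    def generate_stencil(name, offsets, coeffs):
        """Emit CUDA source computing out[i] = sum_k coeffs[k] * in[i + offsets[k]]."""
        expr = " + ".join("%ff * in[i + (%d)]" % (c, o)
                          for c, o in zip(coeffs, offsets))
        halo = max(abs(o) for o in offsets)
        return STENCIL_TEMPLATE % {"name": name, "expr": expr, "halo": halo}

    # A 1-D three-point smoothing stencil, specialized into CUDA source.
    print(generate_stencil("smooth3", offsets=[-1, 0, 1], coeffs=[0.25, 0.5, 0.25]))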

In terms of domain-specific optimizations, he referred to the 2008 Supercomputing paper by Datta et al., which showed a set of optimizations for running stencil codes at high performance on GPUs. They attempted to implement these optimizations as part of their source-to-source compilation process for kernels.

Their approach requires the programmer to hand-write the stencil and the math kernel, and it allowed them to embed the stencil language partially inside Fortran. They then took their current simulation system (TBLOCK, approximately 40kloc of Fortran) and factored the stencil-based calculations out into separate stencil definitions and math kernels. This let them keep most of their current simulation code while spending “a couple of months” (Tobias’ words) rewriting the calculations that fit this stencil scheme into their embedded language with accompanying stencil definition files. Their system, called TurboStream, has on the order of 15 different stencils in it, with 3,000 different stencil definitions, and they run simulations on a 64-GPU cluster at the University of Cambridge.

Tobias made an interesting comment during the talk: their biggest concern is not pure speed, since their solvers are relatively simple, but being able to handle much more data than they currently do – that was their main motivation for moving to GPU clusters. By way of example, he showed the fine detail of turbofan fins spinning through slits cut in the housing of a jet engine, and the fine-grained simulation detail around these slits – geometry that was previously ignored.

Overall Impressions

The biggest gain of the conference was the networking with several other researchers, and getting an overall view of the field as several groups attempt to solve the same problem – how do we write code that can run on all of today’s hardware choices?