Archive for 2010

Design Principles for Structuring iPhone Apps

One of the biggest hurdles between writing small examples and building large applications is learning the principles for organizing and structuring code. Learning how to implement complicated algorithms might make you a good computer scientist, and is rigorously taught in schools, but structuring large applications has always been more of a black art to me, especially when writing code that interacts with the real world, where things happen asynchronously.

Thus, I’m putting together some of the major design principles in iPhone programming. I’ve been writing a music player app over the December holiday, and much of its software design comes from previous projects, from reading the iSoul and Dropbox API code, and from reading StackOverflow; here I’m putting some of those ideas together.


- (BOOL)application:(UIApplication *)application didFinishLaunchingWithOptions:(NSDictionary *)launchOptions {
    window = [[UIWindow alloc] initWithFrame:[[UIScreen mainScreen] bounds]];
    [window makeKeyAndVisible];
    return YES;
}

Firstly, Cocoa-based apps all start off with an AppDelegate. This is your main entry point into the world of iPhone and Cocoa apps, and it is our first and only place to store state for the duration of the program (I’m about to extend this idea, so bear with me). The AppDelegate is where Controllers (handling logic) are launched and long-running state is held. This leads us to two main ideas:


Model-View-Controller

Cocoa follows the idea that views (displayed to the user) are separate from business logic and data management (written in Controllers) and data representation (Models). This means your app will contain plenty of controllers, responsible both for handling user interaction and for handling interaction with data storage and data sources.

A controller can be an object responsible for talking to a REST API, or an object responsible for presenting a view to the user and capturing mouse and keyboard interaction with this view. I personally like to distinguish between these two by calling the former Managers and the latter ViewControllers, but they both fit into the MVC model the same way. I might get some flak on this, because people will disagree with me on whether a data source is a model, a controller, or something else. I prefer the controller nomenclature since I tend to consider data managers and view controllers structurally identical.

A model is a concise description of a piece of data. The controller talking to a remote API might return an array of model objects. My code might contain a Person object (a model), a PersonManager (a controller responsible for managing Person objects) and a PersonEditViewController (a controller responsible for presenting views that modify a Person object). Why this separation? Because now I can encapsulate the generic person logic in the Person model, the storage logic in the PersonManager, and the front-end logic in the ViewController.

Now I have all these controllers, managing data and presenting views, floating around. Structuring them and storing state brings me to the second idea:

Dependency Injection (and Singletons)

How do we manage global state? Or, in the case of our example, how does the PersonEditViewController work with the PersonManager to get a Person model object, display it to the user, capture the user’s edits, and store the new Person object?

The PersonManager can be made into a singleton object. Now, anywhere in the code, [PersonManager sharedManager] can be called and we get the single instance of this Manager. We can imagine “getPerson” and “savePerson” methods on this instance, and we’re off to the races. This is a great way to structure libraries, since the global library initialization code can be captured in some singleton instance, and we can use this library anywhere. In fact, this is how the heart of the Dropbox API is structured – on app load, a DropboxSession singleton instance is created, which captures the authentication details of your app. Anywhere in your code that needs Dropbox access starts off by getting the singleton instance of the DropboxSession and works from there.
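To make this concrete, here is a minimal sketch of the singleton pattern, using the PersonManager example from above (the names come from the example, and the sketch ignores thread safety):

```objc
// A minimal (non-thread-safe) singleton -- a sketch, not the Dropbox code.
@implementation PersonManager

static PersonManager *sharedInstance = nil;

+ (PersonManager *)sharedManager {
    if (sharedInstance == nil) {
        sharedInstance = [[PersonManager alloc] init];
    }
    return sharedInstance;
}

@end
```

Now any code anywhere can call [PersonManager sharedManager] and get the same instance back.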

Singletons break down the otherwise nice object-oriented nature of our code and make testing much harder, since we cannot unit-test objects with dummy instances of a singleton without modifying the singleton creation code. Singletons are no different from global variables – they break the modularity of your system, and make code harder to read and harder to reason about. The alternative is dependency injection. Initially we said that our AppDelegate is our only place to store shared state. Clearly, singletons allow state as well, but a more modular approach is using the AppDelegate to hold the major long-running state. So, we place our PersonManager object (the controller responsible for managing Person model instances) as an instance variable on our AppDelegate, and we inject it into any other controller that needs to access it – PersonEditViewController now has a “setPersonManager:” method or an “initWithPersonManager:” constructor. Structuring code this way makes dependencies perfectly clear and allows for unit testing by passing dummy dependencies into an object.
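The injected constructor might look roughly like this (a sketch using the example names from above, written in the pre-ARC retain/release style of the time):

```objc
// Dependency injection sketch: the manager is handed in, never looked up globally.
@implementation PersonEditViewController

- (id)initWithPersonManager:(PersonManager *)manager {
    if ((self = [super init])) {
        personManager = [manager retain];  // injected dependency, stored in an ivar
    }
    return self;
}

- (void)dealloc {
    [personManager release];
    [super dealloc];
}

@end
```

A unit test can now construct the controller with a stub manager; nothing in the class reaches out to global state.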

Now that we have a way to structure our code in general, we need to manage data flow between these controllers. This leads us to delegation and callbacks.

Delegation and Target-Action (Callbacks)

Both these approaches encapsulate the idea of communication through callbacks. Delegation is the approach of having a specific object – the delegate – be responsible for all the callbacks from the host object. Target-action is an approach where the host object can inform many “delegates” about behavioral changes. Let’s keep this rooted in practice.

Whenever an object creates a new object to do a specific task for it (for example, a Controller creating a View to present to the user, or a Controller creating a socket connection to a remote API), the one-to-one communication between this worker and the original object can be captured asynchronously by making the main object the delegate of the worker object. The worker object has a weak reference to its delegate, and calls methods on its delegate to communicate. These methods are captured in a common interface – in Objective-C by using Protocols, the equivalent of Java’s Interfaces. For example, the Dropbox API has a controller responsible for letting the user log into Dropbox. You can create this controller at any point, set yourself as the delegate object, and display it to the user. When the login is done, your object’s success or failure methods get called:

Your controller:

- (void)didPressLink {
    DBLoginController* controller = [[DBLoginController new] autorelease];
    controller.delegate = self;
    [controller presentFromController:self];
}

- (void)loginControllerDidLogin:(DBLoginController*)controller {
    [self updateButtons];
    [self.navigationController pushViewController:photoViewController animated:YES];
}

- (void)loginControllerDidCancel:(DBLoginController*)controller {
}
And in the DBLoginController, you can see where the delegate gets called:
- (void)didPressCancel {
    [self setWorking:NO];
    [self.navigationController.parentViewController dismissModalViewControllerAnimated:YES];
    [delegate loginControllerDidCancel:self];
}

Delegation is especially nice, since you can define many callback methods as part of the protocol, implement the ones you care about, and simply register yourself as a delegate. “worker.delegate = self” makes all the methods you wrote available to the worker. The target-action approach we’re about to see only connects a single method of yours to a “worker” (bad nomenclature, sorry!), but everyone gets to join the fun.

So, delegates work great when a single worker is spawned to do something for an object – where some long-running object creates a worker to do something for it – but they do not work in the case where many objects want to know when something happens asynchronously. If we want one-to-many communication, we can’t just have a single delegate. We can still structure communication around a Protocol and a list of callbacks, or use the slightly looser target-action approach. Here is the callback approach:

- (void)registerPlayerStateCallback:(id <PlayerStateCallbackProtocol>)callthis {
    [_callbacks addObject:callthis]; // NSMutableArray retains the object on add
}

- (void)notifyCallbacks {
    NSEnumerator *e = [_callbacks objectEnumerator];
    id <PlayerStateCallbackProtocol> callback;
    while ((callback = [e nextObject])) {
        [callback playerStateChanged:_state];
    }
}


These kinds of callbacks (or target-action) work especially well when you have some object that needs to know when long-running state changes. See the difference? When some long-running object spawns a worker to do something asynchronously, we use a delegate. If an object wants to know when (potentially long-running) state changes, the object registers an action of itself as a target of a state change. Yes, the two definitely overlap, but the one-to-one versus one-to-many distinction helps in deciding which one to use.

Say, in our original example, a MainViewController shows a list of people in the system. This view should change whenever the PeopleManager’s internal list of people changes. PeopleManager was created as shared state inside the AppDelegate, and we use dependency injection to pass it to the MainViewController; but setting the MainViewController as the delegate of the PeopleManager would break any other code that also wants to be the delegate of the PeopleManager. This is easy to imagine – maybe there is an object that broadcasts to the web whenever the list of people changes, and it needs to get a callback from the PeopleManager as much as our front-end view controller does. Thus, we create a method locally that we want PeopleManager to call whenever its state changes, and we register the specific instance of the MainViewController object, and the method we want called, with the PeopleManager.

Notice that we can definitely use the target-action approach for one-to-one communication as well, and we can even sidestep the Protocol and register any method of an object as the receiver of some callback. From the Dropbox API:

@interface DBRequest
- (id)initWithURLRequest:(NSURLRequest*)request andInformTarget:(id)target selector:(SEL)selector;
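
Usage might look something like this (the request URL and the callback selector here are my own hypothetical names, not from the Dropbox code):

```objc
// Any method of any object can be registered as the callback receiver.
NSURLRequest *urlRequest = [NSURLRequest requestWithURL:someURL];
DBRequest *request = [[DBRequest alloc] initWithURLRequest:urlRequest
                                           andInformTarget:self
                                                  selector:@selector(requestDidFinish:)];
```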

Naturally there are plenty more tricks to getting these things right. Reading code is probably the best way to learn how to structure large programs, but this is a start. Hang your state off your AppDelegate, use dependency injection to make state accessible, set up objects as delegates of the things they spawn, and register objects as targets of events that happen. Access your stateful libraries through singletons and Boom! You’ve got yourself a maintainable, testable iPhone/Cocoa app.

As the experts can probably tell from this post, I am by no means an expert myself, so any feedback is welcomed in the comments!

Nighttime driving

If you google Nighttime Driving you get swamped with lawyers and mothers and policemen all yelling and clawing and shaking, fists and fingers, about how dangerous it is, they’ll save you money, put you in jail. words trying to guilt you or warn you or get your business or screw you over. Can someone please actually go nighttime driving? The woods at night, illuminated by only your headlights, each corner most definitely revealing a magical (nightmarish?) wonderful anxiety of the unknown, quickly blasting through to the next corner, the next slight variation of that same feeling.

It’s 2 in the morning up on skyline at 2, after rain, window open a crack and if you’re lucky you hear coyotes in the distance. Phantoms around the next corner… or the next? or the one after? Too excited to turn back, too scared to keep going, but the car is running well and the road is there for the taking. Let them babble, I’ll throw it all in their faces.

I’m reading Kerouac:

“it comes over me in the form of horror of an eternal condition of sick mortality in me – In me and everyone else – I left completely nude of all poor protective devices like thoughts about life or meditations under trees and the “ultimate” and all that shit, in face the other pitiful devices of making supper or saying “What do I do now next? chop wood?” – I see myself as just doomed, pitiful – An awful realization that I have been fooling myself all my life thinking there was a next thing to do to keep the show going and actually Im just a sick clown and so if everybody else”

What rebuilding an engine (re)taught me about software engineering

Over the last month I have stripped down and completely rebuilt the engine in my Subaru WRX. Throughout the process (and with tons of help from my Dad, NASIOC and Tom Weimer) we went through many iterations of trying something, finding that it didn’t work, and having to backtrack and try again. This single aspect of the rebuild made it an incredibly frustrating experience – having to take apart what you just built sucks, especially since it usually happens because we got the order wrong or forgot one small little piece. Nothing turns around a mood quicker than realizing you were wrong and now you have to not just do it again, but recover from your mess-up before you even get another shot. This same situation comes up a lot in writing code for research.

Good engineering involves building complicated systems by breaking complicated processes into smaller cooperative blocks. That says nothing about the process of actually building a system, though. Just like rebuilding the car, you end up replacing these smaller blocks multiple times as the project goes on. It is frustrating to have to rewrite code multiple times, but you improve it every time you rewrite, and that’s why it’s worth doing. The magic of abstraction is what allows us to have these independent blocks.

I feel like I knew this about software engineering and writing code. It is frustrating and annoying to rip out something you wrote yesterday and redo it, but creating perfection is a long series of small steps and a lot of hard work. It’s much less frustrating when you expect stuff not to fit and to have to be replaced/rewritten, and the instantaneous feedback of doing real mechanical things drove this point home very deeply. I’m excited to get back and write some GPU and compiler code!

reverse-i-search: Quicksilver for Bash

Solomon Boulos just pointed out the most useful bash feature I’ve found so far: reverse-i-search! Like the emacs functionality, it autocompletes the command you’re typing from your history file, in history order.

Access this magic by hitting “Control-R”. Start typing a command and it will autocomplete it for you. Keep hitting Control-R and it will cycle through all possible matches. Incredibly useful! It’s like using ! except you can see what it’s going to execute.

Running CUDA without running X

I want to run CUDA on multiple graphics cards without starting X, since I have a bunch of GPUs sitting in a headless box. Since the GPU driver does not get loaded without X running (or, in Ubuntu, only some of the cards get device nodes), I put together an init.d script that brings up all the NVIDIA GPUs on your Ubuntu 10.04 box:


COMMAND=$1

case "$COMMAND" in
    start|stop|restart)

        if [ "$COMMAND" = "restart" ] || [ "$COMMAND" = "stop" ]; then
            # Count the existing nvidia device nodes (excluding nvidiactl) and remove them
            NVIDIADEV=`ls -l /dev/nvidia* | awk '{if ($9 != "/dev/nvidiactl") a+=1} END {print a}'`
            NDEV=`expr $NVIDIADEV - 1`
            for i in `seq 0 $NDEV`; do
                unlink /dev/nvidia$i
            done
            unlink /dev/nvidiactl
        fi

        if [ "$COMMAND" = "restart" ] || [ "$COMMAND" = "start" ]; then
            modprobe nvidia
            if [ "$?" -eq 0 ]; then
                # Create a device node for every VGA adapter lspci reports
                NVGA=`/usr/bin/lspci | grep VGA | wc -l`
                N=`expr $NVGA - 1`
                for i in `seq 0 $N`; do
                    mknod -m 666 /dev/nvidia$i c 195 $i
                done
                mknod -m 666 /dev/nvidiactl c 195 255
            fi
        fi
        ;;

    *)
        echo "$COMMAND is not supported on this job."
        ;;
esac


Black? No, with milk and sugar. A civilized drink, by all accounts. Sweet, aromatic, only with the finest freshly ground beans. I can taste the difference, I think. At least, I can’t help but wrinkle my nose and suppress the urge to dump the black tar that comes out of the office thermos. So surely, yes, I can taste the difference. Then again, it might just be the circumstances – after midnight anything with caffeine will taste better. Although it’s a fine drink to start your day with, something about coffee makes it a drink of the night. When your desklamp and a monitor is in stark contrast to the blackness outside, the drink somehow feels at home. It’s content to sit in my cup, lazily swirling as I type. In the morning it’s a swallow-and-go experience, almost as if it wants to disappear out of the sunlight. But at night it languishes, a stray cat outside your window – not a friend, but a presence, a fog through which the rest of the world loses shape. Opportunities bristling with anticipation to be pounced on, yet happily waiting until the cup is once again standing between printouts and cables – knowing full well that the cup touching the table starts the inevitable recession of those very opportunities so close just a second before. Nothing to do but pick up the warm ceramic again. It’s not a dependence, that would imply a subjugation of myself – rather, it’s the feeling of an old friend I have a brief chance to confer with. I let life stream by for a little while, conversing, bringing up memories, making plans. Energized after this brief interlude I walk on – no plans to…

I digress, the cup is now empty, the visit is over, and Haskell doesn’t learn itself.

Creating a SVN-Like Centralized Git repository

There’s a simple way to create a central git repository:

git init --bare --shared

This will create a repository that rejects any push that is not a fast-forward (so nobody can rewrite published history), with permissions set up to allow multiple users to access the repo. The config, as of git 1.7, looks like this:

	repositoryformatversion = 0
	filemode = true
	bare = true
	ignorecase = true
	sharedrepository = 1
	denyNonFastforwards = true

Now you can clone this repository easily!

If you want to take a current local repository and create a remote central repository:

First, we clone the local repository into a new, bare repository:

cd tempdir
git clone --bare /path/to/original/repo

Then we set the denyNonFastforwards option and the shared option:

git config receive.denyNonFastforwards true
git config core.sharedrepository 1

We can now copy this new repository to whatever server we want and set all that up. Lastly, we want our original repository to point to this new repository as its primary remote repository. For this we have to set up some remotes and the like:

cd /path/to/original/repository
git remote add origin git+ssh://server/path/to/remote/repo
git config branch.master.remote origin
git config branch.master.merge refs/heads/master

The last two commands set up the local repository to track the remote repository by default, so it will be the repo that responds to git push and git pull.

This blog post was inspired by this, this, this and this. And by my general frustration at not having all this in a little script.
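For what it’s worth, here is that little script as a sketch: a shell function (my own name, make_central_repo) that strings the steps above together. It adds the bare repo as a local path remote; swap in your git+ssh:// URL once the bare repo lives on the server.

```shell
# Turn an existing local repo into an SVN-style central bare repo,
# and point the original repo at it as "origin".
make_central_repo() {
    src=$1    # existing local repository
    dest=$2   # path for the new bare repository

    # clone into a bare repo and set the shared/fast-forward-only options
    git clone --bare "$src" "$dest" &&
    git --git-dir="$dest" config receive.denyNonFastforwards true &&
    git --git-dir="$dest" config core.sharedrepository 1 &&

    # make the original repository track the new central one by default
    git -C "$src" remote add origin "$dest" &&
    git -C "$src" config branch.master.remote origin &&
    git -C "$src" config branch.master.merge refs/heads/master
}
```

Usage: `make_central_repo /path/to/original/repo /srv/git/project.git`.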

New hardware! 32 core box!

Our new server just arrived! Quad socket 32 core box, dope shit! Fresh from Intel, you can hardly buy this kind of hardware.

Haskell Idioms (Part 1?)

I’m taking Stanford’s Programming Languages class, and I’m back in the happy fuzzy world of functional programming that started my computer science education back at Berkeley with Prof. Brian Harvey’s CS61A. Except this time around we’re doing it in Haskell!

Haskell, so far, has a couple of cool features that have jumped out at me. (Features is not really the right word; rather, design decisions?):

  • Lazy Evaluation – this is the big, obvious one that really influences the idioms you use to write code
  • Pattern Matching that’s so ridiculous that if I hadn’t been writing Scala I would be completely lost; as it is, I’m only 99% mind-blown.
  • Cutesy math syntax (for example, list comprehensions look just like math notation: [x * 2 | x <- originalList])
  • Multi-branch functions. AKA, you can define multiple versions of a function (like overloading) and the correct one will be picked. Sounds like overloading, but it blows your mind combined with laziness, of which I’ll give an example in a second.
  • Functions are either curried or pattern-matched. So, pretty much all my functions can be partially applied.
  • Convert prefix functions to infix using backticks (`) and infix operators to prefix using parentheses (). Neato.
  • Type Inference. I’m a huge believer in static typing with type inference these days. Eating cake while having cake is the best.
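A tiny illustration of the currying and infix/prefix points, using a toy function of my own:

```haskell
-- Every Haskell function is curried, so partial application falls out for free.
add :: Int -> Int -> Int
add x y = x + y

addFive :: Int -> Int
addFive = add 5          -- partially applied: still waiting for its second argument

-- Backticks turn a prefix function into an infix operator, and parentheses
-- turn an infix operator into a prefix function:
--   2 `add` 3   is the same as   add 2 3
--   (+) 2 3     is the same as   2 + 3
```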

OK, that was a boring lame list of stuff that I’ll probably skip whenever I read this blog post. Notice how I blog to remember stuff.

Multi-Branch function example:

listLength [] = 0
listLength (x:xs) = 1 + listLength xs

That just looks cool. Straight out of a math textbook.

Helper Functions Idiom:

Haskell (obviously) has nested functions, so you can write functions using local helper functions and accumulators inside the function’s scope. For example, you can write reverse in the “simple” functional manner like so:

reverse [] = []
reverse (x:xs) = (reverse xs) ++ [x]

Using, naturally, the cons operator (:), the concatenate operator (++) and the list syntax []. This is an inefficient algorithm: each ++ walks the entire accumulated list again, making the whole thing quadratic in the length of the list. Or, you can write reverse walking the list only once, using local scope and a helper function:

reverse xs =
  let rev( [], z) = z
      rev( y:ys, z) = rev( ys, y:z )
  in rev( xs, [] )

That’s a whole lot in only 4 lines, but it only sweeps the list once to reverse it. As you can (kinda) tell, it keeps cons-sing (: operator) the first element of the rest of the list to the first element of the new list being built. That is, it puts the “next” element as the “first” element. This is how you reverse a list, if you think about it for a bit. Cool huh!

Multi-Branch Function Idiom using Laziness

Since the compiler picks the function branch that pattern matches (including type matches) what you’re calling the function on, you can write plenty of code using a haskell idiom that would epic-fail in other languages.

The idiom consists of writing a main function that generates (lazily) all possible points in the solution space and exhaustively searches through it, and writing a set of small “filter” branches of this function that guides the search to be efficient. The laziness prevents tons of temporaries, the branches guides the search.

For example, consider substring matching. You can say that finding whether a string is a substring of another string is the same as generating all suffixes of the string and checking whether your string is a prefix on any of these suffixes. Let’s just write that:

import Data.List (isPrefixOf)

x `isSubString` s = or [x `isPrefixOf` t | t <- suffixes s]
suffixes [] = [""]
suffixes (x:xs) = (x:xs) : suffixes xs

mind blown! zomg! So, on the first line we say x is a substring of s if x `isPrefixOf` t evaluates to true for any t in the suffixes of s. Then we define the suffixes of the empty list as just the empty string, and the suffixes of a non-empty list as that list, cons’ed onto the suffixes of everything but the first character of the input list. And that’s all you need: it won’t generate all the options up front, it’ll lazily evaluate things, pattern-matching the shape of the list to the branch of the function, going nuts.

Define your own conditionals!

Since Haskell is lazy, you can write your own if statement or boolean operator and it’s just like the real thing! In fact, you can write your own eval function in about 10 lines.
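For instance, a hand-rolled if (a sketch; myIf is my own name for it) behaves exactly like the built-in one, because the untaken branch is never evaluated:

```haskell
-- Because arguments are evaluated lazily, only the chosen branch ever runs.
myIf :: Bool -> a -> a -> a
myIf True  t _ = t
myIf False _ e = e
```

Even `myIf True 1 (error "never evaluated")` happily returns 1; in a strict language the error branch would blow up first.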

Infinite data structures!

Of course, this just comes with laziness. Just implement prime number searching or whatever.
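The canonical toy example is an infinite list of primes via a lazy sieve (elegant, though famously not an efficient sieve):

```haskell
-- An infinite list of primes; only as much of it as you demand is ever computed.
primes :: [Int]
primes = sieve [2..]
  where sieve (p:xs) = p : sieve [x | x <- xs, x `mod` p /= 0]
```

`take 5 primes` gives [2,3,5,7,11], and the rest of the infinite list simply never gets built.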

Lots of other cool stuff in here.

GTC 2010 Trip Report

I spent the week of September 20th through 23rd at NVIDIA’s GPU Technology Conference, and I’m now reporting back on some of the events that particularly interested me. I went to a whole range of talks, varying wildly in subject and quality, so I will gracefully ignore some of the talks while focusing on events that relate to my research or particularly struck my fancy.

As is usually the case with these conferences, I find that the most valuable experience is meeting the community of researchers from around the world. Talks, papers and the like you can read by yourself, but the conversation and networking at these conferences make the trip absolutely worth it. GTC was no different, and I connected with several people that I aim to stay in touch with as research colleagues. This also seems to be an apt place to start my trip report! So, I met the following people during the week:

  • Tobias Brandvik (Cambridge), doing stencil abstractions for fluid flow solvers on multi-GPU clusters
  • Bryan Catanzaro (UC Berkeley), building Copperhead, a data-parallel python derivative that completely abstracts away the GPU.
  • Vasily Volkov (UC Berkeley), working on low-level performance tuning of GPUs and breaking down the CUDA abstractions.
  • Jason Cohen (NVIDIA), part of the Parallel NSight debugger team (Rick Shane from Adobe introduced us)
  • Nathan Bell (NVIDIA), the other half of the Thrust developer team (Jared Hoberock introduced us)

I will now attempt to distill the most interesting talks I attended to their core observation, and any notes I found especially interesting.

Opening Keynote

Jen-Hsun Huang, NVIDIA CEO

Jen-Hsun was very much pushing the importance of parallel programming, reiterating the arguments about the power wall the industry has hit, and the fact that NVIDIA has been building highly parallel hardware for years now. The Senior Vice President of Content and Technology, Tony Tamasi, showed off several demos of Fermi’s tessellation capability (an endless city, procedurally generated and tessellated on the GPU to give the equivalent of a 100-billion-triangle scene). He moved on to the physics simulation capabilities of these GPUs by showing a re-imagination of Ron Fedkiw’s lighthouse scene running in real time. A multi-grid height field combined with particles gives real-time water, while flotsam is simulated as rigid bodies. All three simulations are coupled, and run in real time. Although it still looks simulated, it’s definitely ahead of today’s games.

The big statistic here for developers is the rate of CUDA adoption. NVIDIA very much pushes the idea that they have 100 million GPUs out in the field, all of which can run CUDA programs. The reality of the situation is, naturally, not nearly this good, but it’s a nice statistic to have. The Folding@Home and SETI@Home people report statistics massively skewed towards people running massively parallel processors, so there’s surely some truth to these numbers.

NVIDIA announced CUDA-x86, a new compiler from PGI that allows compilation from CUDA code to x86 code, allowing developers to write programs that run on multicore processors or throughput-oriented GPUs. In my mind this is just a nice-to-have, since none of the serious optimizations you do for the GPU (think coalesced memory accesses, specific thread groupings to exploit vector lanes and vector lane synchronization) will carry over to x86, and might even hurt performance (GPU-focused optimizations causing cache misses). Still, the write-once-run-anywhere dream is clearly very important, which is great for the research I’m working on.

Several other impressive demos were also shown: Dr. Black’s beating-heart surgery that tracks a heart in real time to make incisions with a robotic scalpel, Adobe’s David Salesin showing off refocusing using plenoptic lenses (originally done by Ren Ng at Stanford), and the iRay photorealistic ray tracer running on 64 Fermis, rendering to your web browser at interactive rates. Clearly graphics has lots of evolution left as it enters the world of massively distributed computing.

Lastly, NVIDIA announced the codenames of their next two chips, Kepler and Maxwell, which will aim for 3 times and 10 times the performance per watt of today’s Fermi.

A Fast, Scalable High-Order Unstructured Compressible Flow Solver

David Williams & Patrice Castonguay (Stanford)

I was curious to find out how this group built their flow solver to run on a GPU cluster. This is an example of what we’d like Liszt (our research language) to be able to do, so seeing a hand-written version was profitable. They followed the same ideas as are generally used with MPI: partition your unstructured mesh, create ghost cells for the data you want to share across partition boundaries, and place a partition on each machine. They implemented their algorithm using a gather approach: the GPU performs two stages of work, a first stage to calculate cell-based values, and a second stage to reduce these values to edge-based values. The synchronization between these two stages includes the MPI all-to-all step to resolve ghost cell values.

Since they wrote a specific instance of a RANS algorithm, they did not do any scheduling work or fine-grain synchronization, their two-stage gather was enough to run the algorithm. They were getting good linear speedups on their cluster, and managed to achieve a sustained 1.3 Teraflops on a cluster of 16 GPUs using a mesh of 320 000 cells.

New Programming Tools for GPU Computing

Sequoia, Copperhead, GMAC, Thrust

Unfortunately, panel discussions with 3-minute introductions for each project are never enough to really understand any of the projects. The most striking part of this panel was the obvious programming-language direction researchers have taken. Except for Thrust (although it can be considered an embedded domain-specific language), all the work has a programming language spin on it. The major concern of the audience was clearly the support issues and feature-scarcity of new programming languages, which the different projects address differently – Sequoia tries to be a runtime more than a full language, Copperhead attempts to be deeply coupled to Python, Thrust passes itself off as a library, and GMAC aims to be language-agnostic, creating a universal address space between accelerators (GPUs) and processors that any language can take advantage of.

PyCUDA (2041)

Andreas Klöckner

Andreas’ PyCUDA talk was mostly an introduction to PyCUDA and a brief overview of how it works and the motivation behind it. I found this talk especially interesting, since he took an approach very similar to the way web frameworks use templates to generate web pages. Kernels in PyCUDA are strings of text, with embedded Python variables that are replaced when you ask his framework to compile the kernel. He built this JITting engine as an extension of Python, allowing you to write kernels at runtime and pass them off to the nvcc compiler. I liked the fairly low-level control he lets you achieve from inside Python, but PyCUDA does not attempt to abstract away CUDA or the GPU. It is, rather, very similar in spirit to the Boost Python bindings – allowing you to build software in Python, rewrite the slow parts in C (or CUDA), and call these low-level functions directly from Python. PyCUDA has the added benefit that you do not even need to leave the Python interpreter. His whole approach was fascinating, especially since this is what I would have done were I faced with a similar problem, given my web framework experience. Andreas likens this to the LISP-style metaprogramming that’s been around since the 60s – manipulating string kernels, “pasting” in values on the fly.

PyCUDA in general is built to interface tightly with numpy and scipy, two Python packages that supply matlab-like functionality to Python users. PyCUDA does not attempt to address the type inference issue of moving from a dynamically typed language to a statically typed one, since it depends on the user to write kernels with the correct types, and on numpy to supply runtime types of the multi-dimensional arrays that PyCUDA works with. Copperhead, Bryan Catanzaro’s data-parallel version of Python, abstracts away the GPU entirely, so it has to deal with type inference, and he built a Hindley-Milner style type inference system into Python to handle this. Copperhead is built on top of PyCUDA, so he uses the JITting capabilities of PyCUDA to get to the GPU – a great decision in my mind, since someone else is now responsible for the low-level details of catching errors and generating kernels.

Better Performance at Lower Occupancy (2238, Wed 15:00)

Vasily Volkov

(Slides here) Vasily has published several papers on understanding GPU hardware and tuning codes for the GPU, and his talk challenged the focus on massive multi-threading in GPU apps, showing that instruction-level parallelism is still a very important approach on the GPU. In the process of demonstrating this, he disproved several of NVIDIA’s claims in their Programming Guide. This talk was very interesting to me, since it addressed many of the low-level architectural questions that Kayvon, Solomon, Zach, Jorge and I have discussed in detail. The word “occupancy” in this talk refers to the percentage of threads spawned out of the total number supported on the multiprocessor.

The general recommendation for hiding latencies is to use more threads per block and more threads per multiprocessor. Vasily demonstrates that faster codes tend to run at lower occupancies, citing as examples the differences between successive versions of CUBLAS and CUFFT – every performance improvement came with a lowering of threads per block. Vasily shows in the talk how to hide arithmetic latency and memory latency using fewer threads, gaining a net performance increase. He also attacks the fallacy that shared memory is as fast as the register file, addressing the bandwidth differences between the two.

The heart of his talk is the fact that Streaming Multiprocessors are still pipelined machines, regardless of the multi-threaded, wide-vector-lane nature of these processors. By writing your code as sets of independent operations, keeping data dependencies to a minimum and structuring code to keep the pipeline filled, you can get massive performance regardless of the machine’s occupancy. He shows the roofline model for a simple SAXPY code, and how he can influence the memory-bound part of the model by doing multiple independent SAXPY operations in a single thread (since all but one of the input values are the same in a SAXPY operation). He goes on to show that he can get 87% of the peak bandwidth available to a multiprocessor at only 8% occupancy (while cudaMemcpy achieves only 71% of peak). Lastly he makes the point that the banked nature of shared memory makes it impossible for shared memory codes to achieve full bandwidth utilization. This leads to the recommendation to use as few threads as possible, each using as many registers as possible.

The attention to detail in this talk, as Vasily played with the limits of the GPU, allowed him to break down several of the “nice” abstractions CUDA provides.

Large-Scale Gas Turbine Simulations on GPU Clusters (2118, Wed 16:00)

Tobias Brandvik

Tobias and his group at Cambridge have addressed a problem very similar to the one we tackle with Liszt. They want to write a mesh-based simulation system that runs on today’s high performance machines, without having to rewrite the code for each architecture. Specifically, they are building a production-quality solver for use in the aerospace industry. Their approach has many overlaps with Liszt, targeting GPU clusters while avoiding the need to rewrite all their code for every variation of today’s heterogeneous machines. In contrast to our approach, they work at a fairly low level, since they only attempt to rewrite the roughly 10% of their code base that consists of stencil-based calculations. A stencil is defined as the mesh accesses and data reads/writes a kernel will perform, which allows them to generate specialized CUDA source code for each stencil in their application. This 10% of the code is responsible for roughly 90% of the run time, and can be abstracted as math kernels running specific stencils across a mesh. The cluster aspect of their code is still hand-written MPI code, but rather than write GPU- or SMP-specific codes that run on an MPI node, they use these stencils and math kernels.
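The flavor of this kind of specialization can be sketched in a few lines of Python. This is a hypothetical toy format, not TurboStream’s actual stencil language (which is embedded in Fortran and far more involved) – the point is just that a stencil definition, here a list of neighbor offsets, is enough to generate a specialized kernel:

```python
# Toy illustration of stencil-driven code generation: the stencil
# definition lists the neighbor offsets a kernel reads, and we unroll
# them into a specialized CUDA-style kernel body.
def generate_stencil_kernel(name, offsets, coeff):
    halo_lo = max(-min(offsets), 0)   # cells needed to the left
    halo_hi = max(max(offsets), 0)    # cells needed to the right
    reads = " + ".join("in[i + (%d)]" % off for off in offsets)
    return (
        "__global__ void %s(const float *in, float *out, int n)\n"
        "{\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i >= %d && i < n - %d)\n"
        "        out[i] = %ff * (%s);\n"
        "}\n" % (name, halo_lo, halo_hi, coeff, reads)
    )

# A three-point smoothing stencil generates one specialized kernel:
src = generate_stencil_kernel("smooth3", [-1, 0, 1], 1.0 / 3.0)
```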

In terms of domain-specific optimizations, he referred to the 2008 Supercomputing paper by Datta et al. that showed a set of optimizations for running stencil codes at high performance on GPU devices. They attempted to implement these optimizations as part of their source-to-source compilation process for kernels.

Their approach requires the programmer to hand-write the stencil and the math kernel, which allowed them to embed this stencil language partially inside Fortran. They then took their current simulation system (TBLOCK, approximately 40kloc of Fortran) and factored out the stencil-based calculations into separate stencil definitions and math kernels. This let them keep most of their current simulation code while spending “a couple of months” (Tobias’ words) rewriting the calculations that fit this stencil scheme into their embedded language, with accompanying stencil definition files. Their system, called TurboStream, has on the order of 15 different stencils in it, with 3000 different stencil definitions, and they run simulations on a 64-GPU cluster at the University of Cambridge.

Tobias made an interesting comment during the talk, saying that their biggest concern is not pure speed, since their solvers are relatively simple, but being able to handle much more data than they currently do – this was their biggest motivation for moving to GPU clusters. By way of example, he showed the fine detail of turbofan fins spinning through slits cut in the housing of a jet engine, and the fine-grained simulation details around these slits – geometry that was previously ignored.

Overall Impressions

The biggest gain of the conference was networking with several other researchers, and getting an overall view of the field as several groups attempt to solve the same problem: how do we write codes that can run on all of today’s hardware choices?

I find myself using OpenTerminal a lot – mostly to open a terminal in a directory, followed by “mate .” to open a TextMate project there. This quickly becomes annoying, so after looking into AppleScript, I took the plunge and wrote my first AppleScript. What a weird language. Anyway, you can dump this into AppleScript Editor, and when you run it, it opens a TextMate project for the front-most Finder window:

on run
	tell application "Finder"
		try
			set frontWin to folder of front window as string
			set frontWinPath to (get POSIX path of frontWin)
			tell application "TextMate"
				open frontWinPath
			end tell
		on error error_message
			display dialog error_message buttons {"OK"} default button 1
		end try
	end tell
end run

Save this as an Application (not a Script), and drag it onto your Finder toolbar. Voila! TextMate at your fingertips.

Thank you, Mac OS X Hints, from where I got the pattern to do this.

Comparing floating point numbers for equality.

Everyone who’s taken an architecture course (or messed around with scientific computing) knows that floating-point arithmetic is neither associative nor distributive. That means, mathematically:

a*(b+c) \neq a*b+a*c

Or, in layman’s terms:

The order of operations influences the result of the calculation

This implies that floating point calculations that mathematically give the same answer do not necessarily produce exactly the same floating point number. So, when comparing two floating point results, using a == b will not give the correct result.
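A quick Python example makes the point (any language with IEEE 754 doubles behaves the same way):

```python
# Floating point addition is not associative: regrouping the same
# three terms changes the rounding, so == sees two different numbers.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False
print(left, right)     # 0.6000000000000001 0.6
```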

You can attempt to remedy this by using the mathematical approach of allowing an absolute error metric:

|a-b| < E

which does not account for the fact that floating point numbers are unequally distributed over the real number line. We can attempt to use a relative error metric:

\frac{|a-b|}{|b|} < E

but this does not take into account the difference between very small positive and negative numbers (including positive and negative zero, since floats have both).

So, from the very enlightening “comparing floats” article by Bruce Dawson, we try something quite different.

Floats can be lexicographically ordered if you consider their bit patterns to be signed-magnitude integers. We can exploit this fact to calculate exactly how many representable floating point numbers there are between two floats. So, for example, we can find that there is only one floating point number between 9,999.99999 and 10,000.00001, and use an error metric that states “I will consider floats to be equal if they are within E representable floats of each other.”
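The bit-reinterpretation trick is easy to experiment with from Python using the struct module. This is just a sketch of the idea for 32-bit floats, not the article’s code:

```python
import struct

def ordered_int(f):
    """Reinterpret a 32-bit float's bits as a signed int, then remap
    negative values so all floats are lexicographically ordered."""
    (i,) = struct.unpack("<i", struct.pack("<f", f))
    if i < 0:
        # Same remapping as the C code's 0x80000000 - i; it also maps
        # -0.0 and +0.0 to the same integer.
        i = -2147483648 - i
    return i

def ulps_between(a, b):
    """Count the representable 32-bit floats between a and b."""
    return abs(ordered_int(a) - ordered_int(b))

print(ulps_between(1.0, 1.0 + 2**-23))  # adjacent floats: prints 1
print(ulps_between(0.0, -0.0))          # same value: prints 0
```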

The details of this routine are in the comparing floats article, but I will mirror the code here:

// Usable AlmostEqual function
#include <cassert>  // assert
#include <cstdlib>  // abs

bool AlmostEqual2sComplement(float A, float B, int maxUlps)
{
    // Make sure maxUlps is non-negative and small enough that the
    // default NAN won't compare as equal to anything.
    assert(maxUlps > 0 && maxUlps < 4 * 1024 * 1024);
    int aInt = *(int*)&A;
    // Make aInt lexicographically ordered as a twos-complement int
    if (aInt < 0)
        aInt = 0x80000000 - aInt;
    // Make bInt lexicographically ordered as a twos-complement int
    int bInt = *(int*)&B;
    if (bInt < 0)
        bInt = 0x80000000 - bInt;
    int intDiff = abs(aInt - bInt);
    if (intDiff <= maxUlps)
        return true;
    return false;
}


This has saved me countless headaches comparing CUDA- and CPU-generated results for our CUDA programming class, CS193G.

Compiling the pbrt 1.04 raytracer on Mac OS X 10.6

I’m taking Prof. Pat Hanrahan’s CS348B “Advanced Rendering” course this quarter, and we’re extending the pbrt renderer as part of the course assignments. It’s probably worth documenting how I compiled this on my Snow Leopard machine.

After downloading and extracting pbrt 1.04 from the pbrt downloads page, I had to install OpenEXR using MacPorts:

sudo port install OpenEXR

MacPorts installs libraries like this one in /opt/local/ to prevent conflicts with libraries from other sources (it also keeps a handy pkgconfig directory for each library under /opt/local/var/macports/software/.../lib/ that is full of info). We need to update pbrt’s Makefile to point here, by modifying lines 13 and 14 of the Makefile to read:


You should now be able to run make in the pbrt directory and produce the pbrt binary. Remember, you need Xcode installed!

Now you need to set the PBRT_SEARCHPATH environment variable. I did this the easy way: I cd’d to the pbrt bin directory and ran:

export PBRT_SEARCHPATH="`pwd`"

Installing Numpy and SciPy on Snow Leopard – the easy way

Just a quick note on the easiest way I’ve found to install NumPy and SciPy on Snow Leopard. It can be quite a pain, since you have to leave the Apple Python install alone – even though it includes NumPy by default, it does not support SciPy, and it’s not the latest version anyway.

So, don’t try to build from SVN and all that fancy stuff. Just do:

1) Download the latest Python from
2) Download the numpy dmg from sourceforge
3) Download the scipy dmg from sourceforge

Install in that order and move on.