Scrolling Sites on iOS – mobile apps requires something other than HTML/CSS

UPDATE: My fix for this issue was using WebKit’s translate3d transform to sidestep the performance issues (for the most part). Here’s a nice performance comparison by Paul Irish. There are three hacks to make it work:

1) You have to use translate3d, not the 2D translate(), for the iPad’s hardware acceleration to kick in.

2) You have to set -webkit-backface-visibility to hidden if your elements are larger than the screen.

3) You want to set up a default -webkit-transform: translate3d(0,0,0); inside your CSS to prevent a jerk on first touch.

Here’s my code that implements this, while unfortunately disabling momentum for scrolls.
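The gist of it looks something like this (a minimal sketch, not my full scroller – the class name is made up for illustration):

.scroll-layer {
  /* made-up class name: applied to every layer that gets moved during the scroll */
  /* hack 3: a no-op 3D transform up front, so the layer is already composited
     and there's no jerk on the first touch */
  -webkit-transform: translate3d(0, 0, 0);
  /* hack 2: needed once the element is larger than the screen */
  -webkit-backface-visibility: hidden;
}

and then the JS side applies every position update as a 3D transform (hack 1) instead of touching top, e.g. layer.style.webkitTransform = 'translate3d(0,' + y + 'px,0)';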


HTML/CSS is arguably the most successful domain specific language we have – it makes it dead simple to write applications containing graphical elements. I claim that it’s an instance of using a high level language (CSS and HTML) to get high performance programs (web apps) by embedding domain specific knowledge into the runtime (document layout, text rendering, interaction events).

Now consider the issues of doing a scrolling narrative with pinned elements & parallax scrolling.

The HTML/CSS abstraction is not a good fit for the layout & positioning requirements of this, which I will try to demonstrate. As you scroll along, you have imperative JS code tracking your scroll, and this fires off changes to the declarative CSS language.

Part of the annoyance is the difference in expression – three languages are involved in this process – and the fact that crossing these barriers has performance implications, especially on mobile (this claim is worth testing).

On mobile, here’s the event process:
- the user starts scrolling (1)
- all JS execution is paused
- the user stops scrolling
- JS execution is resumed, receives a scroll event, checks where the page is
- if the page has scrolled past a predetermined point (this in itself is hairy to calculate) we now change the css for a div from position:relative to position:fixed (2)
- the css takes a moment to reconfigure and redraw the display
- the image jumps to a new location since the screen has moved in the meantime
- now as we scroll, the image stays at a fixed location on screen, moving relative to the document (3)
- the user keeps scrolling, past the bottom of the element’s pinned region.
- the css for the div is now updated to a new position

Naturally a big issue here is iOS pausing JS execution, but even without that you get shuddery, twitchy behavior: the high-level API that is supposed to give you higher performance by embedding domain-specific information into the runtime is being misused, because it has no way of expressing these constraints.

Potential solutions

Going off the observation that we cannot express these constraints in CSS, perhaps we should be able to express a constraint-based CSS rule for this. Something along the lines of “position of image div is equal to top of text if viewport top is higher, or top of viewport, or bottom of text if viewport is lower”.
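Today the only way to express that rule is imperatively, on every scroll event, with something like this sketch (the element lookups and units are hand-waved):

// pin the image inside its text block as the user scrolls
function pinImage(image, textTop, textBottom) {
  var viewportTop = window.pageYOffset;
  var imageHeight = image.offsetHeight;
  // follow the viewport, but never move above the top or below the bottom of the text
  var top = Math.max(textTop, Math.min(viewportTop, textBottom - imageHeight));
  image.style.top = top + 'px';
}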

This raises several concerns for a constraint-based approach: these are not linear or quadratic constraints; they are turned on and off. That is, as the location of the viewport changes, the constraint on the image div changes – initially it is in line with the top of the text, then it becomes equal to the position of the viewport on the document, then it becomes equal to the bottom of the text block.

This means we would have to construct a function that, given the current viewport location, solves the active constraints and determines the locations of elements. I believe this would be challenging to do well, and warrants further exploration. This is also the same problem that comes up in automatic document layout using templates! Greg Badros and Alan Borning’s Constraint Cascading Style Sheets for the Web, and the work done on Adaptive Grid-Based Document Layout, had to solve similar problems (either by prioritizing different constraints or by discretizing the problem and using a big offline dynamic-programming solver). Perhaps someone has addressed this issue in the constraint literature already, and it’s just waiting for us to use it?

A different direction of thinking would be to get rid of the declarative CSS/HTML manner of expressing this. What would be a more appropriate low-level approach? We can find inspiration here in thinking about how this same behavior would be implemented in a native iOS app. For the moment I’d have to leave this for future pondering.

Conclusion

I’m becoming more convinced that, rather than piling features into HTML and CSS, we can come up with a lower-level abstraction that has many of the same benefits (easily expressible graphical applications) but with an API matched to the applications we’re building rather than to static document layout. The declarative nature of CSS makes it more challenging to build this as an extensible system, but I think the gap between “mobile webapps” and “native apps” can be shrunk if we reimagine CSS. It’ll make my work a lot easier!

A conversation about Quality in Manufacturing, and American Made

My correspondence with a friend and fellow compatriot of the motorcycle society seemed interesting enough to warrant a blog post. Forgive any appearance of self-indulgence this might conjure up; that is not the intent. Rather, I hope to shed some light on a difficult issue we have been grappling with. In a society built around disposability, what is Quality and how do we sell it?

This was written in response to Quality is Back In Demand, and the work that These Men Are Professionals are doing to produce Quality and Unique wares.

Pags,

It’s with great sincerity that I write to you about my deep-seated connection to the words “American Made”. It’s difficult to even type it without the appearance of a lump in my throat, and a feeling of great ability rising to meet great expectations – early dawn over an unexplored horizon.

My first real experience with the concept of Quality in Manufacturing came from a Maglite flashlight I received as a birthday gift at a young and impressionable age I now fail to recall the exact date of. The box proudly displayed the American flag in the bottom right corner, capped with ‘Made in the United States of America’. It had a sense of dignified confidence about itself – there was no explanation of why this sentence had any significance, yet its presence was anything but frivolous. I was too young to understand why the emblem mattered, but my Grandfather, whom I deeply respect and admire, pointed it out to me and said “Everything made in America will last you forever. This is the yardstick by which quality is measured.” Grandpa did not use such language lightly, and the sleek black metal flashlight’s weight in my hand cemented the imagery he conjured of Serious Men making Serious Tools for Serious Jobs. A Fraternity I suddenly wanted to be part of.

I still have that flashlight, many years later. It’s no longer as bright and efficient as the new LED ones (of which I own several), but every scratch tells a story – the time I dropped it on the pavement while camping, trying to replace the lightbulb and having to crawl around feeling for the screw cap in the dark – and I wouldn’t want it any other way.

In the years since then, I’ve grown to see that (unfortunately?) not all American Manufacturing is the pinnacle of quality and perfection. There’s plenty of subpar, even terrible manufacturing happening right here in the US, and not just in the last three decades of rising pressure from China and a growing disposable-minded society. The American invention of the paperback novel, as an example, goes back to the 1950s, and was centered around extremely cheaply printed books for mass circulation, where pages start falling out after only a couple of reads. Who cares, it was 99 cents! If you want quality paper printing, talk to the Japanese!

But I digress, let’s get to the real conversation – not about dreams of days gone past, where perhaps the naivety of youth and the sepia tones of time distort our recollections in an overly fond direction. Let’s talk about today.

Any lofty conversation about manufacturing Quality Products is in danger of quickly becoming either very preachy (“You need to stop buying trash and start buying quality” – stop telling me how to live my life), very insincere (“buy now and we give you a free 30 day money-back guarantee!” – too many late night ads), or very whiney (“the past was so much better yadda yadda yadda”). Each of these alienates people, while the real message is one of inclusion and hope.

When it comes to your blog post, since you asked about it, I think you’re doing a good job treading the fine line between those, although I do caution you to be careful! Write with inclusion in mind. Take us all along on the journey and keep sharing your excitement for it. Those that differ from you are not enemies, they’re potential customers and disciples waiting to be converted.

A couple of observations around selling Quality: Amazon (or any other big online store) gives me absolutely nothing to distinguish between the quality of different products. Price should not be the only representation of Quality, but when I buy books off the internet, I have no idea that the $20 hardcover has heavy, bright paper and perfect margins while the $12 paperback is printed on what couldn’t even serve as toilet rolls. All I see is two identical front covers and an $8 price difference.

Selling Quality online thus demands that you do not put it into direct comparison with the equivalent low-quality product, or at least not initially. If you want to sell the Greatest Hoodie Ever, you create an environment for it where it can stand on its own. Like American Giant with their Hoodie.

You’re already doing this, which I applaud, but I wanted to point it out anyway.

This conversation can be one of Rejecting Others’ Reality and Substituting Your Own, but it can also be one of Building on the Shoulders of Giants. The international supply chain gives us the machines we’re typing this on, at a price we can afford, while keeping up with the incredible scaling that transistors have undergone over the last 4 decades. Maybe we took it too far (quite likely we did!), but let’s consider what we have to build on top of now!

This is a long rambling email, which I apologize for – I’m overworked, underslept and stressed – but I’m excited by what you’re doing and wanted to let you know I believe in this, or at least my take on it.

Wifi on the Raspberry Pi using EDIMAX EW-7811Un and the 2012-12-16 wheezy raspbian image

The EDIMAX EW-7811Un is probably the most popular wifi adapter for the Raspberry Pi. It’s super tiny, doesn’t draw much power, and works (almost) out of the box with the latest Debian images (as of January 2013).

This, of course, doesn’t mean it’s easy to configure. I started off with this guide http://www.savagehomeautomation.com/projects/raspberry-pi-installing-the-edimax-ew-7811un-usb-wifi-adapte.html

But I have my own wpa_supplicant configuration.

Here is /etc/network/interfaces:

allow-hotplug wlan0
iface wlan0 inet manual
wpa-roam /etc/wpa_supplicant/wpa_supplicant.conf
iface default inet dhcp

And here is /etc/wpa_supplicant/wpa_supplicant.conf
This is for WPA and TKIP configuration since I could NOT get WPA2 to work.

ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
network={
ssid="JOUB"
psk="MY_SECRET_PLAINTEXT_PASSWD"
proto=WPA
key_mgmt=WPA-PSK
pairwise=TKIP
group=TKIP
auth_alg=OPEN
}
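With both files in place, I bounce the interface and check that it got an address (a quick sanity check, assuming the adapter shows up as wlan0):

sudo ifdown wlan0
sudo ifup wlan0
ifconfig wlan0    # look for an "inet addr" line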

If you’re seeing a message saying “WPA: IE in 3/4 msg does not match with IE in Beacon/ProbeResp”, you have an authentication mismatch between WPA and WPA2. Be careful, you want exactly ONE of the following:

WPA-PSK:

network={
ssid="my_network"
proto=WPA
key_mgmt=WPA-PSK
pairwise=TKIP
group=TKIP
psk="secret_password"
}

WPA2-Personal:

network={
ssid="my_network"
proto=RSN
key_mgmt=WPA-PSK
pairwise=CCMP TKIP
group=CCMP TKIP
psk="secret_password"
}

Quotes

“To give anything less than your best is to sacrifice the Gift” – Steve Prefontaine

“Cool is conservative fear dressed in black” – Bruce Mau

“He who knows only his own generation remains forever a child.”

A Partial Review of Object Oriented Javascript

Motorcycle shops and parts places for vintage japanese bikes in the Bay Area

So far I’ve had great success with or good recommendations for the following places for service and parts for my Kawasaki KZ650-B2:

  • O’Hanlon Motorcycles for top notch service, advice and parts. Under $100 per hour!
  • Hayasa Motorcycles in Oakland
  • Golden Gate Cycles for parts
  • Werkstatt if you have to, they have parts and stuff but they’re not very nice.
  • Berkeley Performance Motorcycles & Service
  • Hayward Cycle Salvage if you’re okay with digging through boxes
  • Fremont Cycle Salvage, again, if you’re okay with digging through boxes for second hand parts.

Of course there are internet retailers as well, but these are the places to call if you need parts right away.

Pat’s paper-writing formula.

My advisor, Pat Hanrahan, just stated his paper-writing formula. Since he’s a God in the graphics world, I’m going to jot this down for future reference:

  • Motivate the problem
  • State the challenge
  • Describe your approach
  • State your contributions

His golden rule for writing good science papers is: be precise, be complete.

The Ideal TextMate Setup.

I’d like to present my TextMate setup as my ideal coding environment:

I’ve only recently discovered that the real secret to TextMate is having the right bundles. I’ve got TextMate configured with the following bundles:

Then there’s a set of utilities that aren’t part of TextMate, but that I primarily use alongside it:

  • fikovnik’s ShiftIt to move TextMate & Terminal windows around using keyboard shortcuts
  • Divvy for accurate placement of my windows to get 100% screen space utilization.
  • git for source code tracking, it’s unbeatable.
  • rsync for data files (especially those multi-gigabyte binary meshes).
  • git hooks into rsync, so it automatically rsync’s when I push.

X-Macros

This blew my mind and made my eyes bleed at the same time. My convoluted version of X-Macros:

#include <stdio.h>

char* foo() { return "foo"; }
char* bar() { return "bar"; }
void baz(char *z, char* a, char* b) { printf("BAZ: %s, %s\n", a, b); }  /* note: z is unused */

/* expands to ",arg", so a list of entries can be pasted into a call as arguments */
#define ARGS(_) ,_

/* expands the X-macro list twice: once to print each entry, once as arguments to baz */
#define KERNEL_CALL(a) \
    KERNEL_ARGS(PRINTF_MACRO); \
    baz("boo" KERNEL_ARGS(ARGS)); \
    printf("Now we print %s\n", a);

int main() {
#define KERNEL_ARGS(_) _(foo()) _(bar())
#define PRINTF_MACRO(_) printf("I am printing %s\n",_);
    KERNEL_CALL("done");
#undef KERNEL_ARGS
#undef PRINTF_MACRO
    return 0;
}

The biggest problem with this approach is the opaqueness it introduces into your source. You can’t grep for the results of the preprocessor, so if you use X-macros to create functions by concatenating arguments together (which is exactly what they’re good at) then grepping turns up nothing!
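For example (a made-up snippet, not part of the code above), using token pasting to generate accessors means the generated names never appear literally in the source:

#include <stdio.h>

/* one list of fields, expanded many times */
#define FIELDS(_) _(width) _(height)

/* generates int get_width(void) and int get_height(void) via token pasting */
#define DEFINE_GETTER(name) int get_##name(void) { return 42; }
FIELDS(DEFINE_GETTER)
#undef DEFINE_GETTER

int main() {
    /* grepping for "get_width" finds this call site, but never the definition */
    printf("%d %d\n", get_width(), get_height());
    return 0;
}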

Design Principles for Structuring iPhone Apps

One of the biggest hurdles between writing small examples and building large applications is learning the principles for organizing and structuring code. Learning how to implement complicated algorithms might make you a good computer scientist, and is rigorously taught in schools, but structuring large applications has always been more of a black art to me – especially when writing code that interacts with the real world, since we now deal with things that happen asynchronously.

Thus, I’m putting together some of the major design principles in iPhone programming. I’ve been writing a music player app over the December holiday, and much of its software design comes from previous projects, reading the iSoul and Dropbox API code, and reading StackOverflow; here I’m pulling some of those ideas together.

MyAppDelegate

- (BOOL)application:(UIApplication *)application didFinishLaunchingWithOptions:(NSDictionary *)launchOptions {
    window = [[UIWindow alloc] initWithFrame:[[UIScreen mainScreen] bounds]];
    ...
    [window makeKeyAndVisible];
    return YES;
}

Firstly, Cocoa-based apps all start off with an AppDelegate. This is your main entry point into the world of iPhone and Cocoa apps, and it is our first and only place to store state for the duration of the program (I’m about to extend this idea, so bear with me). The AppDelegate is the entry point from which Controllers (handling logic) are launched and long-running state is handled. This leads us to two main ideas:

Model-View-Controller

Cocoa follows the idea that views (displayed to the user) are separate from business logic and data management (written in Controllers) and data representation (Models). This means your app will contain plenty of controllers, responsible both for handling user interaction and for handling interaction with data storage and data sources.

A controller can be an object responsible for talking to a REST API, or an object responsible for presenting a view to the user and capturing mouse and keyboard interaction with this view. I personally like to distinguish between these two by calling the former Managers and the latter ViewControllers, but they both fit into the MVC model the same way. I might get some flak on this, because people will disagree with me about whether a data source is a model, a controller, or something else. I prefer the controller nomenclature since I tend to consider data managers and view controllers structurally identical.

A model is a concise description of a piece of data. The controller talking to a remote API might return an array of model objects. My code might contain a Person object (a model), a PersonManager (a controller responsible for managing Person objects) and a PersonEditViewController (a controller responsible for presenting views that modify a Person object). Why this separation? Because now I can encapsulate the generic person logic into the Person model, the storage logic into the PersonManager, and the front-end logic into the ViewController.
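As a sketch (the property and method names here are invented for illustration), the split looks something like this:

// Person.h – the model: just data, no storage or screen logic
@interface Person : NSObject
@property (nonatomic, copy) NSString *name;   // invented fields
@property (nonatomic, copy) NSString *email;
@end

// PersonManager.h – a controller for storage, no UI
@interface PersonManager : NSObject
- (NSArray *)allPeople;               // returns Person model objects
- (void)savePerson:(Person *)person;
@end

// PersonEditViewController.h – a controller for views, no storage logic
@interface PersonEditViewController : UIViewController
@end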

Now I have all these controllers, managing data and presenting views, floating around. Structuring them and storing state brings me to the second idea:

Dependency Injection (and Singletons)

How do we manage global state? Or, in the case of our example, how does the PersonEditViewController work with the PersonManager to get a Person model object, display it to the user, capture the user’s edits, and store the new Person object?

The PersonManager can be made into a singleton object. Now, anywhere in the code, [PersonManager sharedManager] can be called and we get the instance of this Manager. We can imagine “getPerson” and “savePerson” methods on this instance, and we’re off to the races. This is a great way to structure libraries, since the global library initialization code can be captured in some singleton instance, and we can use this library anywhere. In fact, this is how the heart of the Dropbox API is structured – on app load, a DropboxSession singleton instance is created, which captures the authentication details of your app. Anywhere in your code that needs Dropbox access starts off by getting the singleton instance of the DropboxSession and works from there.
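The skeleton of such a singleton is tiny (a minimal sketch – not thread-safe; wrap the allocation in dispatch_once if that matters to you):

// in PersonManager.m
static PersonManager *sharedInstance = nil;   // one instance for the whole app

+ (PersonManager *)sharedManager {
    if (sharedInstance == nil) {
        sharedInstance = [[PersonManager alloc] init];
    }
    return sharedInstance;
}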

Singletons break down the otherwise nice object-oriented nature of our code and make testing much harder, since we now cannot unit-test objects with dummy instances of a singleton without modifying the singleton creation code. Singletons are no different from global variables – they break the modularity of your system, and make code harder to read and harder to reason about. The alternative is dependency injection. Initially we said that our AppDelegate is our only place to store shared state. Clearly, singletons allow state as well, but a more modular approach is using the AppDelegate to hold the major long-running state. So, we place our PersonManager object (the controller responsible for managing Person model instances) as an instance variable on our AppDelegate, and we inject it into any other controller that needs to access it – PersonEditViewController now has a “setPersonManager:” setter or an “initWithPersonManager:” constructor. Structuring code this way makes dependencies perfectly clear and allows for unit testing by passing dummy dependencies into an object.
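A sketch of what that injected constructor could look like (pre-ARC, personManager being an ivar declared in the header, and the nib handling hand-waved):

// PersonEditViewController.m
- (id)initWithPersonManager:(PersonManager *)manager {
    if ((self = [super initWithNibName:nil bundle:nil])) {
        personManager = [manager retain];   // injected, never looked up globally
    }
    return self;
}

- (void)dealloc {
    [personManager release];
    [super dealloc];
}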

Now that we have a way to structure our code in general, we need to manage data flow between these controllers. This leads us to talking about delegation and callbacks.

Delegation and Target-Action (Callbacks)

Both these approaches encapsulate the idea of communication through callbacks. Delegation is the approach of having a specific object – the delegate – be responsible for all the callbacks from the host object. Target-action is an approach where the host object can inform many “delegates” about behavioral changes. Let’s keep this rooted in practice.

Whenever an object creates a new object to do a specific task for itself (for example, a Controller creating a View to present to the user, or a Controller creating a socket connection to a remote API), the one-to-one communication between this worker and the original object can be asynchronously captured by having the main object be the delegate of the worker object. The worker object has a weak reference to its delegate, and can call methods on its delegate to communicate. These methods are captured in a common interface – in Objective-C by using Protocols, the equivalent of Java’s Interfaces. For example, the Dropbox API has a controller responsible for letting the user log into Dropbox. You can create this controller at any point, set yourself as the delegate object, and display it to the user. When the login is done, your object’s success or failure methods get called:

Your controller:

- (void)didPressLink {
    DBLoginController* controller = [[DBLoginController new] autorelease];
    controller.delegate = self;
    [controller presentFromController:self];
}

- (void)loginControllerDidLogin:(DBLoginController*)controller {
    [self updateButtons];
    [self.navigationController pushViewController:photoViewController animated:YES];
}

- (void)loginControllerDidCancel:(DBLoginController*)controller {
}

And, in the DBLoginController, you can see where the delegate gets called:

- (void)didPressCancel {
    [self setWorking:NO];
    [self.navigationController.parentViewController dismissModalViewControllerAnimated:YES];
    [delegate loginControllerDidCancel:self];
}

Delegation is especially nice, since you can define many callback methods as part of the protocol, implement the ones you care about, and simply register yourself as a delegate. “worker.delegate = self” makes all the methods you wrote available to the worker. The target-action approach we’re about to see only connects a single method of yourself to a “worker” (bad nomenclature, sorry!), but everyone gets to join the fun.

So, delegates work great when a single worker is spawned to do something for an object – where some long-running object creates a worker to do something for it – but they do not work in the case where many objects want to know when something asynchronously happens. If we want one-to-many communication, we can’t just have a single delegate. We can still structure communication around a Protocol and a list of callbacks, or use the slightly looser Target-Action approach. Here is the callback approach:

- (void)registerPlayerStateCallback:(id <PlayerStateCallbackProtocol>)callthis {
    [_callbacks addObject:callthis];  // the mutable array retains it for us
}

- (void)notifyCallbacks {
    NSEnumerator *e = [_callbacks objectEnumerator];
    id <PlayerStateCallbackProtocol> callback;
    while ((callback = [e nextObject])) {
        [callback playerStateChanged:_state];
    }
}

These kinds of callbacks (or target-action) work especially well when you have some object that needs to know when long-running state changes. See the difference? When some long-running object spawns a worker to do something asynchronously, we use a delegate. If an object wants to know when (potentially long-running) state changes, the object registers an action of itself as a target of a state change. Yes, the two definitely overlap, but the one-to-one versus one-to-many differentiation helps in deciding which one to use.

Say, in our original example, a MainViewController shows a list of people in the system. This view should change whenever the PeopleManager’s internal list of people changes. Since the PeopleManager was created as shared state inside the AppDelegate, and we use dependency injection to pass the PeopleManager to the MainViewController, setting the MainViewController as the delegate of the PeopleManager would break any other code that also wants to be the delegate of the PeopleManager. This is easy to imagine – maybe there is an object that broadcasts to the web whenever the list of people changes, and it needs to get a callback from the PeopleManager as much as our front-end view controller does. Thus, we create a method locally that we want the PeopleManager to call whenever its state changes, and we register the specific instance of the MainViewController object and the method we want it to call with the PeopleManager.
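A sketch of what that loose registration could look like (every name here is invented, not a system API; performSelector: does the actual dispatch):

// MainViewController registers one of its own methods with the manager:
[peopleManager addTarget:self action:@selector(peopleListDidChange:)];

// Inside PeopleManager – _targets and _actions are parallel mutable arrays:
- (void)addTarget:(id)target action:(SEL)action {
    [_targets addObject:target];
    [_actions addObject:NSStringFromSelector(action)];
}

- (void)notifyTargets {
    for (NSUInteger i = 0; i < [_targets count]; i++) {
        SEL action = NSSelectorFromString([_actions objectAtIndex:i]);
        [[_targets objectAtIndex:i] performSelector:action withObject:self];
    }
}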

Notice that we can definitely use the Target-Action approach if we want one-to-one communication as well, and we can even sidestep the Protocol and register any method of an object as the receiver of some callback. From the Dropbox API:

@interface DBRequest
- (id)initWithURLRequest:(NSURLRequest*)request andInformTarget:(id)target selector:(SEL)selector;
@end

Naturally there are plenty more tricks to getting these things right. Reading code is probably the best way to learn how to structure large programs, but this here is a start. Hang your state off your AppDelegate, use dependency injection to make state accessible, set up objects as delegates of the things they spawn, and register objects as targets of events that happen. Access your stateful libraries through singletons and Boom! You’ve got yourself a maintainable, testable iPhone/Cocoa app.

As the experts can probably tell from this post, I am by no means an expert myself, so any feedback is welcomed in the comments!

Nighttime driving

If you google Nighttime Driving you get swamped with lawyers and mothers and policemen all yelling and clawing and shaking, fists and fingers, about how dangerous it is, they’ll save you money, put you in jail. words trying to guilt you or warn you or get your business or screw you over. Can someone please actually go nighttime driving? The woods at night, illuminated by only your headlights, each corner most definitely revealing a magical (nightmarish?) wonderful anxiety of the unknown, quickly blasting through to the next corner, the next slight variation of that same feeling.

It’s 2 in the morning up on Skyline, after rain, window open a crack and if you’re lucky you hear coyotes in the distance. Phantoms around the next corner… or the next? or the one after? Too excited to turn back, too scared to keep going, but the car is running well and the road is there for the taking. Let them babble, I’ll throw it all in their faces.

I’m reading Kerouac:

“it comes over me in the form of horror of an eternal condition of sick mortality in me – In me and everyone else – I left completely nude of all poor protective devices like thoughts about life or meditations under trees and the “ultimate” and all that shit, in face the other pitiful devices of making supper or saying “What do I do now next? chop wood?” – I see myself as just doomed, pitiful – An awful realization that I have been fooling myself all my life thinking there was a next thing to do to keep the show going and actually Im just a sick clown and so if everybody else”

What rebuilding an engine (re)taught me about software engineering

Over the last month I have stripped down and completely rebuilt the engine in my Subaru WRX. Throughout the process (and with tons of help from my Dad, NASIOC and Tom Weimer) we went through many iterations of trying something, finding that it didn’t work and having to backtrack and try again. This single aspect of the rebuild made it an incredibly frustrating experience – having to take apart what you just built sucks, especially since it usually happens because we got the order wrong or forgot one small little piece. There’s nothing that turns around a mood quicker than realizing you were wrong and now you have to not just do it again, but recover from your mess-up before you even get another shot. This same situation comes up a lot in writing code for research.

Good engineering involves building complicated systems by breaking complicated processes into smaller cooperative blocks. This says nothing about the process of actually building a system, though. Just like rebuilding the car, you end up replacing these smaller blocks multiple times as the project goes on. It is frustrating to have to rewrite code multiple times, but you improve it every time you rewrite – that’s why it’s worth doing. The magic of abstraction is what allows us to have these independent blocks.

I feel like I knew this about software engineering and writing code. It is frustrating and annoying to rip out something you wrote yesterday and redo it, but creating perfection is a long series of small steps and a lot of hard work. It’s much less frustrating when you expect stuff not to fit and to need replacing or rewriting, and the instantaneous feedback of doing real mechanical things drove this point home very deeply. I’m excited to get back and write some GPU and compiler code!

reverse-i-search: Quicksilver for Bash

Solomon Boulos just pointed out the most useful bash feature I’ve found so far. Reverse-i-search! Like the Emacs functionality, it completes the command you’re typing from your history file, searching backwards from the most recent command.

Access this magic by hitting Control-R. Start typing a command and it will autocomplete it for you. Keep hitting Control-R and it will cycle through all possible matches. Incredibly useful! It’s like ! history expansion, except you can see what it’s going to execute.
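For example, with a made-up history, hitting Control-R and typing “ssh” gives you something like:

(reverse-i-search)`ssh': ssh pi@192.168.1.42

Enter runs it, Control-R again jumps to the next older match, and Control-G bails out of the search.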

Running CUDA without running X

I want to run multiple graphics cards under CUDA without starting X, since I have a bunch of GPUs sitting in a headless box. Since the GPU driver does not get loaded without X running (or, in Ubuntu, only some of the cards get set up), I put together an init.d script that brings up all the NVIDIA GPUs on your Ubuntu 10.04 box:

#!/bin/bash

COMMAND="$1"
case $COMMAND in
start|stop|restart)

    if [ "$COMMAND" = "restart" ] || [ "$COMMAND" = "stop" ]; then
        # count the existing /dev/nvidia* nodes (minus /dev/nvidiactl) and remove them
        NVIDIADEV=`ls -l /dev/nvidia* | awk '{if ($9 != "/dev/nvidiactl") a+=1}END{print a}'`
        NDEV=`expr $NVIDIADEV - 1`
        for i in `seq 0 $NDEV`; do
            unlink /dev/nvidia$i
        done
        unlink /dev/nvidiactl
    fi

    if [ "$COMMAND" = "restart" ] || [ "$COMMAND" = "start" ]; then

        modprobe nvidia

        if [ "$?" -eq 0 ]; then

            # one device node per VGA adapter reported by lspci, plus the control node
            NVGA=`/usr/bin/lspci | grep VGA | wc -l`

            N=`expr $NVGA - 1`
            for i in `seq 0 $N`; do
                mknod -m 666 /dev/nvidia$i c 195 $i
            done
            mknod -m 666 /dev/nvidiactl c 195 255

        fi
    fi
    ;;

*)
    echo "$COMMAND is not supported on this job."
    ;;
esac

Coffee

Black? No, with milk and sugar. A civilized drink, by all accounts. Sweet, aromatic, only with the finest freshly ground beans. I can taste the difference, I think. At least, I can’t help but wrinkle my nose and suppress the urge to dump the black tar that comes out of the office thermos. So surely, yes, I can taste the difference. Then again, it might just be the circumstances – after midnight anything with caffeine will taste better. Although it’s a fine drink to start your day with, something about coffee makes it a drink of the night. When your desklamp and a monitor are in stark contrast to the blackness outside, the drink somehow feels at home. It’s content to sit in my cup, lazily swirling as I type. In the morning it’s a swallow-and-go experience, almost as if it wants to disappear out of the sunlight. But at night it languishes, a stray cat outside your window – not a friend, but a presence, a fog through which the rest of the world loses shape. Opportunities bristling with anticipation to be pounced on, yet happily waiting until the cup is once again standing between printouts and cables – knowing full well that the cup touching the table starts the inevitable recession of those very opportunities so close just a second before. Nothing to do but pick up the warm ceramic again. It’s not a dependence, that would imply a subjugation of myself – rather, it’s the feeling of an old friend I have a brief chance to confer with. I let life stream by for a little while, conversing, bringing up memories, making plans. Energized after this brief interlude I walk on – no plans to…

I digress, the cup is now empty, the visit is over, and Haskell doesn’t learn itself.

Creating a SVN-Like Centralized Git repository

There’s a simple way to create a central git repository:

git init --bare --shared

This will create a repository that refuses non-fast-forward pushes – you can’t rewrite history on the server, only add new commits on top – and it has permissions set up to allow multiple users to access the repo. The config, as of git 1.7, looks like this:

[core]
	repositoryformatversion = 0
	filemode = true
	bare = true
	ignorecase = true
	sharedrepository = 1
[receive]
	denyNonFastforwards = true

Now you can clone this repository easily!

If you want to take a current local repository and create a remote central repository:

First, we clone the local repository into a new, bare repository:

cd tempdir
git clone --bare /path/to/original/repo

Then we add the denyNonFastforwards option and the shared option:

git config receive.denyNonFastforwards true
git config core.sharedrepository 1

We can now copy this new repository to whatever server we want and set all that up. Lastly, we want our original repository to point to this new repository as its primary remote repository. For this we have to set up some remotes and the like:

cd /path/to/original/repository
git remote add origin git+ssh://server/path/to/remote/repo
git config branch.master.remote origin
git config branch.master.merge refs/heads/master

The last two commands set up the local repository to track the remote repository by default, so that it’s the repo that git push and git pull talk to.

This blog post was inspired by this, this, this and this. And my general frustration of not having all this in a little script.

New hardware! 32 core box!

Our new server just arrived! Quad socket 32 core box, dope shit! Fresh from Intel, you can hardly buy this kind of hardware.

Haskell Idioms (Part 1?)

I’m taking Stanford’s Programming Languages class, and I’m back in the happy fuzzy world of functional programming that started my computer science education back at Berkeley with Prof. Brian Harvey’s CS61A. Except this time around we’re doing it in Haskell!

Haskell, so far, has a couple of cool features that have jumped out at me. (Features is not really the right word – design decisions, rather?):

  • Lazy Evaluation – this is the big, obvious one that really influences the idioms you use to write code
  • Pattern Matching that’s so ridiculous that if I hadn’t been writing Scala I would be completely lost; as it is, I’m only 99% mind-blown.
  • Cutesy math syntax (for example, list comprehension looks just like math notation: [x * 2 | x <- originalList])
  • Multi-Branch functions. That is, you can define multiple versions of a function and the correct one is picked by pattern matching. Sounds like overloading, but it blows your mind combined with laziness, of which I’ll give an example in a second
  • Functions are either curried or pattern-matched. So, pretty much all my functions can be partially applied.
  • Convert prefix functions to infix using backticks (`) and infix operators to prefix using parentheses (). Neato
  • Type Inference. I’m a huge believer in static typing with type inference these days. Eating cake while having cake is the best.

OK, that was a boring lame list of stuff that I’ll probably skip whenever I read this blog post. Notice how I blog to remember stuff.

Multi-Branch function example:

listLength [] = 0
listLength (x:xs) = 1 + listLength xs

That just looks cool. Straight out of a math textbook.

Helper Functions Idiom:

Haskell (obviously) has nested functions, so you can write functions using local helper functions and accumulators inside the function’s scope. For example, you can write reverse in the “simple” functional manner like so:

-- (this clashes with the Prelude's reverse; in a real module you'd import Prelude hiding (reverse))
reverse [] = []
reverse (x:xs) = (reverse xs) ++ [x]

Using, naturally, the cons operator (:), the concatenate operator (++) and the list syntax []. This is an inefficient algorithm: each ++ walks the already-reversed prefix, so the whole thing is quadratic in the length of the list. Or, you can write reverse sweeping the list only once, using local scope and a helper function:

reverse xs =
  let rev( [], z) = z
      rev( y:ys, z) = rev( ys, y:z )
  in rev( xs, [] )

That’s a whole lot in only 4 lines, but it only sweeps the list once to reverse it. As you can (kinda) tell, it keeps cons-sing (: operator) the first element of the rest of the list to the first element of the new list being built. That is, it puts the “next” element as the “first” element. This is how you reverse a list, if you think about it for a bit. Cool huh!

Multi-Branch Function Idiom using Laziness

Since the compiler picks the function branch that pattern matches (including type matches) what you’re calling the function on, you can write plenty of code using a Haskell idiom that would epic-fail in other languages.

The idiom consists of writing a main function that generates (lazily) all possible points in the solution space and exhaustively searches through it, and writing a set of small “filter” branches of this function that guides the search to be efficient. The laziness prevents tons of temporaries, the branches guides the search.

For example, consider substring matching. You can say that finding whether a string is a substring of another string is the same as generating all suffixes of the string and checking whether your string is a prefix of any of these suffixes. Let’s just write that:

import Data.List (isPrefixOf)

x `isSubString` s = or [x `isPrefixOf` t | t <- suffixes s ]
suffixes [] = [""]
suffixes (x:xs) = (x:xs) : suffixes xs

mind blown! zomg! So, on the first line we say: x is a substring of s if x isPrefixOf t evaluates to true for at least one t in the suffixes of s (that’s what the or does). Then we define the suffixes of the empty list as a list containing just the empty string, and the suffixes of a non-empty list as that list, cons’ed onto the suffixes of everything but the first character of the input list. And that’s all you need – it won’t generate all options eagerly, it’ll lazily evaluate things, pattern-matching the shape of the list to the branch of the function, going nuts.

Define your own conditionals!

Since Haskell is lazy, you can write your own if statement or boolean operator and it’s just like the real thing! In fact, you can write your own eval function in about 10 lines.
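A minimal sketch of what I mean (my own toy example):

-- only one of the two branches is ever evaluated, thanks to laziness
myIf :: Bool -> a -> a -> a
myIf True  t _ = t
myIf False _ e = e

-- a short-circuiting "and", just like the built-in (&&)
myAnd :: Bool -> Bool -> Bool
myAnd False _ = False
myAnd True  b = b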

Infinite data structures!

Of course, this just comes with laziness. Just implement prime number searching or whatever.
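The classic example is a lazy list of every prime (a sketch using simple trial division, nowhere near the fastest sieve):

-- an infinite list; only as much gets computed as you consume
primes :: [Integer]
primes = sieve [2..]
  where sieve (p:xs) = p : sieve [x | x <- xs, x `mod` p /= 0]

-- take 10 primes ==> [2,3,5,7,11,13,17,19,23,29]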

Lots of other cool stuff in here.

GTC 2010 Trip Report

I spent the week of September 20th through 23rd at NVIDIA’s GPU Technology Conference, and I’m now reporting back on some of the events that particularly interested me. I went to a whole range of talks, varying wildly in subject and quality, so I will gracefully ignore some of the talks while focusing on events that relate to my research or particularly struck my fancy.

As is usually the case with these conferences, I find that the most valuable experience is meeting the community of researchers from around the world. Talks, papers and the like can be read on your own, but the conversation and networking at these conferences make the visit absolutely worth it. GTC was no different, and I connected with several people that I aim to stay in touch with as research colleagues. This also seems to be an apt place to start my trip report! So, I met the following people during the week:

  • Tobias Brandvik (Cambridge), doing stencil abstractions for fluid flow solvers on multi-GPU clusters
  • Bryan Catanzaro (UC Berkeley), building Copperhead, a data-parallel Python derivative that completely abstracts away the GPU.
  • Vasily Volkov (UC Berkeley), working on low-level performance tuning of GPUs and breaking down the CUDA abstractions.
  • Jason Cohen (NVIDIA), part of the Parallel NSight debugger team (Rick Shane from Adobe introduced us)
  • Nathan Bell (NVIDIA), the other half of the Thrust developer team (Jared Hoberock introduced us)

I will now attempt to distill the most interesting talks I attended to their core observation, and any notes I found especially interesting.

Opening Keynote

Jen-Hsun Huang, NVIDIA CEO

Jen-Hsun was very much pushing the importance of parallel programming, reiterating the arguments about the power wall the industry has hit, and the fact that NVIDIA has been building highly parallel hardware for years now. The Senior Vice President of Content and Technology, Tony Tamasi, showed off several demos of Fermi’s tessellation capability (an endless city, procedurally generated, and tessellated on the GPU to give the equivalent of a 100 billion triangle scene). He moved on to the physics simulation capabilities of these GPUs by showing a re-imagination of Ron Fedkiw’s lighthouse scene running in real time. A multi-grid height field combined with particles gives real-time water, while flotsam is simulated as rigid bodies. All three simulations are coupled, and run in real time. Although it still looks simulated, it’s definitely ahead of today’s games.

The big statistic here for developers is the rate of CUDA adoption. NVIDIA very much pushes the idea that they have 100 million GPUs out in the field, all of which can run CUDA programs. The reality of the situation is, naturally, not nearly this good, but it’s a nice statistic to have. The Folding@Home and SETI@Home people are reporting statistics massively skewed towards people running massively parallel processors, so there’s surely some truth to these numbers.

NVIDIA announced CUDA-x86, a new compiler from PGI that allows compilation from CUDA code to x86 code, allowing developers to write programs that run on multicore processors or throughput-based GPUs. In my mind this is just a nice-to-have, since none of the serious optimizations you do for the GPU (think coalesced memory accesses, specific thread groupings to exploit vector lanes and vector lane synchronization) will carry over to x86, and they might even hurt performance (cache misses caused by GPU-focused optimizations). Still, the write-once-run-anywhere dream is clearly very important, which is great for the research I’m working on.

Several other impressive demos were also shown: Dr. Black’s beating-heart surgery that tracks a heart in real time to make incisions with a robotic scalpel, Adobe’s David Salesin showing off refocusing using plenoptic lenses (originally done by Ren Ng from Stanford), and the iRay photorealistic raytracer running on 64 Fermis, rendering to your web browser at interactive rates. Clearly graphics has lots of evolution left as it enters the world of massively distributed computing.

Lastly, NVIDIA announced the codenames of their next two chips – Kepler and Maxwell – which will aim for 3 times and 10 times the performance per watt of today’s Fermi.

A Fast, Scalable High-Order Unstructured Compressible Flow Solver

David Williams & Patrice Castonguay (Stanford)

I was curious to find out how this group built their flow solver to run on a GPU cluster. This is an example of what we’d like Liszt (our research language) to be able to do, so seeing a hand-written version was profitable. They followed the same MPI ideas as are generally used – partition your unstructured mesh and create ghost cells for the data you want to share across partition boundaries, placing a partition on each machine. They implemented their algorithm using a gather approach: the GPU performs two stages of work, a first stage to calculate cell-based values, and a second stage to reduce these values to edge-based values. The synchronization between these two stages includes the MPI all-to-all step to resolve ghost cell values.

Since they wrote a specific instance of a RANS algorithm, they did not do any scheduling work or fine-grain synchronization, their two-stage gather was enough to run the algorithm. They were getting good linear speedups on their cluster, and managed to achieve a sustained 1.3 Teraflops on a cluster of 16 GPUs using a mesh of 320 000 cells.

New Programming Tools for GPU Computing

Sequoia, Copperhead, GMAC, Thrust

Unfortunately, panel discussions with 3 minute introductions for each project are never enough to really understand any of the projects. The most striking part of this panel was the obvious programming-languages direction researchers have taken. Except for Thrust (although it can be considered an embedded domain specific language), all the work has a programming language spin on it. The major concern of the audience was clearly the support issues and feature-scarcity of new programming languages, which the different projects addressed differently – Sequoia tries to be a runtime more than a full language, Copperhead attempts to be deeply coupled to Python, Thrust passes itself off as a library and GMAC aims to be language-agnostic, creating a universal address space between accelerators (GPUs) and processors that any language can take advantage of.

PyCUDA (2041)

Andreas Klöckner

Andreas’ PyCUDA talk was mostly an introduction to PyCUDA and a brief overview of how it works and the motivation behind it. I found this talk especially interesting, since he took an approach very similar to the way web frameworks use templates to generate web pages. Kernels in PyCUDA are strings of text, with embedded Python variables that are replaced when you ask his framework to compile the kernel. He built this JITting engine as an extension of Python, allowing you to write kernels at runtime and pass them off to the nvcc compiler to generate CUDA code. I liked the fairly low-level control he allows you to achieve inside of Python, but PyCUDA does not attempt to abstract away CUDA or the GPU. It is, rather, very similar in spirit to the Boost Python bindings – allowing you to build software in Python, rewrite the slow parts in C (or CUDA), and call these low-level functions directly from Python. PyCUDA has the added benefit that you do not even need to leave the Python interpreter. His whole approach was fascinating, especially since this is what I would have done were I faced with a similar problem, given my web framework experience. Andreas likens this to the LISP-style metaprogramming that’s been around since the 60s – manipulating string kernels, “pasting” in values on the fly.

PyCUDA in general is built to interface tightly with numpy and scipy, two Python packages that supply MATLAB-like functionality to Python users. PyCUDA does not attempt to address the type inference issue of moving from a dynamically typed language to a statically typed one, since it depends on the user to write kernels with the correct types, and on numpy to supply runtime types of the multi-dimensional arrays that PyCUDA works with. Copperhead, Bryan Catanzaro’s data-parallel version of Python, abstracts away the GPU entirely, so it has to deal with type inference, and he built a Hindley-Milner style type inference system into Python to handle this. Copperhead is built on top of PyCUDA, so he uses the JITting capabilities of PyCUDA to get to the GPU – a great decision in my mind, since someone else is now responsible for the low level details of catching errors and generating kernels.

Better Performance at Lower Occupancy (2238, Wed 15:00)

Vasily Volkov

(Slides here) Vasily has published several papers on understanding GPU hardware and tuning codes for the GPU, and his talk addressed the focus on massive multi-threading of GPU apps, showing that Instruction Level Parallelism is still a very important approach on the GPU. In the process of demonstrating this, he disproved several of NVIDIA’s claims in their Programming Guide. This talk was very interesting to me, since it addressed many of the low level architectural questions that Kayvon, Solomon, Zach, Jorge and I have discussed in detail. The word “occupancy” in this talk refers to the percentage of threads spawned out of the total supported number on the multiprocessor.

The general recommendation for hiding latencies is to use more threads per block and more threads per multiprocessor. Vasily demonstrates that faster codes tend to run at lower occupancies, citing as examples the differences between CUBLAS and CUFFT versions – every performance improvement came with a lowering of threads per block. Vasily shows in the talk how to hide arithmetic latency and memory latency using fewer threads, and still get a total performance increase. He also attempts to disprove the fallacy that shared memory is as fast as the register file, addressing the bandwidth differences between the two.

The heart of his talk is the fact that Streaming Multi-Processors are still pipelined machines, regardless of the multi-threaded wide-vector-lane nature of these processors. By writing your code as sets of independent operations, keeping data dependencies to a minimum and structuring code to keep the pipeline filled, you can get massive performance regardless of the machine’s occupancy. He shows the roofline model for a simple SAXPY code, and how he can influence the memory-bound part of the model by doing multiple independent SAXPY operations in a single thread (since all but one of the input values are the same in a SAXPY operation). He continues to show that he can get 87% of the peak bandwidth available to an SMP at only 8% occupancy (while cudaMemCpy achieves only 71% of peak). Lastly he makes the point that the banked nature of shared memory makes it impossible for shared memory codes to achieve full bandwidth utilization. This leads to the recommendation to use as few threads using as much registers as possible.

The attention to detail in this talk, as Vasily played with the limits of the GPU, allowed him to break down several of the “nice” abstractions CUDA provides.

Large-Scale Gas Turbine Simulations on GPU Clusters (2118, Wed 16:00)

Tobias Brandvik

Tobias and his group at Cambridge have addressed a problem very similar to the one we’re addressing with Liszt. They want to write a mesh-based simulation system that runs on today’s high performance machines, without having to rewrite the code for each architecture. Specifically, they are building a production-quality solver for use in the aerospace industry. Their approach has many overlaps with Liszt, targeting GPU clusters while avoiding the need to rewrite all their code for every variation of today’s heterogeneous machines. In contrast to our approach, they work at a fairly low level, since they only attempt to rewrite the approximately 10% of their code base that is stencil-based calculations. A stencil is defined as the mesh accesses and data reads/writes a kernel will perform, which allows them to generate specialized CUDA source code for each stencil in their application. This 10% of the code is roughly responsible for 90% of the run time, and can be abstracted as math kernels running specific stencils across a mesh. The cluster aspect of their code is still hand-written MPI code, but rather than write GPU- or SMP-specific code that runs on an MPI node, they use these stencils and math kernels.

In terms of domain-specific optimizations, he referred to the 2008 Supercomputing paper by Datta et al that showed a set of optimizations to run stencil codes at high performance on GPU devices. They attempted to implement these optimizations as part of their source-to-source compilation process for kernels.

Their approach requires the programmer to hand-write the stencil and the math kernel. This allowed them to embed the stencil language partially inside Fortran. They then took their current simulation system (TBLOCK, approximately 40kloc of Fortran) and factored out the stencil-based calculations into separate stencil definitions and math kernels. This allowed them to keep most of their current simulation code while spending “a couple of months” (Tobias’ words) on rewriting calculations that fit this stencil scheme into their embedded language with accompanying stencil definition files. Their system, called TurboStream, has on the order of 15 different stencils in it, with 3000 different stencil definitions, and they run simulations on a 64-GPU cluster at the University of Cambridge.

Tobias made an interesting comment during the talk, saying that their biggest concern is not pure speed, since their solvers are relatively simple, but that they want to be able to handle much more data than they currently do – this was their biggest motivation for moving to GPU clusters. By way of example, he showed the fine detail of turbofan fins spinning through slits cut in the housing of a jet engine, and the fine-grain simulation details around these slits – geometry that was previously ignored.

Overall Impressions

The biggest gain of the conference was the networking with several other researchers, and getting an overall view of the field as several groups attempt to solve the same problem – how do we write code that can run on all of today’s hardware choices?

I find myself using OpenTerminal a lot – mostly to open a terminal in a directory, followed by “mate .” to open a TextMate project in that directory. This quickly becomes annoying, so after looking into AppleScript, I took the plunge and wrote my first AppleScript. What a weird language. Anyway, you can dump this into AppleScript Editor, and when you run it, it opens a TextMate project of the front-most Finder window:

on run
    tell application "Finder"
        try
            activate
            set frontWin to folder of front window as string
            set frontWinPath to (get POSIX path of frontWin)
            tell application "TextMate"
                activate
                open frontWinPath
            end tell
        on error error_message
            beep
            display dialog error_message buttons {"OK"} default button 1
        end try
    end tell
end run

Save this as an Application (not a Script), and drag it onto your finder toolbar. Voila! TextMate at your fingertips.

Thanks you Mac OS X Hints, from where I got the pattern to do this.