Impressed by Slow Code

At one time I was interested in--even enthralled by--low-level optimization.

Beautiful and clever tricks abound. Got a function call followed by a return statement? Replace the pair with a single jump instruction. Once you've realized that "load effective address" operations are actually doing math, then they can subsume short sequences of adds and shifts. On processors with fast "count leading zero bits" instructions, entire loops can be replaced with a couple of lines of linear code.

I spent a long time doing that before I realized it was a mechanical process.

I don't necessarily mean mechanical in the "a good compiler can do the same thing" sense, but that it's a raw engineering problem to take a function and make it faster. Take a simple routine that potentially loops through a lot of data, like a case insensitive string comparison. The first step is to get as many instructions out of the loop as possible. See if what remains can be rephrased using fewer or more efficient instructions. Can any of the calculations be replaced with a small table? Is there a way to process multiple elements at the same time using vector instructions?

The truth is that there's no magic in taking a well-understood, working function, analyzing it, and rewriting it in a way that involves doing slightly or even dramatically less work at run-time. If I ended up with a routine that was a bottleneck, I know I could take the time to make it faster. Or someone else could. Or if it was small enough I could post it to an assembly language programming forum and come back in a couple of days when the dust settled.

What's much more interesting is speeding up something complex, a program where all the time isn't going into a couple of obvious hotspots.

All of a sudden, that view through the low-level magnifying glass is misleading. Yes, that's clearly an N-squared algorithm right there, but it may not matter at all. (It might only get called with with low values of N, for example.) This loop here contains many extraneous instructions, but that's hardly a big picture view. None of this helps with understanding the overall data flow, how much computation is really being done, and where the potential for simplification lies.

Working at that level, it makes sense to use a language that keeps you from thinking about exactly how your code maps to the underlying hardware. It can take a bit of faith to set aside deeply ingrained instincts about performance and concerns with low-level benchmarks, but I've seen Python programs that ended up faster than C. I've seen complex programs running under the Erlang virtual machine that are done executing before my finger is off the return key.

And that's what's impressive: code that is so easy to label as slow upon first glance, code containing functions that can--in isolation--be definitively proven to be dozens or hundreds of times slower than what's possible on a given CPU, and yet the overall program is decidedly one of high performance.

(If you liked this, you might enjoy Timidity Does Not Convince.)