I'm a recovering programmer who has been designing video games since the 1980s, doing things that seem baroquely hardcore in retrospect, like writing Super Nintendo games entirely in assembly language. These days I use whatever tools are the most fun and give me the biggest advantage.
Revisiting "Programming as if Performance Mattered"

In 2004 I wrote Programming as if Performance Mattered, which became one of my most widely read articles. (If you haven't read it yet, go ahead; the rest of this entry won't make a lot of sense otherwise. Plus there are spoilers, something that doesn't affect most tech articles.) In addition to all the usual aggregator sites, it made Slashdot, which resulted in a flood of email, both complimentary and bitter. Most of those who disagreed with me can be divided into two groups.
The first group flat-out didn't get it. They lectured me about how my results were an anomaly, that interpreted languages are dog slow, and that performance comes from hardcore devotion to low-level optimization--despite my entire point being about avoiding knee-jerk definitions of fast and slow. The mention of game programming at the end was a particular sore point for these people. "You obviously know nothing about writing games," they raved, "or else you'd know that every line of every game is carefully crafted for the utmost performance." The amusing part is that I've spent almost my entire professional career--and a fairly unprofessional freelance stint before that--writing games.
The second group was more savvy. These people had experience writing image decoders and knew that my timings, from an absolute point of view, were nowhere near the theoretical limit. I talked of decoding the sample image in under 1/60th of a second, and they claimed significantly better numbers. And they're completely correct. In most cases 1/60th of a second is plenty fast for decoding an image. But if a web page has 30 images on it, we're now up to half a second just for the raw decoding time. Good C code to do the same thing will win by a large margin. So the members of this group, like the first, dismissed my overall point.
What surprised me about the second group was the assumption that my Erlang code is as fast as it could possibly get, when in fact there are easy ways of speeding it up.
First, just to keep the shock value high, I kept my code in pure, interpreted Erlang. But there's a true compiler as part of the standard Erlang distribution, and simply compiling the tga module will halve execution time, if not decrease it by a larger factor.
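That step is a one-liner. A minimal sketch using compile:file/1 (the filename is an assumption; inside the Erlang shell, c(tga). is the usual shortcut, and erlc tga.erl works from the command line):

```erlang
%% Compile tga.erl to BEAM bytecode from Erlang itself.
%% On success compile:file/1 writes tga.beam and returns
%% {ok, ModuleName}.
{ok, tga} = compile:file("tga.erl").
```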
Second, I completely ignored concurrent solutions, both within the decoding of a single image and potentially spinning each image into its own process. The latter solution wouldn't improve execution time of my test case, but could be a big win if many images are decoded.
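A minimal sketch of the one-process-per-image idea; the module name, decode_all, and the trivial stand-in decoder are mine, not from the article:

```erlang
-module(pdecode).
-export([decode_all/1]).

%% Spawn one process per image, then collect the decoded
%% results in their original order.
decode_all(Images) ->
    Parent = self(),
    Pids = [spawn(fun() -> Parent ! {self(), decode_rgb(Img)} end)
            || Img <- Images],
    [receive {Pid, Decoded} -> Decoded end || Pid <- Pids].

%% Stand-in decoder: expand 24-bit pixels to 32-bit with a
%% fully opaque alpha. The real decoder also handles the
%% magic transparent color.
decode_rgb(Pixels) ->
    << <<R,G,B,255>> || <<R,G,B>> <= Pixels >>.
```

Since message passing copies the binaries between processes, this only pays off when the per-image decoding work outweighs the copying, as it would for many images at once.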
Then there's perhaps the most obvious thing to do, the first step when it comes to understanding the performance of real code. Perhaps my detailed optimization account made it appear that I had reached the end of the road, that no more performance could be eked out of the Erlang code. In any case, no one suggested profiling the code to see if there are any obvious bottlenecks. And there is such a bottleneck.
(There's one more issue too: in the end, the image decoder was sped up enough that it was executing below the precision threshold of the wall-clock timings of timer:tc/3. I could go in and remove parts of the decoder--obviously giving incorrect results--and still get back the same timing of 15,000 microseconds. The key point is that my reported timings were likely overstating the actual execution time.)
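For reference, timer:tc/3 runs a single call and reports elapsed wall-clock microseconds, which is why a sufficiently fast decoder bottoms out at the clock's resolution:

```erlang
%% timer:tc(Module, Function, Args) runs the call once and
%% returns {MicrosecondsElapsed, Result}.
{Micros, Result} = timer:tc(lists, seq, [1, 1000]),
true = is_integer(Micros),
1000 = length(Result).
```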
Here's the output of the eprof profiler on decode_rgb1, which is part of decode_rgb. The final version of this function last time around was this:
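As a hedged sketch (the exact listing may have differed), that pre-R12B version decoded via lists, turning the 255,0,255 magic pixels into 0,0,0,0 and giving everything else an alpha of 255:

```erlang
-module(tga_list).
-export([decode_rgb/1]).

%% Pre-R12B style: convert the binary to a list, decode
%% three bytes at a time, then convert back to a binary.
decode_rgb(Pixels) ->
    list_to_binary(decode_rgb1(binary_to_list(Pixels))).

%% The 255,0,255 magic pixel becomes 0,0,0,0 (transparent);
%% all other pixels get an alpha of 255 (opaque).
decode_rgb1([255,0,255 | Rest]) ->
    [0,0,0,0 | decode_rgb1(Rest)];
decode_rgb1([R,G,B | Rest]) ->
    [R,G,B,255 | decode_rgb1(Rest)];
decode_rgb1([]) ->
    [].
```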
In Erlang R12B both of these binary-handling issues have been addressed, so decode_rgb can be written in the straightforward way, operating on binaries the whole time:
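A sketch of what that straightforward version could look like (my reconstruction, not the original listing): it matches the input binary directly and appends each output pixel to an accumulator binary, which R12B makes cheap.

```erlang
-module(tga_bin).
-export([decode_rgb/1]).

%% R12B style: pattern match the binary three bytes at a
%% time and build the 32-bit output binary as we go.
decode_rgb(Pixels) ->
    decode_rgb(Pixels, <<>>).

decode_rgb(<<255,0,255, Rest/binary>>, Acc) ->
    decode_rgb(Rest, <<Acc/binary, 0,0,0,0>>);
decode_rgb(<<R,G,B, Rest/binary>>, Acc) ->
    decode_rgb(Rest, <<Acc/binary, R,G,B,255>>);
decode_rgb(<<>>, Acc) ->
    Acc.
```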
But we can do better with a small change to the specification. Remember, decode_rgb is a translation from 24-bit to 32-bit pixels. When the initial pixel is a magic number--255,0,255--the alpha channel of the output is set to zero, indicating transparency. All other pixels have the alpha set to 255, which is fully opaque. If you look at the code, you'll see that the 255,0,255 pixels actually get turned into 0,0,0,0 instead of 255,0,255,0. There's no real reason for that. In fact, if we go with the simpler approach of only changing the alpha value, then decode_rgb can be written in an amazingly clean way:
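A sketch of that clean form (again my reconstruction): a single binary comprehension in which only the alpha byte depends on the pixel value.

```erlang
-module(tga_clean).
-export([decode_rgb/1]).

%% Copy each 24-bit pixel through unchanged and append an
%% alpha byte computed from the color.
decode_rgb(Pixels) ->
    << <<R,G,B,(alpha(R,G,B))>> || <<R,G,B>> <= Pixels >>.

%% Alpha is 0 (transparent) for the 255,0,255 magic color,
%% 255 (opaque) for everything else.
alpha(255,0,255) -> 0;
alpha(_,_,_)     -> 255.
```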
(Also see the follow-up.)