Revisiting "Programming as if Performance Mattered"

In 2004 I wrote "Programming as if Performance Mattered," which became one of my most widely read articles. (If you haven't read it yet, go ahead; the rest of this entry won't make a lot of sense otherwise. Plus there are spoilers, something that doesn't affect most tech articles.) In addition to all the usual aggregator sites, it made Slashdot, which resulted in a flood of email, both complimentary and bitter. Most of those who disagreed with me can be divided into two groups.

The first group flat-out didn't get it. They lectured me about how my results were an anomaly, that interpreted languages are dog slow, and that performance comes from hardcore devotion to low-level optimization. This was despite my entire point being to avoid knee-jerk definitions of fast and slow. The mention of game programming at the end was a particular sore point for these people. "You obviously know nothing about writing games," they raved, "or else you'd know that every line of every game is carefully crafted for the utmost performance." The amusing part is that I've spent almost my entire professional career--and a fairly unprofessional freelance stint before that--writing games.

The second group was more savvy. These people had experience writing image decoders and knew that my timings, from an absolute point of view, were nowhere near the theoretical limit. I talked of decoding the sample image in under 1/60th of a second, and they claimed significantly better numbers. And they're completely correct. In most cases 1/60th of a second is plenty fast for decoding an image. But if a web page has 30 images on it, we're now up to half a second just for the raw decoding time. Good C code to do the same thing will win by a large margin. So the members of this group, like the first, dismissed my overall point.

What surprised me about the second group was the assumption that my Erlang code is as fast as it could possibly get, when in fact there are easy ways of speeding it up.

First, just to keep the shock value high, I kept my code in pure, interpreted Erlang. But there's a true compiler as part of the standard Erlang distribution, and simply compiling the tga module will halve execution time, if not decrease it by a larger factor.

Second, I completely ignored concurrent solutions, both within the decoding of a single image and potentially spinning each image into its own process. The latter solution wouldn't improve execution time of my test case, but could be a big win if many images are decoded.
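To make the process-per-image idea concrete, here is a minimal sketch. It is not code from the original decoder: the module name pdecode_sketch, the function decode_all/2, and the DecodeFun argument are all made up for illustration, and it assumes each image can be decoded independently of the others.

```erlang
-module(pdecode_sketch).
-export([decode_all/2]).

%% Hypothetical helper: decode every image in its own process and
%% return the results in the same order as the input list.
%% DecodeFun stands in for whatever decodes one image.
decode_all(DecodeFun, Images) ->
    Parent = self(),
    %% Spawn one worker per image; each sends its result back
    %% tagged with a unique reference.
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, DecodeFun(Image)} end),
                Ref
            end || Image <- Images],
    %% Collect the results in input order by matching on each reference.
    [receive {Ref, Result} -> Result end || Ref <- Refs].
```

For a single image this only adds overhead, which matches the point above: the win, if any, shows up when there are many images and more than one core to run them on.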

Then there's perhaps the most obvious thing to do, the first step when it comes to understanding the performance of real code. Perhaps my detailed optimization account made it appear that I had reached the end of the road, that no more performance could be eked out of the Erlang code. In any case, no one suggested profiling the code to see if there are any obvious bottlenecks. And there is such a bottleneck.
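For anyone who hasn't used it, a profiling session with eprof looks roughly like the sketch below. The workload/0 function is a made-up stand-in so the example runs without the tga module; in practice the fun passed to eprof:profile/1 would call the code under test. This assumes a reasonably recent Erlang release with the tools application installed.

```erlang
-module(profile_sketch).
-export([run/0]).

%% Hypothetical stand-in for the code being profiled
%% (for example, a call to tga:test_compressed()).
workload() ->
    lists:sum([X * X || X <- lists:seq(1, 10000)]).

run() ->
    {ok, _Pid} = eprof:start(),
    %% Profile the calling process while the fun runs.
    {ok, Result} = eprof:profile(fun workload/0),
    %% Print per-function call counts and time percentages.
    eprof:analyze(),
    eprof:stop(),
    Result.
```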

(There's one more issue too: in the end, the image decoder was sped up enough that it was executing below the precision threshold of the wall clock timings of timer:tc/3. I could go in and remove parts of the decoder--obviously giving incorrect results--and still get back the same timings of 15,000 microseconds. The key point is that my reported timings were likely higher than the actual execution times.)
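One common way around a coarse clock is to time many runs and divide. The sketch below is only an illustration of that idea, not what I did in 2004; the module and function names are invented, and it uses timer:tc/1, which takes a fun and is available in newer releases.

```erlang
-module(timing_sketch).
-export([average_micros/2]).

%% Hypothetical helper: run Fun N times inside a single timer:tc call
%% and return the average wall clock time per run, in microseconds.
average_micros(Fun, N) when is_integer(N), N > 0 ->
    {Total, ok} = timer:tc(fun() -> repeat(Fun, N) end),
    Total / N.

%% Call Fun N times, discarding its result.
repeat(_Fun, 0) -> ok;
repeat(Fun, N) ->
    _ = Fun(),
    repeat(Fun, N - 1).
```

With a large enough N the total run time dwarfs the clock's granularity, so the per-run average becomes meaningful even when a single run is too quick to measure.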

Here's the output of the eprof profiler on tga:test_compressed():

FUNCTION                                       CALLS      TIME 

****** Process <0.46.0>    -- 100 % of profiled time *** 
tga:decode_rgb1/1                              54329      78 % 
lists:duplicate/3                              11790      7 % 
tga:reduce_rle_row/3                           2878       3 % 
tga:split/1                                    2878       3 % 
tga:combine/1                                  2874       3 % 
erlang:list_to_binary/1                        1051       2 % 
tga:expand/3                                   1995       1 % 
tga:continue_rle_row/7                         2709       1 % 
lists:reverse/1                                638        0 
...

Sure enough, most of the execution time is spent in decode_rgb1, which is part of decode_rgb. The final version of this function last time around was this:

decode_rgb(Pixels) ->
   list_to_binary(decode_rgb1(binary_to_list(Pixels))).
decode_rgb1([255,0,255 | Rest]) ->
   [0,0,0,0 | decode_rgb1(Rest)];
decode_rgb1([R,G,B | Rest]) ->
   [R,G,B,255 | decode_rgb1(Rest)];
decode_rgb1([]) -> [].

This is short, but contrived. The binary blob of pixels is turned into a list, then the new pixels are built up as a second list, and finally that list is turned back into a binary. There are two reasons for the contrivance. At the time, pattern matching was much faster on lists than binaries, so it was quicker to turn the pixels into a list up front (I timed it). Also, repeatedly appending to a binary was a huge no-no, so it was better to create a new list and turn it into a binary at the end.

In Erlang R12B both of these issues have been addressed, so decode_rgb can be written in the straightforward way, operating on binaries the whole time:

decode_rgb(Pixels) -> decode_rgb(Pixels, <<>>).
decode_rgb(<<255,0,255, Rest/binary>>, Result) ->
   decode_rgb(Rest, <<Result/binary,0,0,0,0>>);
decode_rgb(<<R,G,B, Rest/binary>>, Result) ->
   decode_rgb(Rest, <<Result/binary,R,G,B,255>>);
decode_rgb(<<>>, Result) -> Result.

This eliminates the memory pressure caused by expanding each byte of the binary to eight bytes (the cost of a list element on a 32-bit system).
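As a sanity check, the two versions can be put side by side and compared on the same input. The module below is just a test harness: the names decode_rgb_list, decode_rgb_bin, and same_output are renamings I've made up so that both versions can live in one module.

```erlang
-module(rgb_compare_sketch).
-export([decode_rgb_list/1, decode_rgb_bin/1, same_output/1]).

%% List-based version: binary -> list -> new list -> binary.
decode_rgb_list(Pixels) ->
    list_to_binary(decode_rgb1(binary_to_list(Pixels))).

decode_rgb1([255,0,255 | Rest]) -> [0,0,0,0 | decode_rgb1(Rest)];
decode_rgb1([R,G,B | Rest]) -> [R,G,B,255 | decode_rgb1(Rest)];
decode_rgb1([]) -> [].

%% Binary-based version: matches on and appends to binaries directly.
decode_rgb_bin(Pixels) -> decode_rgb_bin(Pixels, <<>>).

decode_rgb_bin(<<255,0,255, Rest/binary>>, Result) ->
    decode_rgb_bin(Rest, <<Result/binary,0,0,0,0>>);
decode_rgb_bin(<<R,G,B, Rest/binary>>, Result) ->
    decode_rgb_bin(Rest, <<Result/binary,R,G,B,255>>);
decode_rgb_bin(<<>>, Result) -> Result.

%% True when both versions produce the same 32-bit pixels.
same_output(Pixels) ->
    decode_rgb_list(Pixels) =:= decode_rgb_bin(Pixels).
```

Calling same_output(<<255,0,255, 10,20,30>>) exercises both the magic-number clause and the ordinary clause.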

But we can do better with a small change to the specification. Remember, decode_rgb is a translation from 24-bit to 32-bit pixels. When the input pixel is a magic number--255,0,255--the alpha channel of the output is set to zero, indicating transparency. All other pixels have the alpha set to 255, which is fully opaque. If you look at the code, you'll see that the 255,0,255 pixels actually get turned into 0,0,0,0 instead of 255,0,255,0. There's no real reason for that. In fact, if we go with the simpler approach of only changing the alpha value, then decode_rgb can be written in an amazingly clean way:

decode_rgb(Pixels) ->
   [<<R,G,B,(alpha(R,G,B))>> || <<R,G,B>> <= Pixels].

alpha(255, 0, 255) -> 0;
alpha(_, _, _) -> 255.

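This version uses a bitstring generator (the <= in the comprehension), part of the bitstring comprehension support added in Erlang R12B. As written, with list brackets, it returns a list of four-byte binaries rather than one flat binary. It's hard to imagine writing this with any less code.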
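If a single flat binary is wanted, as the earlier versions returned, the outer brackets can be swapped for binary brackets. This is a sketch of that variant, not a measured improvement; the module name is invented and I make no claims about its speed.

```erlang
-module(rgb_comprehension_sketch).
-export([decode_rgb/1]).

%% Binary comprehension: walks Pixels three bytes at a time and
%% builds one binary of four-byte pixels.
decode_rgb(Pixels) ->
    << <<R,G,B,(alpha(R,G,B))>> || <<R,G,B>> <= Pixels >>.

%% The magic color 255,0,255 is fully transparent; all else is opaque.
alpha(255, 0, 255) -> 0;
alpha(_, _, _) -> 255.
```

Note the space after the opening << of the comprehension; without it the tokenizer sees <<<< and the code won't parse as intended.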

(Also see the follow-up.)

December 16, 2007