I'm James Hague, a recovering programmer who has been designing video games since the 1980s. This is Why You Spent All that Time Learning to Program and The Pure Tech Side is the Dark Side are good places to start.
Where are the comments?
Timings and the Punchline
I forgot two things in Revisiting "Programming as if Performance Mattered": exact timings of the different versions of the code and a punchline. I'll do the timings first.
timer:tc falls apart once code gets too fast. A classic sign of this is running consecutive timings and getting back a sequence of numbers like 15000, 31000, 31000, 15000. At this point you should write a loop to execute the test function, say, 100 times, then divide the total execution time by 100. This smooths out interruptions for garbage collection, system processes, and so on.
And now the timings (lower is better). The TGA image decoder with the clunky binary / list / binary implementation of
decode_rgb, on the same sample image I used in 2004:
(Yes, this is larger than the original 15,000 I reported, because it's an average, not the result of one or two runs.) The recursive version operating directly on binaries:
The ultra-slick version using binary comprehensions:
I think the punchline is obvious at this point.
Were I using this module in production code, I'd do one of three things. If I'm only decoding a handful of images here and there, then this whole discussion is irrelevant. The Erlang code is more than fast enough. If image decoding is a huge bottleneck, I'd move the hotspot,
decode_rgb into a small linked-in driver. Or, and the cries of cheating may be justified here, I'd remove
Remember, transparent pixels runs at the start and end of each row are already detected elsewhere.
decode_rgb blows up the runs in the middle from 24-bit to 32-bit. At some point this needs to be done, but it may just be that it doesn't need to happen at the Erlang level at all. If the pixel data is passed off to another non-Erlang process anyway, maybe for rendering or for printing or some other operation, then there's no reason the compressed 24-bit data can't be passed off directly. That fits the style I've been using for this whole module, of operating on compressed data without a separate decompression step.
But now we're getting into useless territory: quibbling over microseconds without any actual context. You can't feel the difference between any of the optimized versions of the code I presented last time, and so it doesn't matter.