Timings and the Punchline

I forgot two things in Revisiting "Programming as if Performance Mattered": exact timings of the different versions of the code and a punchline. I'll do the timings first.

timer:tc falls apart once code gets too fast. A classic sign of this is running consecutive timings and getting back a sequence of numbers like 15000, 31000, 31000, 15000. At this point you should write a loop to execute the test function, say, 100 times, then divide the total execution time by 100. This smooths out interruptions for garbage collection, system processes, and so on.
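The averaging loop described above might look something like the following. This is only a minimal sketch: the module name bench and the function avg_time/2 are made up for illustration and are not part of the TGA decoder. It relies on timer:tc/1, which reports elapsed time in microseconds.

```erlang
-module(bench).
-export([avg_time/2]).

%% Run the zero-argument fun Fun N times inside a single timer:tc call
%% and return the mean time per call, in microseconds (as a float).
%% The names here are hypothetical, chosen only for this sketch.
avg_time(Fun, N) when is_function(Fun, 0), is_integer(N), N > 0 ->
    {TotalMicros, ok} = timer:tc(fun() -> repeat(Fun, N) end),
    TotalMicros / N.

%% Call Fun N times, discarding its results.
repeat(_Fun, 0) ->
    ok;
repeat(Fun, N) ->
    Fun(),
    repeat(Fun, N - 1).
```

Wrapping the decoder call in a fun and passing 100 as the second argument gives the kind of averaged figure reported below. Timing the whole loop at once, rather than each call separately, keeps the timer's coarse resolution from dominating the result.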

And now the timings (lower is better). The TGA image decoder with the clunky binary / list / binary implementation of decode_rgb, on the same sample image I used in 2004:
16,720 microseconds
(Yes, this is larger than the original 15,000 I reported, because it's an average, not the result of one or two runs.) The recursive version operating directly on binaries:
18,700 microseconds
The ultra-slick version using binary comprehensions:
22,600 microseconds
I think the punchline is obvious at this point.

Were I using this module in production code, I'd do one of three things. If I'm only decoding a handful of images here and there, then this whole discussion is irrelevant. The Erlang code is more than fast enough. If image decoding is a huge bottleneck, I'd move the hotspot, decode_rgb, into a small linked-in driver. Or, and the cries of cheating may be justified here, I'd remove decode_rgb completely.

Remember, runs of transparent pixels at the start and end of each row are already detected elsewhere. decode_rgb blows up the runs in the middle from 24-bit to 32-bit. At some point this needs to be done, but it may just be that it doesn't need to happen at the Erlang level at all. If the pixel data is passed off to another non-Erlang process anyway, maybe for rendering or for printing or some other operation, then there's no reason the compressed 24-bit data can't be passed off directly. That fits the style I've been using for this whole module, of operating on compressed data without a separate decompression step.
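For reference, the 24-bit to 32-bit expansion in question is roughly the following. This is an illustrative sketch, not the decode_rgb from the earlier post: the module and function names are invented, the channel order is left exactly as it arrives, and the opaque alpha value of 255 is an assumption.

```erlang
-module(expand_sketch).
-export([expand/1]).

%% Expand packed 24-bit pixels to 32-bit by appending an alpha byte
%% to every three-byte pixel. The three channel bytes are copied
%% through unchanged; the alpha value 255 (opaque) is an assumption.
expand(Pixels) when is_binary(Pixels), byte_size(Pixels) rem 3 =:= 0 ->
    << <<A, B, C, 255>> || <<A, B, C>> <= Pixels >>.
```

Skipping this step on the Erlang side means handing the three-bytes-per-pixel binary to the consumer as is and letting it add the fourth byte, if it needs one at all.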

But now we're getting into useless territory: quibbling over microseconds without any actual context. You can't feel the difference between any of the optimized versions of the code I presented last time, and so it doesn't matter.