Optimization in the Twenty-First Century

I know, I know, don't optimize. Reduce algorithmic complexity and don't waste time on low-level noise. Or embrace the low-level and take advantage of magical machine instructions rarely emitted by compilers. Most of the literature on optimization focuses on these three recommendations, but in many cases they're no longer the best place to start. Gone are the days when you could look like a superstar by replacing long, linear lookups with a hash table. Everyone is already using the hash table from the get-go, because it's so easy.

And yet developers are still having performance problems, even on systems that are hundreds, thousands, or even over a hundred-thousand times faster than those which came before. Here's a short guide to speeding up applications in the modern world.

Get rid of the code you didn't need to write in the first place. Early programming courses emphasize writing lots of code, not avoiding it, and it's a hard habit to break. The first program you ever wrote was something like "Hello World!" It should have looked like this:
Hello world!
There's no code. I just typed "Hello world!" Why would anyone write a program for that when it's longer than typing the answer? Similarly, why would anyone compute a list of prime numbers at runtime--using some kind of sieve algorithm, for example--when you can copy a list of pre-generated primes? There are lots of applications out there with, um, factory manager caching classes in them that sounded great on paper, but interfacing with the extra layer of abstraction is more complex than what life was like before writing those classes. Don't write that stuff until you've tried to live without it and fully understand why you need it.
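To make the primes example concrete, here's a minimal Python sketch. It assumes you only ever need primes below 100, and the names `SMALL_PRIMES` and `is_small_prime` are made up for illustration. There's no sieve; the list is just pasted in.

```python
# Primes below 100, pasted in as data rather than computed at runtime.
SMALL_PRIMES = frozenset({
    2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
    43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97,
})

def is_small_prime(n):
    """Return True if n is one of the pasted-in primes (hypothetical helper)."""
    return n in SMALL_PRIMES
```

If you later find you need primes beyond the table, that's the point at which writing the sieve starts to earn its keep.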

Fix that one big, dumb thing. There are some performance traps that look like everyday code, but can absorb hundreds of millions of cycles, or more. Maybe the most common is a function that manipulates long strings, adding new stuff to the end inside a loop. But, uh-oh, strings are immutable in many languages, so each of these append operations causes the entire multi-megabyte string to be copied.
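Here's roughly what the string trap and one common way out look like in Python. This is a sketch, not a benchmark: the function names are hypothetical, and CPython can sometimes extend a string in place, so the slow version isn't guaranteed to copy on every pass. It's still the pattern to be suspicious of.

```python
def build_report_slow(lines):
    """Append to a string in a loop; each += may copy everything built so far."""
    report = ""
    for line in lines:
        report += line + "\n"
    return report

def build_report_fast(lines):
    """Collect the pieces in a list and join them once at the end."""
    pieces = []
    for line in lines:
        pieces.append(line)
        pieces.append("\n")
    return "".join(pieces)
```

Both functions return the same string. The second one does its copying in a single pass at the end instead of potentially once per iteration.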

It's also surprisingly easy to unintentionally cause the CPU and GPU to become synchronized, where one is waiting for the other. This is why reducing the number of times you hand off vertex data to OpenGL or DirectX is a big deal. Sending a lone triangle to the GPU can be as expensive as rendering a thousand triangles. A more obscure gotcha is that writing to an OpenGL vertex buffer you've already sent off for rendering will stall the CPU until the drawing is complete.

Shrink your data. Smallness equals performance on modern hardware. You'll almost always win if you take steps to reduce the size of your data. More fits into cache. The garbage collector has less to trace through and copy around. Can you represent a color as an RGB tuple instead of a dictionary with the named elements "red", "green", and "blue"? Can you replace a bulky structure containing dozens of fields with a simpler, symbolic representation? Are you duplicating data that you could trivially compute from other values?
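For the color question, a quick way to see the difference in Python is to ask the interpreter how big each container is. This is a rough sketch assuming CPython: `sys.getsizeof` only counts the container itself, not the integers or key strings it points to, and the exact byte counts vary by version, so none are quoted here.

```python
import sys

# The same color, stored two ways.
color_as_dict = {"red": 255, "green": 128, "blue": 0}
color_as_tuple = (255, 128, 0)

# Size of each container object alone, in bytes.
dict_bytes = sys.getsizeof(color_as_dict)
tuple_bytes = sys.getsizeof(color_as_tuple)

print(f"dict: {dict_bytes} bytes, tuple: {tuple_bytes} bytes")
```

Multiply whatever gap you see by a few million pixels or particles and it becomes clear why the smaller representation tends to win.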

As an aside, the best across-the-board compilation option for most C/C++ compilers is "compile for size." That gets rid of optimizations that look good in toy benchmarks, but have a disproportionately high memory cost. If this saves you 20K in a medium-sized program, that's way more valuable for performance than any of those high-end optimizations would be.

Concurrency often gives better results than speeding up sequential code. Imagine you've written a photo editing app, and there's an export option where all the filters and lighting adjustments get baked into a JPEG. It takes about three seconds, which isn't bad in an absolute sense, but it's a long time for an interactive program to be unresponsive. With concerted effort you can knock a few tenths of a second off that, but the big win comes from realizing that you don't need to wait for the export to complete before continuing. It can be handled in a separate thread that's likely running on a different CPU core. To the user, exporting is now instantaneous.
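Here's a minimal sketch of that export idea using Python's standard `concurrent.futures` module. The `export_jpeg` function is a made-up stand-in that just sleeps, and `start_export` is a hypothetical helper. One caveat is specific to Python: in CPython, a CPU-bound export written in pure Python still shares the interpreter lock with the main thread, so a real app would likely lean on a native imaging library or a separate process to get onto another core.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def export_jpeg(photo_name):
    """Stand-in for the slow export; a real one would apply filters and encode."""
    time.sleep(0.05)  # pretend this is the three-second export
    return photo_name + ".jpg"

# A single background worker handles one export at a time.
executor = ThreadPoolExecutor(max_workers=1)

def start_export(photo_name):
    """Kick off the export and return immediately with a Future."""
    return executor.submit(export_jpeg, photo_name)

future = start_export("vacation")
# The caller is free to keep responding to the user here.
exported_file = future.result()  # block only when the result is actually needed
executor.shutdown()
```

The user-facing code never calls `export_jpeg` directly. It gets a `Future` back right away and only waits on it if and when it needs the finished file.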

(If you liked this, you might enjoy Use and Abuse of Garbage Collected Languages.)