Two Stories of Simplicity

In response to Sending modern languages back to 1980s game programmers, one of the questions I received was "Did any 8-bit coders ever use more powerful computers for development?" Sure! The VAX and PDP-11 and other minicomputers were available at the time, though expensive, and some major developers made good use of them, cross-compiling code for the lowly Atari 800 and Apple II. But there was something surprising about some of these systems:

It was often slower to cross-assemble a program on a significantly higher-specced machine like the VAX than it was to do the assembly on a stock 8-bit home computer.

Part of the reason is that multiple people were sharing a single VAX, working simultaneously, but the Apple II user had the whole CPU available for a single task. There was also the process of transferring the cross-assembled code to the target hardware, and this went away if the code was actually built on the target. And then there were inefficiencies that built up because the VAX was designed for large-scale work: more expensive I/O libraries, more use of general purpose tools and code.

For example, a VAX-hosted assembler might dynamically allocate symbols and other data on the heap, something typically not used on a 64K home computer. Now a heap manager--what malloc sits on top of--isn't a trivial bit of code. More importantly, you usually can't predict how much time a request for a block of memory will take to fulfill. Sometimes it may be almost instantaneous, other times it may take thousands of cycles, depending on the algorithms used and current state of the heap. Meanwhile, on the 8-bit machine, those thousands of cycles are going directly toward productive work, not solving the useful but tangential problem of how to effectively manage a heap.

So in the end there were programmers with these little 8-bit machines outperforming minicomputers costing hundreds of thousands of dollars.

That ends the first story.

When I first started programming the Macintosh, right after the switch to PowerPC processors in the mid-1990s, I was paranoid about system calls. I knew the system memory allocation routines were unpredictable and should be avoided in performance-oriented code. I'd noticeably sped up a commercial application by dodging a system "Is point in an arbitrarily complex region?" function. It was in this mindset that I decided to steer clear of Apple's BlockMove function--the MacOS equivalent of memcpy--and write my own.

The easy way to write a fast memory copier is to move as much data at a time as possible. 32-bit values are better than 8-bit values. The problem with using 32-bit values exclusively is that there are alignment issues. If the source address isn't aligned on a four-byte boundary, it's almost as bad as copying 8-bits at a time. BlockMove contained logic to handle misaligned addresses, breaking things into two steps: individual byte moves until the source address was properly aligned, then 32-bit copies from that point on. My plan was that if I always guaranteed that the source and destination addresses were properly aligned, then I could avoid all the special-case address checks and have a simple loop reading and writing 32-bits at a time.

(It was also possible to read and write 64-bit values, even on the original PowerPC chips, using 64-bit floating point registers. But even though this looked good on paper, floating point loads and stores had a slightly longer latency than integer loads and stores.)

I had written a very short, very concise aligned memory copy function, one that clearly involved less code than Apple's BlockMove.

Except that BlockMove was faster. Not just by a little, but 30% faster for medium to large copies.

I eventually figured out the reason for this by disassembling BlockMove. It was even more convoluted than I expected in terms of handling alignment issues. It also had a check for overlapping source and destination blocks--more bloat from my point of view. But there was a nifty trick in there that I never would have figured out on my own.

Let's say that a one megabyte block of data is being copied from one place to another. During the copy loop the data at the source and destination addresses is constantly getting loaded into the cache, 32 bytes at a time (the size of a cache line on early PowerPC chips), two megabytes of cache loads in all.

If you think about this, there's one flaw: all the data from the destination is loaded into the cache...and then it's immediately overwritten by source data. BlockMove contained code to align addresses to 32 byte cache lines, then in the inner copy loop used a special instruction to avoid loading the destination data, setting an entire cache line to zeros instead. For every 32 bytes of data, my code involved two cache line reads and one cache line write. The clever engineer who wrote BlockMove removed one of these reads, resulting in a 30% improvement over my code. This is even though BlockMove was pages of alignment checks and special cases, instead of my minimalist function.

There you go: one case where simpler was clearly better, and one case where it wasn't.