Garbage Collection in Erlang

Given its "soft real time" label, I expected Erlang to use some fancy incremental garbage collection approach. And indeed, such an approach exists, but it's slower than traditional GC in practice (because it touches the entire the heap, not just the live data). In reality, garbage collection in Erlang is fairly vanilla. Processes start out using a straightforward compacting collector. If a process gets large, it is automatically switched over to a generational scheme. The generational collector is simpler than in some languages, because there's no way to have an older generation pointing to data in a younger generation (remember, you can't destructively modify a list or tuple in Erlang).

The key is that garbage collection in Erlang is per process. A system may have tens of thousands of processes, using a gigabyte of memory overall, but if GC occurs in a process with a 20K heap, then the collector only touches that 20K and collection time is imperceptible. With lots of small processes, you can think of this as a truly incremental collector. But there's still a lurking worst case in Erlang: What if all of those processes run out of memory more or less in the same wall-clock moment? And there's nothing preventing an application from using one massive process (such is the case with the Wings 3D modeller).

Per-process GC allows a slick technique that can completely prevent garbage collection in some circumstances. Using spawn_opt instead of the more common spawn, you can specify the initial heap size for a process. If you know, as discovered through profiling, that a process rapidly grows up to 200K and then terminates, you can give that process an initial heap size of 200K. Data keeps getting added to the end of the heap, and then before garbage collection kicks in, the process heap is deleted and its contents are never scanned.

The other pragmatic approach to reducing the cost of garbage collection in Erlang is that lots of data is kept outside of the per-process heaps:

Binaries > 64 bytes. Large binaries are allocated in a separate heap outside the scope of a process. Binaries can't, by definition, contain pointers to other data, so they're reference counted. If there's a 50MB binary loaded, it's guaranteed never to be copied as part of garbage collection.

Data stored in ETS tables. When you look up key in an ETS table, the data associated with that key is copied into the heap for the process the request originated from. For structurally large values (say, a tuple of 500 elements) the copy from ETS table space to the process heap may become expensive, but if there's 100MB of total data in a table, there's no risk of all that data being scanned at once by a garbage collector.

Data structure constants. This is new in Erlang.

Atom names. Atom name strings are stored in a separate data area and are not garbage collected. In Lisp, it's common for symbol names to be stored on the main heap, which adds to garbage collection time. But that also means that dynamically creating symbols in Lisp is a reasonable approach to some problems, but it's not something you want to do in Erlang.

permalink January 6, 2008

Garbage Collection in Erlang

previously

archives