Exploring Audio Files with Erlang

It takes surprisingly little Erlang code to dig into the contents of an uncompressed audio file. And it turns out that three of the most common uncompressed audio file formats--WAV, AIFF, and Apple's CAF--all follow the same general structure. Once you understand the basics of one, it's easy to deal with the others. AIFF is the trickiest of the three, so that's the one I'll use as an example.

First, load the entire file into a binary:
load(Filename) ->
    {ok, B} = file:read_file(Filename),
    B.
There's a small header: four characters spelling out "FORM", a 32-bit length that can be ignored here, then four more characters spelling out "AIFF". The interesting part is the rest of the file, so let's just validate the header and put everything after it into a binary called B:
<<"FORM", _:32, "AIFF", B/binary>> = load(Filename).
The "rest of file" binary is broken into chunks that follow a simple format: a four-character chunk name, a 32-bit length of the data in the chunk (not counting the eight-byte chunk header itself), and then the data. Here's a little function that breaks a binary into a list of {Chunk_Name, Contents} pairs:
chunkify(Binary) -> chunkify(Binary, []).

chunkify(<<N1,N2,N3,N4, Len:32, Data:Len/binary, Rest/binary>>, Chunks) ->
    Name = list_to_atom([N1,N2,N3,N4]),
    chunkify(adjust(Len, Rest), [{Name, Data}|Chunks]);
chunkify(<<>>, Chunks) ->
    Chunks.
Ignore the adjust function for now; I'll get back to that.

Given the results of chunkify, it's easy to find a specific chunk using lists:keyfind/3. Really, though, other than to test the chunkification code, there's rarely a reason to iterate through all the chunks in a file. It's nicer to return a function that makes lookups easy. Replace the body of the last clause of chunkify (the bare Chunks) with this:
fun(Name) -> element(2, lists:keyfind(Name, 1, Chunks)) end.
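To see the lookup function in action without a real audio file, here's a minimal sketch that runs chunkify over a hand-built binary. The module name chunk_demo, the chunk names 'ODD ' and 'EVEN', and the demo function are made up for illustration, and adjust is written here as a small stand-in for the pad-byte skipper described further down.

```erlang
-module(chunk_demo).
-export([chunkify/1, demo/0]).

%% Same shape as the chunkify in the text, returning the lookup fun.
chunkify(Binary) -> chunkify(Binary, []).

chunkify(<<N1,N2,N3,N4, Len:32, Data:Len/binary, Rest/binary>>, Chunks) ->
    Name = list_to_atom([N1,N2,N3,N4]),
    chunkify(adjust(Len, Rest), [{Name, Data}|Chunks]);
chunkify(<<>>, Chunks) ->
    fun(Name) -> element(2, lists:keyfind(Name, 1, Chunks)) end.

%% Stand-in for the article's adjust: skip the pad byte after an odd-length chunk.
adjust(Len, <<_:8, Rest/binary>>) when Len band 1 =:= 1 -> Rest;
adjust(_Len, Rest) -> Rest.

demo() ->
    %% Two hypothetical chunks. 'ODD ' holds 3 bytes, so one pad byte follows it.
    Bin = <<"ODD ", 3:32, 1,2,3, 0,
            "EVEN", 2:32, 4,5>>,
    Chunks = chunkify(Bin),
    {Chunks('ODD '), Chunks('EVEN')}.
```

Calling chunk_demo:demo() should give back {<<1,2,3>>, <<4,5>>}. One thing to keep in mind: lists:keyfind/3 returns false when the name isn't found, so asking this fun for a missing chunk fails with a badarg from element/2 rather than returning something you can test.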
The key info about sample rates and number of channels and all that is in a chunk called COMM, and now we've got an easy way to get at and decode that chunk:
Chunks = chunkify(B).
<<Channels:16, Frames:32, Sample_Size:16, Rate:10/binary>> = Chunks('COMM').
The sound samples themselves are in a chunk called SSND. The first eight bytes of that chunk (an offset and a block size, both normally zero) don't matter, so to decode that chunk it's just:
<<_:8/binary, Samples/binary>> = Chunks('SSND').
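The Samples binary is still raw bytes at this point. As a rough sketch of the next step, here's one way to turn it into numbers, assuming the usual uncompressed AIFF layout: signed, big-endian samples, interleaved by channel, with a sample size that's a multiple of eight bits. The module samples_demo and the frames function are my own names, not part of the decoder above.

```erlang
-module(samples_demo).
-export([frames/3, demo/0]).

%% Split interleaved samples into a list of frames, one list of
%% per-channel values per frame. Assumes signed big-endian samples
%% (big-endian is the bit syntax default) and whole-byte sample sizes.
frames(Samples, Channels, Sample_Size) ->
    Frame_Bits = Channels * Sample_Size,
    [[S || <<S:Sample_Size/signed>> <= Frame]
     || <<Frame:Frame_Bits/bitstring>> <= Samples].

demo() ->
    %% Two stereo frames of 16-bit samples, written out by hand.
    Samples = <<0,1, 255,255, 127,255, 128,0>>,
    frames(Samples, 2, 16).
```

Here samples_demo:demo() should return [[1,-1],[32767,-32768]]: two frames, each holding a left and a right value.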
Okay, now the few weird bits of the AIFF format. First, if the size of a chunk is odd, then there's one extra pad byte following it. That's what the adjust function is for. It checks whether the chunk length is odd and, if so, removes the pad byte before decoding the rest of the binary. The second quirk is that the sample rate is encoded as a ten-byte extended floating point value, and most languages, Erlang included, don't have support for that type. There's an algorithm in the AIFF spec for encoding and decoding extended floats, and I translated it into Erlang.
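A worked example makes the extended float less mysterious. The sketch below runs the integer decoder from the complete listing on the ten bytes that encode 44100 and 22050, and cross-checks it against a more general conversion. That second function, extended_to_float, is my addition rather than anything from the AIFF spec's algorithm: it assumes the standard 80-bit layout (1 sign bit, a 15-bit exponent biased by 16383, and a 64-bit mantissa with an explicit leading bit) and ignores zeros, denormals, infinities and NaNs. The module name ext_demo is also made up.

```erlang
-module(ext_demo).
-export([extended_to_int/1, extended_to_float/1, demo/0]).

%% The article's decoder: it looks only at the low exponent byte and the
%% top 32 mantissa bits, shifts right, and rounds on the last bit shifted out.
extended_to_int(<<_, Exp, Mantissa:32, _:4/binary>>) ->
    extended_to_int(30 - Exp, Mantissa, 0).

extended_to_int(0, Mantissa, Last) ->
    Mantissa + (Last band 1);
extended_to_int(Exp, Mantissa, _Last) ->
    extended_to_int(Exp - 1, Mantissa bsr 1, Mantissa).

%% Cross-check (an assumption, see the text above): treat the value as
%% Mantissa * 2^(Exp - 16383 - 63). Special values are not handled.
extended_to_float(<<Sign:1, Exp:15, Mantissa:64>>) ->
    Value = Mantissa * math:pow(2, Exp - 16383 - 63),
    case Sign of
        0 -> Value;
        1 -> -Value
    end.

demo() ->
    Rate44 = <<16#40, 16#0E, 16#AC, 16#44, 0:48>>,   % 44100 Hz
    Rate22 = <<16#40, 16#0D, 16#AC, 16#44, 0:48>>,   % 22050 Hz
    {extended_to_int(Rate44), extended_to_int(Rate22), extended_to_float(Rate44)}.
```

For 44100 the exponent byte is 14, so the decoder shifts 16#AC440000 right by 30 - 14 = 16 bits and lands on 16#AC44, which is 44100. ext_demo:demo() should return {44100, 22050, 44100.0}. The integer version relies on the exponent byte being 30 or less, which holds for any ordinary sample rate.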

Here's the complete code for the AIFF decoder:
load_aiff(Filename) ->
    <<"FORM", _:32, "AIFF", B/binary>> = load(Filename),
    Chunks = chunkify(B),
    <<Channels:16, Frames:32, Sample_Size:16, Rate:10/binary>> = Chunks('COMM'),
    <<_:8/binary, Samples/binary>> = Chunks('SSND'),
    {Channels, Frames, Sample_Size, extended_to_int(Rate), Samples}.

%% Split the binary into chunks and return a lookup fun.
chunkify(Binary) -> chunkify(Binary, []).

chunkify(<<N1,N2,N3,N4, Length:32, Data:Length/binary, Rest/binary>>, Chunks) ->
    Name = list_to_atom([N1,N2,N3,N4]),
    chunkify(adjust(Length, Rest), [{Name, Data}|Chunks]);
chunkify(<<>>, Chunks) ->
    fun(Name) -> element(2, lists:keyfind(Name, 1, Chunks)) end.

%% Drop the pad byte that follows an odd-length chunk.
adjust(Length, B) ->
    case Length band 1 of
        1 -> <<_:8, Rest/binary>> = B, Rest;
        _ -> B
    end.

%% Convert the ten-byte extended float to an integer,
%% rounding on the last bit shifted out.
extended_to_int(<<_, Exp, Mantissa:32, _:4/binary>>) ->
    extended_to_int(30 - Exp, Mantissa, 0).

extended_to_int(0, Mantissa, Last) ->
    Mantissa + (Last band 1);
extended_to_int(Exp, Mantissa, _Last) ->
    extended_to_int(Exp - 1, Mantissa bsr 1, Mantissa).

load(Filename) ->
    {ok, B} = file:read_file(Filename),
    B.

WAV and CAF both follow the same general structure of a header followed by chunks. WAV uses little-endian values, while the other two are big-endian. CAF doesn't have chunk alignment requirements, so that removes the need for adjust. And fortunately it's only AIFF that requires that ugly conversion from extended floating point in order to get the sample rate.
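To give a feel for how little changes, here's a sketch of the same approach applied to WAV. It isn't part of the decoder above, and it assumes the plain PCM layout: a "RIFF" header with "WAVE" as the form type, little-endian chunk lengths, a 'fmt ' chunk that begins with format, channels, sample rate, byte rate, block align and bits per sample, and a 'data' chunk holding the samples. The module name wav_sketch and its functions are my own, and the frame count is derived from the data size because the 'fmt ' chunk doesn't carry one.

```erlang
-module(wav_sketch).
-export([load_wav/1, parse_wav/1, demo/0]).

load_wav(Filename) ->
    {ok, Bin} = file:read_file(Filename),
    parse_wav(Bin).

%% Returns the same tuple shape as load_aiff in the article.
parse_wav(<<"RIFF", _:32, "WAVE", B/binary>>) ->
    Chunks = chunkify(B, []),
    <<_Format:16/little, Channels:16/little, Rate:32/little,
      _Byte_Rate:32/little, Block_Align:16/little,
      Sample_Size:16/little, _/binary>> = Chunks('fmt '),
    Samples = Chunks('data'),
    Frames = byte_size(Samples) div Block_Align,
    {Channels, Frames, Sample_Size, Rate, Samples}.

%% Same chunk walk as the AIFF version, but with a little-endian length.
chunkify(<<N1,N2,N3,N4, Len:32/little, Data:Len/binary, Rest/binary>>, Chunks) ->
    Name = list_to_atom([N1,N2,N3,N4]),
    chunkify(adjust(Len, Rest), [{Name, Data}|Chunks]);
chunkify(<<>>, Chunks) ->
    fun(Name) -> element(2, lists:keyfind(Name, 1, Chunks)) end.

%% Odd-length chunks are followed by a pad byte, as in AIFF.
adjust(Len, <<_:8, Rest/binary>>) when Len band 1 =:= 1 -> Rest;
adjust(_Len, Rest) -> Rest.

demo() ->
    %% A tiny hand-built file: 16-bit stereo at 44100 Hz, one frame of data.
    Fmt = <<1:16/little, 2:16/little, 44100:32/little,
            176400:32/little, 4:16/little, 16:16/little>>,
    Data = <<1,0, 255,255>>,
    Body = <<"WAVE", "fmt ", 16:32/little, Fmt/binary,
             "data", 4:32/little, Data/binary>>,
    parse_wav(<<"RIFF", (byte_size(Body)):32/little, Body/binary>>).
```

wav_sketch:demo() should return {2, 1, 16, 44100, <<1,0,255,255>>}. Note that the sample data in a WAV file is little-endian too, so decoding individual samples needs the little flag as well.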