<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Programming in the 21st Century</title>
<link rel="self" href="http://prog21.dadgum.com/atom.xml" />
<link rel="alternate" href="http://prog21.dadgum.com/" />
<id>http://prog21.dadgum.com/</id>
<updated>2008-06-29T00:00:00-06:00</updated>
<entry>
<title>Want to Write a Compiler?  Just Read These Two Papers.</title>
<link rel="alternate" type="text/html" href="http://prog21.dadgum.com/30.html" />
<id>http://prog21.dadgum.com/30.html</id>
<published>2008-06-29T00:00:00-06:00</published>
<updated>2008-06-29T00:00:00-06:00</updated>
<author><name>James Hague</name></author>
<content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">Imagine you don't know <i>anything</i> about programming, and you want to learn how to do it.  You take a look at Amazon.com, and there's a highly recommended set of books by Knute or something with a promising title, <i>The Art of Computer Programming</i>, so you buy them.  Now imagine that it's more than just a poor choice, but that <b>all</b> the books on programming are written at that level.
<br/><br/>That's the situation with books about writing compilers.
<br/><br/>It's not that they're bad books, they're just too broadly scoped, and the authors present so much information that it's hard to know where to begin.  Some books are better than others, but there are still the thick chapters about converting regular expressions into executable state machines and different types of grammars and so on. After slogging through it all you will have undoubtedly expanded your knowledge, but you're no closer to actually writing a working compiler.
<br/><br/>Not surprisingly, the opaqueness of these books has led to the myth that compilers are hard to write.
<br/><br/>The best source for breaking this myth is Jack Crenshaw's series, <a href="http://compilers.iecc.com/crenshaw/">Let's Write a Compiler!</a>, which started in 1988.  This is one of those gems of technical writing where what's assumed to be a complex topic ends up being suitable for a first year programming class.  He focuses on compilers of the Turbo Pascal class: single pass, parsing and code generation are intermingled, and only the most basic of optimizations are applied to the resulting code.  The original tutorials used Pascal as the implementation language, but there's a C version out there, too.  If you're truly adventurous, Marcel Hendrix has done a <a href="http://home.iae.nl/users/mhx/crenshaw/tiny.html">Forth translation</a> (and as Forth is an interactive language, it's easier to experiment with and understand than the C or Pascal sources).
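<br/><br/>To make "parsing and code generation are intermingled" concrete, here is a minimal sketch in Python rather than Crenshaw's Pascal.  It is not his code: the function name and the PUSH/ADD/SUB/MUL/DIV instruction names are made up for illustration, and it targets an imaginary stack machine instead of real assembly.  The point is only that each parsing routine emits code the moment it recognizes something, and no tree is ever built.

```python
def compile_expression(source):
    """Compile an integer arithmetic expression to stack-machine code.

    Single pass: each parsing routine appends instructions as soon as it
    recognizes a construct. The instruction names are hypothetical.
    """
    chars = source.replace(" ", "")
    pos = 0
    code = []

    def peek():
        return chars[pos:pos + 1]  # "" once the input is exhausted

    def advance():
        nonlocal pos
        pos += 1

    def factor():
        # factor := number | "(" expression ")"
        if peek() == "(":
            advance()
            expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
        elif peek().isdigit():
            start = pos
            while peek().isdigit():
                advance()
            code.append("PUSH " + chars[start:pos])
        else:
            raise SyntaxError("unexpected %r" % peek())

    def term():
        # term := factor (("*" | "/") factor)*
        factor()
        while peek() in ("*", "/"):
            op = peek()
            advance()
            factor()
            code.append("MUL" if op == "*" else "DIV")

    def expression():
        # expression := term (("+" | "-") term)*
        term()
        while peek() in ("+", "-"):
            op = peek()
            advance()
            term()
            code.append("ADD" if op == "+" else "SUB")

    expression()
    if peek():
        raise SyntaxError("unexpected %r" % peek())
    return code
```

Calling compile_expression("2+3*4") returns ['PUSH 2', 'PUSH 3', 'PUSH 4', 'MUL', 'ADD'].  Operator precedence falls out of which routine calls which, and that is about all there is to it.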
<br/><br/>As good as it is, Crenshaw's series has two major omissions.  The first is that there's no internal representation of the program at all.  That is, no abstract syntax tree.  It is indeed possible to bypass this step if you're willing to give up flexibility, but the main reason it's not in the tutorials is because manipulating trees in Pascal is out of sync with the simplicity of the rest of the code he presents.  If you're working in a higher level language--Python, Ruby, Erlang, Haskell, Lisp--then this worry goes away.  It's trivially easy to create and manipulate tree-like representations of data.  Indeed, this is what Lisp, Erlang, and Haskell were designed for.
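<br/><br/>As a rough illustration of how little ceremony a tree needs in one of these languages, here is a sketch in Python that holds 2 + 3 * 4 as nested tuples and walks it.  The "num", "add", and "mul" tags are names I picked for the example, not anything from Crenshaw's series.

```python
# The expression 2 + 3 * 4 as nested tuples: (tag, child, child).
# The tag names are hypothetical; any consistent labels would do.
tree = ("add", ("num", 2), ("mul", ("num", 3), ("num", 4)))


def evaluate(node):
    """Walk the tree recursively and compute its value."""
    tag = node[0]
    if tag == "num":
        return node[1]
    left = evaluate(node[1])
    right = evaluate(node[2])
    if tag == "add":
        return left + right
    if tag == "mul":
        return left * right
    raise ValueError("unknown node: %r" % (tag,))


def count_nodes(node):
    """Count every node in the tree, leaves included."""
    if node[0] == "num":
        return 1
    return 1 + count_nodes(node[1]) + count_nodes(node[2])
```

Here evaluate(tree) gives 14 and count_nodes(tree) gives 5.  The tree is an ordinary value that can be printed, compared, and rebuilt, with no node classes or memory management in sight.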
<br/><br/>That brings me to <a href="http://www.cs.indiana.edu/~dyb/pubs/nano-jfp.pdf">A Nanopass Framework for Compiler Education</a> [PDF] by Sarkar, Waddell, and Dybvig.  The details of this paper aren't quite as important as the general concept:  a compiler is nothing more than a series of transformations of the internal representation of a program.  The authors promote using <b>dozens or hundreds of compiler passes</b>, each being as simple as possible.  Don't combine transformations; keep them separate.  The framework mentioned in the title is a way of specifying the inputs and outputs for each pass.  The code is in Scheme, which is dynamically typed, so data is validated at runtime.
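<br/><br/>The general concept fits in a few lines.  What follows is a sketch in Python, not the Scheme framework from the paper: the pass names, the tuple tags, and the stack-machine output are all assumptions of mine, and there is none of the paper's checking of each pass's input and output.  Each pass does exactly one small thing, and the compiler is just the list of passes applied in order.

```python
from functools import reduce

# Binary operations the constant folder knows about (hypothetical tags).
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def map_children(pass_fn, node):
    """Rebuild a tuple node with pass_fn applied to each child tuple."""
    tag, *args = node
    return (tag, *[pass_fn(a) if isinstance(a, tuple) else a for a in args])


def desugar_neg(node):
    """Pass 1: rewrite ("neg", x) as ("sub", ("num", 0), x)."""
    node = map_children(desugar_neg, node)
    if node[0] == "neg":
        return ("sub", ("num", 0), node[1])
    return node


def fold_constants(node):
    """Pass 2: replace an operation on two literals with its result."""
    node = map_children(fold_constants, node)
    if node[0] in OPS and node[1][0] == "num" and node[2][0] == "num":
        return ("num", OPS[node[0]](node[1][1], node[2][1]))
    return node


def to_stack_code(node):
    """Pass 3: flatten the tree into stack-machine instructions."""
    if node[0] == "num":
        return ["PUSH %d" % node[1]]
    if node[0] == "var":
        return ["LOAD " + node[1]]
    return to_stack_code(node[1]) + to_stack_code(node[2]) + [node[0].upper()]


PASSES = [desugar_neg, fold_constants, to_stack_code]


def compile_tree(tree):
    """Run every pass in order, feeding each one the previous result."""
    return reduce(lambda ir, compiler_pass: compiler_pass(ir), PASSES, tree)
```

Compiling ("add", ("var", "x"), ("neg", ("mul", ("num", 3), ("num", 4)))) produces ['LOAD x', 'PUSH -12', 'ADD'].  Adding a new transformation means writing one more small function and putting it in the list, without touching the others.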
<br/><br/>After writing a compiler or two, then go ahead and plunk down the cash for the infamous <a href="http://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools">Dragon Book</a> or one of the alternatives.  Maybe.  Or you might not need them at all.
</div></content>
</entry>
<entry>
<title>A Spellchecker Used to Be a Major Feat of Software Engineering</title>
<link rel="alternate" type="text/html" href="http://prog21.dadgum.com/29.html" />
<id>http://prog21.dadgum.com/29.html</id>
<published>2008-06-08T00:00:00-06:00</published>
<updated>2008-06-08T00:00:00-06:00</updated>
<author><name>James Hague</name></author>
<content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">Here's the situation: it's 1984, and you're assigned to write the spellchecker for a new MS-DOS word processor.  Some users, but not many, will have 640K of memory in their PCs.  You need to support systems with as little as 256K.  That's a quarter megabyte to contain the word processor, the document being edited, and the memory needed by the operating system.  Oh, and the spellchecker.
<br/><br/>For reference, on my MacBook, the standard dictionary in /usr/share/dict/words is 2,486,813 bytes and contains 234,936 words.
<br/><br/>An enticing first option is a data format that's more compressed than raw text.  The UNIX dictionary contains <i>stop</i> and <i>stopped</i> and <i>stopping</i>, so there's a lot of repetition.  A clever trie implementation might do the trick...but we'll need a big decrease to go from 2+ megabytes to a hundred K or so.
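<br/><br/>For anyone who hasn't run into a trie, here is a small sketch in Python of the prefix sharing it buys.  This only shows the idea: nested Python dictionaries are nowhere near compact in memory, so this is not the kind of byte-level layout a 1984 spellchecker would need, and the function names and the "$" end-of-word marker are my own choices.

```python
def build_trie(words):
    """Build a nested-dict trie; the "$" key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = True
    return root


def contains(trie, word):
    """Return True only if word was inserted as a complete word."""
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return "$" in node


def count_letter_nodes(trie):
    """Count the letters actually stored, with shared prefixes counted once."""
    return sum(1 + count_letter_nodes(child)
               for key, child in trie.items() if key != "$")
```

Storing <i>stop</i>, <i>stopped</i>, and <i>stopping</i> as raw text takes 19 letters, while count_letter_nodes on the trie built from those three words reports 10.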
<br/><br/>In fact, even if we could represent each word in the spellchecker dictionary as a single byte, we'd need almost the full 256K just for that (234,936 bytes is about 229K), and of course the single byte representation isn't going to work.  So not only does keeping the whole dictionary in RAM look hopeless, but so does keeping the actual dictionary on disk with only an index in RAM.
<br/><br/>Now it gets messy.  We could try taking a subset of the dictionary, one containing the most common words, and heavily compressing that so it fits in memory.  Then we come up with a slower, disk-based mechanism for looking up the rest of the words.  Or maybe we jump directly to a completely disk-based solution using a custom database of sorts (remembering, too, that we can't assume the user has a hard disk, so the dictionary still needs to be crunched onto a 360K floppy disk).
<br/><br/>On top of this, we need to handle some other features, such as the user adding new words to the dictionary.
<br/><br/>Writing a spellchecker in the mid-1980s was a hard problem.  Programmers came up with some impressive data compression methods in response to the spellchecker challenge.  Likewise there were some very clever data structures for quickly finding words in a compressed dictionary.  This was a problem that could take months of focused effort to work out a solution to.  (And, for the record, reducing the size of the dictionary from 200,000+ to 50,000 or even 20,000 words was a reasonable option, but even that doesn't leave the door open for a naive approach.)
<br/><br/>Fast forward to today.  A program to load /usr/share/dict/words into a hash table is 3-5 lines of Perl or Python, depending on how terse you mind being.  Looking up a word in this hash table dictionary is a trivial expression, one built into the language.  <i>And that's it.</i>  Sure, you could come up with some ways to decrease the load time or reduce the memory footprint, but that's icing and likely won't be needed.  The basic implementation is so mindlessly trivial that it could be an exercise for the reader in an early chapter of any Python tutorial.
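<br/><br/>For the skeptical, one possible version in Python.  The function names are mine, and lowercasing everything is an assumption about how forgiving the checker should be; a set is used because it is Python's built-in hash table of keys.

```python
def load_dictionary(path="/usr/share/dict/words"):
    """Read one word per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def is_spelled_correctly(word, dictionary):
    """Look the word up with the language's built-in membership test."""
    return word.lower() in dictionary
```

Checking a word is then is_spelled_correctly("stopping", load_dictionary()), assuming the word list exists at that path on your system.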
<br/><br/>That's progress.
</div></content>
</entry>
<entry>
<title>Coding As Performance</title>
<link rel="alternate" type="text/html" href="http://prog21.dadgum.com/28.html" />
<id>http://prog21.dadgum.com/28.html</id>
<published>2008-05-31T00:00:00-06:00</published>
<updated>2008-05-31T00:00:00-06:00</updated>
<author><name>James Hague</name></author>
<content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">I want to talk about performance coding.  Not coding for <b>speed</b>, but coding <b>as</b> performance, a la <a href="http://en.wikipedia.org/wiki/Live_coding">live coding</a>.  Okay, I don't really want to talk about that either, as it mostly involves audio programming languages used for on-the-fly music composition, but I like the principle of it: writing programs very quickly, in the timescale of a TV show or movie rather than the years it can take to complete a commercial product.  Take any book on agile development or extreme programming and replace "weeks" with "hours" and "days" with "minutes."
<br/><br/>Think of it in terms of a co-worker or friend who comes to you with a problem, something that could be done by hand, but would involve much repetitive work ("I've got a big directory tree, and I need a list of the sum total sizes of all files with the same root names, so hello.txt, hello.doc, and hello.whatever would just show in the report as 'hello', followed by the total size of those three files").  If you can write a program to solve the problem in less time than the tedium of slogging through the manual approach, then you win.  There's no reason to limit this game to this kind of problem, but it's a starting point.
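<br/><br/>The directory example is the sort of thing that can be knocked out in a couple of minutes.  Here is one quick Python take, with some assumptions baked in: files with the same root name in different subdirectories are lumped together, only the last extension is stripped (so archive.tar.gz counts under "archive.tar"), and the function names are mine.

```python
import os
from collections import defaultdict


def sizes_by_root_name(top):
    """Map each root name (file name minus extension) to total bytes."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            root, _ext = os.path.splitext(name)
            totals[root] += os.path.getsize(os.path.join(dirpath, name))
    return dict(totals)


def print_report(top):
    """Print one line per root name, sorted alphabetically."""
    for root, total in sorted(sizes_by_root_name(top).items()):
        print(root, total)
```

Run print_report(".") from the top of the tree and the report is done, with no architecture or interface specification anywhere.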
<br/><br/>Working at this level, the difference between gut instinct and proper engineering becomes obvious.  The latter always seems to involve additional time--architecture, modularity, code formatting, interface specification--which is exactly what's in short supply in coding as performance.  Imagine you want to plant a brand new vegetable garden somewhere in your yard, and the first task is to stake out the plot.  Odds are good that you'll be perfectly successful by just eyeballing it, hammering a wooden stake at one corner, and using it as a reference.  Or you could be more formal and use a tape measure.  The ultimate, guaranteed correct solution is to hire a team of surveyors to make sure the distances are exact and the sides perfectly parallel.  But really, who would do that?
<br/><br/>(And if you're thinking "not me," consider people like myself who've grepped a two-hundred megabyte XML file, because it was easier than remembering how to use the available XML parsing libraries.  If your reaction is one of horror because I clearly don't understand the whole purpose of using XML to structure data, then there you go.  You'd hire the surveyors.)
<br/><br/>You can easily spot the programming languages designed for projects operating on shorter timescales. Common, non-trivial operations are built-in, like regular expressions and matrix math (as an aside, the original BASIC language from the 1960s had matrix operators).  Common functions--reading a file, getting the size of a file--don't require importing libraries after you've managed to remember that getting the size of a file isn't a core operation that's in the "file" library and is instead in "os:file:filesize" or wherever the hierarchical-thinking author put it.  But really, any language of the Python or Ruby class is going to be fine.  The big wins are having an interactive read / evaluate / print loop, zero compilation time, and data structures that don't require thinking about low-level implementation details.
<br/><br/>What matter just as much are <b>visualization tools</b>, so you can avoid the classic pitfall of engineering something for weeks or months only to finally realize that you didn't understand the problem and engineered the wrong thing. (Students of <a href="http://www.cs.utexas.edu/users/EWD/">Dijkstra</a> are ready with some good examples of math problems where attempting to guess an answer based on a drawing gives hopelessly incorrect answers, but I'll pretend I don't see them, there in the back, frantically waving their arms.)
<br/><br/>I once used an 8-bit debugger with an interrupt-driven display.  Sixty times per second, the display was updated.  This meant that memory dumps were <b>live</b>.  If a running program constantly changed a value, that memory location showed as blurred digits on the screen.  You could also see numbers occasionally flick from 0 to 255, then back later.  Static parts of the screen meant nothing was changing there.  This sounds simple, but wow was it useful for accidentally spotting memory overruns and logic errors.  I often never suspected a problem, and I wouldn't have even known what to look for, but found an error just by seeing movement or patterns in a memory dump that didn't look right.
<br/><br/>A modern visualization tool I can't live without is <a href="http://www.weitz.de/regex-coach/">RegEx Coach</a>. I always try out regular expressions using it before copying them over to my Perl or Python scripts.  When I make an error, I <b>see</b> it right away.  That prevents situations where the rest of my program is fine, but a botched regular expression isn't pulling in exactly the data I'm expecting.
<br/><br/>The <a href="http://jsoftware.com">J language</a> ships with some great visualization tools.  Arguably it's the nicest programming environment I've ever used, even though I go back and forth about whether J itself is brilliant or insane.  There's a standard library module which takes a matrix and displays it as a grid of colors.  Identical values use the same color.  Simplistic?  Yes.  But this display format makes patterns and anomalies jump out of the screen.  If you're thinking that you don't write code that involves matrix math, realize that matrices are native to J and you can easily put all sorts of data into a matrix format (in fact, the preferred term for a matrix in J is the more casual "table").
<br/><br/>J also has a similar tool that mimics a spreadsheet display.  Pass in data, and up pops what looks like an Excel window, making it easy to view data that is naturally columnar.  It's easier than dumping values to an HTML file or the old-fashioned method of debug printing a table using a fixed-width font.  There's also an elaborate module for graphing data; no need to export it to a file and use a standalone program.
<br/><br/>I'm hardly suggesting that everyone--or anyone--switch over to J.  It's not the language semantics that matter so much as tools that are focused on interactivity, on working through problems quickly.  And the realization that it is valid to get an answer without always bringing the concerns of software engineering--and the time penalty that comes with them--into the picture.
</div></content>
</entry>
</feed>