Wednesday, June 28, 2006

Some old thoughts about TeX

I saw a link on reddit the other day regarding some math-related videos, including a series of lectures by Don Knuth, including some on TeX internals. (One frustrating bit about the Web video is the blurriness of his terminal. Another is that he uses what is now a slightly-out-of-date version of TeX itself).

He made a comment near the beginning of section 3, probably similar to things he has written, to the effect that in his education he had benefited both from reading badly-written programs, because they were evidence he could do better, and from well-written programs, because they were a pleasure to read. He then wondered aloud which category TeX is in, but that either way it was a win. I'm not at all sure myself. Perhaps TeX is a monument to the best that can be done in Pascal, and simultaneously a warning that it is the best that can be done in Pascal.

I had a definite feeling during grad school, when I was learning LaTeX to write my thesis, and learning Lisp on the side, that a large chunk of TeX internals was spent managing things that manage themselves in Lisp
  1. Dynamic memory (including variable-length strings and reference-counted objects)
  2. Interned static strings (i.e. symbols)
  3. Rational arithmetic protected from overflow and portable between machines
  4. Heterogeneous lists

Knuth also spent a good deal of time, however, fretting about the "inner loop" of TeX, so perhaps re-implementing it in a higher-level language would incur a performance penalty of large constant factors in the runtime of TeX.

I spent a few odd hours with TeX, The Program, coding a few bits in Lisp, as well as beginning a few sketches that could serve as commentary. (One of the most grievous flaws in TeX, The Program is its utter lack of diagrams, apart from the rough memory map on the back flyleaf.) The main topic of the commentary was to sketch memory structures and the overall memory map in diagram form.

Another grievous flaw I saw, perhaps related to the efficiency concerns, was the monolithic architecture. He speaks of TeX's eyes and mouth and digestive tract, but the WEB presentation blurs this all together; as far as I could tell with my limited reading the TeX syntax is processed directly into lower level memory structures without any clear documentation of how those memory structures should be thought of as logical constructs. That is, there is no higher order description of how TeX should behave in the code, simply procedures to take TeX and produce almost raw bytes, and then slurp up those raw bytes into DVI output, with certain very carefully designed algorithms for math layout and paragraph breaking hidden inside.

Knuth's approach to the lack of abstraction in the Pascal expression is to add English prose, both in the WEB format of the code, and in the TeXbook's user documentation, but not to present a more abstract, formal, or diagrammatic description of the architecture. (He does describe some small algorithms in more precise ways, such as the DVI optimizations using down_ptr and right_ptr.) I would hope to replace this by an elegant Lisp description of the underlying structures and algorithms, leaving the messy hand-made arithmetic and memory management off to the side as part of the unfortunate need to run on 1982-era platforms. But I get the nagging feeling that the whole design of TeX the language and the data structures and the algorithms is just one big ball of mud that will resist such analysis.

I am slightly annoyed by the use of numerical coincidence (e.g. using the numerical ordering and grouping of an enumeration to simplify some decision logic) and a few related hidden limits (must get my copy of TeX, The Program to see where I wrote one in the margin). Seems rather too low-level for someone trying to present things clearly.

Another one of my past TeX-related projects was to try to get a from-scratch Pascal-based WEB system producing TeX on Mac OS X, so that I could really understand the WEB code, e.g. by enhancing the TeX internal debugger, instead of simply using a Web2c translation.

No comments:

Post a Comment