Tuesday, February 5, 2008

COMFY-6502: work in progress

As I've mentioned before, I've become intrigued by Baker's COMFY assembler, and have been working on porting it to Common Lisp, and making it a bit more powerful in the link stage.

One metric to judge the success of this kind of "medium-level" language is how well it compiles compared to hand-written assembler code. For the 6502, there are a few examples of code created by wizards like Steve Wozniak, which you can find copies of around the web, the largest being the Apple II monitor, the Integer Basic interpreter, and some medium-sized ones like the Apple II 6502 step/trace, the mini-assembler, the floating-point "Wozpack," and the Sweet-16 virtual machine.

This kind of code has a lot of quirks that make it hard to straight-forwardly translate: lots of shared "tail code", branches known by the programmer to always be taken (to save a precious extra byte consumed by an unconditional JMP), and the classic "fake RTS trick", pushing a return address picked from a table onto the stack, then using an RTS instead of a zero-page indirect JMP. Common in Woz's code is a further shortening of the code by arranging the destination addresses to all be in the same 256-byte page, so the high-order byte is a constant. Some of these ideas will likely be possible with intelligent macros, combined with address labels and computation on those labels. I'm puzzling a bit over how to optimize the "same page" condition: whether to include a "link-time assert", which will issue an error if the code is emitted so as to cross a page, or an even more intelligent "link-time computation" which, given the current available memory space, can choose a location that meets the constraints.

That's all still in development. The first test case I used was the classic "Red Book" tone routine.

;; the original
;; location 00: pitch
;; location 01: duration
0002-  AD 30 C0   LDA  $C030 ; whap speaker
0005-  88         DEY
0006-  D0 04      BNE  $000C
0008-  C6 01      DEC  $01   ; y reaches 0: a duration tick
000A-  F0 08      BEQ  $0014 ; duration exhausted when reaches 0
000C-  CA         DEX        ; x counts time to "whap"
000D-  D0 F6      BNE  $0005
000F-  A6 00      LDX  $00   ; x reached 0, reload pitch
0011-  4C 02 00   JMP  $0002 ; and whap
0014-  60         RTS
A few comments:
  1. Locating the code in zero-page needlessly wastes valuable zero-page locations. The code is easily relocatable to other locations.
  2. X counts from the "pitch value" down to 0 at a fixed rate, "whapping" the speaker each time it reaches zero.
  3. Y cycles around continuously at basically the same rate; each time it reaches zero, one tick is subtracted from the duration value. When duration reaches zero, the routine returns.
  4. This classic tone routine has the important feature that duration is absolute, and not dependent on pitch
  5. Y is uninitialized, which presumably causes a slight jitter in the duration, unless BASIC ensures Y=0. Likewise, X in the first pitch countdown is uninitialized.
  6. duration is destroyed by the routine, but pitch is not.

Here is my transcription to COMFY-6502. I have translated the "I", "J", and abbreviated "L", "LI", etc. opcodes for this example, but not yet changed Baker's code to accept them.

(let ((comfy-6502::*symbol-table* (make-hash-table)))
  (equ speaker #xc030)
  (equ pitch 0)
  (comfy-6502-tests::compile-code
    (comfy-6502:loop
      ;; repeat until some clause loses: actually, only exit is
      ;; through return, so each clause should be ensured of winning.
      (LDX pitch)
      (LDA speaker)
      (not (comfy-6502:loop  ;; repeat until whap time
             ;; "whap" time is when a clause loses;
             ;; NOT wins, outer loop continues
             DEY
             (if =0? ; has Y reached zero?
               (not (seq (DEC duration) ; Y has reached zero, duration ticks down
                         =0?       ; if duration has not reached zero,
                         ; SEQ fails, NOT wins, IF wins, inner loop continues
                         RTS)) ; duration reached zero, return
               (seq DEX ~=0?))))))) ; Y not zero, count down pitch.
               ;; If X has reached zero, SEQ loses, IF loses, inner loop exits
To understand the code, the basic COMFY principles must be understood.
  1. Conditional tests (such as "=0?", corresponding to a BEQ or BNE), "win" if they are true, "lose" if they are false.
  2. Ordinary machine operations (such as LDA, DEX, RTS) always "win"
  3. SEQ sequentially each enclosed form, until one "loses", in which case SEQ "loses", or until they all "win", in which case SEQ "wins."
  4. LOOP executes each enclosed form, until one "loses", in which case LOOP "loses", continuing at the top of the loop if every form "wins." (I have introduced an "implied" seq to Baker's ELisp implementation.) It is SEQ stuck on potentially endless repeat.
  5. NOT executes the single enclosed form, but changes "win" into "lose" and vice-versa. This is useful to keep a nested "loss" from terminating all the enclosing forms.
  6. (IF test win-form lose-form) executes TEST. If TEST "wins", WIN-FORM is executed, and its win/lose result is passed out as the result of the IF. If TEST "loses", LOSE-FORM is executed, and its win/lose result is passed out as the result of the IF.

Note that I have chosen to initialize X on the first pass through the loop. That makes the code somewhat clearer, and I think is better style.

So what code does COMFY (as of today) generate for this? Behold.
(; reload-pitch
(LDX :ZERO-PAGE) (:ZERO-PAGE 0)
(LDA :ABSOLUTE) (:ABSOLUTE 49200)
; spin
(DEY)
(BEQ) (:BRANCH 6) ; duration-tick
(DEX)
(BNE) (:BRANCH 8 ) ; goto spin
(BEQ) (:BRANCH 9) ; goto reload-pitch
; duration-tick
(DEC :ZERO-PAGE) (:ZERO-PAGE 1)
(BNE) (:BRANCH 2) ; goto-spin
(RTS)
; goto spin
(JMP :ABSOLUTE) (:LONG-BRANCH -14)
; goto reload-pitch
(JMP :ABSOLUTE) (:LONG-BRANCH -22))
This is a symbolic "intermediate" form, not yet relocated and linked. Some keys for interpretation.
  1. (BEQ), (JMP :ABSOLUTE) etc. represent one-byte opcodes, with the specified addressing mode (redundant in the case of branch instructions, which have only the branch-relative mode).
  2. (:ZERO-PAGE 0) is a single-byte, value 0, tagged to indicate that it represents a reference to zero-page location 0, and is not just a immediate argument of value zero.
  3. (:ABSOLUTE 49200) is a two-byte reference to the address #xC030, which will be emitted in the usual 6502 low-byte first as #x30 and #xC0.
  4. (:BRANCH n) represents a short branch. It is one byte long. Note that the destination counts relative to the BRANCH byte, not relative to the next instruction, as the 6502 branch values do. (The relocater corrects for this when converting to 6502 code bytes.)
  5. (:LONG-BRANCH n) is a long branch. It is two bytes long. JMP is absolute in the 6502; the relocated will deposit the absolute address by adding this value to the location of the first byte of this pair.

I have added this intermediate form because Baker's Elisp goes directly to bytes in a fixed memory area, starting from high addresses. My goal is for a high level "COMFY module" macro to result in an intermediate form that knows its length in bytes, what zero page locations it uses, and what internal entry points it exports. Otherwise, it is relocatable by the linker. I would also like the feature that the zero-page references are symbolic (e.g. (:ZERO-PAGE duration)), in which case the linker would allocate a byte in zero-page, and all modules referring to "duration" would refer to that address. Even nicer would be the ability to do math, for instance (:ZERO-PAGE (+ duration 1)), or to emit (:ABSOLUTE (- tone-entry 1)) to encode the module entry point in the "off-by-one" method used by the "fake RTS" trick. Similarly, one could have elements such as (:BYTE (HIGH tone-entry)), representing the high-order byte of the tone-entry address, and allowing arbitrary link-time math.

As for the code itself, it has the same similar semantics as the Red book routine.[UPDATE: but not quite. See more recent post for correction.] It has a somewhat minor change in that the DEY is followed by a BNE in the original code (taken 255 out of 256 times), and by a BEQ in the COMFY version (taken only 1 out of 256 times). Any difference in the execution time will, of course, change the pitch emitted. More serious is the fact that COMFY, as coded, is biased toward forward conditional branches. The two JMP instructions after the RTS are reached only through branches that could reach their destination directly. COMFY compiles, as I mentioned, from the last instruction toward the first. Forward branches are trivially resolved, but backward branches don't know their destination, in particular, whether the destination is a short branch away or not. The backward JMPs are created by the code for LOOP, which first emits a "placeholder" absolute jump, compiles the loop forms, including branches that might land on that jump, then patches the JMP destination to point to the first form of the LOOP.

That suggests that LOOP might be enhanced to include a "loop-tightening" phase which examines the body of the loop for forward branches (:BRANCH x) to jumps (:LONG-BRANCH y), and if x+y+1 is within range of a short branch, replaces it. (The +1 is for the JMP opcode itself.) The result would be
(; reload pitch
(LDX :ZERO-PAGE) (:ZERO-PAGE 0)
(LDA :ABSOLUTE) (:ABSOLUTE #XC030)
; spin
(DEY)
(BEQ) (:BRANCH 6) ; duration-tick
(DEX)
(BNE) (:BRANCH -5) ; spin
(BEQ) (:BRANCH -12) ; reload-pitch
; duration-tick
(DEC :ZERO-PAGE) (:ZERO-PAGE 1)
(BNE) (:BRANCH -11) ; spin
(RTS)
(JMP :ABSOLUTE) (:LONG-BRANCH -14)
(JMP :ABSOLUTE) (:LONG-BRANCH -22))
The two JMPs are now unused. This is a special case, because RTS is itself a jump. Because LOOP knows that it freshly emitted the JMP, no external code can properly reference it. If all the branches to it have been patched, and the last instruction cannot reach it, the JMP itself can be omitted.Assuming I can convince COMFY to be smart enough to omit the (proven) redundant JMPs, we are ready for comparison. Let's re-write the original code to match the explicit X initialization, and use COMFY's discovery that moving the LDX to the front of the loop means the absolute JMP $0002 in the original can be replaced by a BEQ.
reload-pitch LDX PITCH
LDA $CO30
spin         DEY
BNE not-tick ; actually, takes branch more often
; duration-tick
DEC duration
BEQ return
not-tick     DEX
BNE spin
BEQ reload-pitch ; save a byte compared to original JMP
return       RTS
COMFY, with the yet-to-be-implemented loop tightening has made the code one byte shorter, and relocatable.As mentioned before, COMFY has luckily or unluckily chosen the branch after DEY to be taken infrequently, while the author of the tone routine chose it to be taken frequently. We cannot know whether the author de-optimized to reach a lower pitch range. If so, to match the timing more closely, I can reverse the sense of the IF, causing it to emit a BNE. In this case, the hypothetical loop-tightening can eliminate one JMP, but not the other.
(; reload-pitch
(LDX :ZERO-PAGE) (:ZERO-PAGE 0)
(LDA :ABSOLUTE) (:ABSOLUTE 49200)
; spin
(DEY)
(BNE) (:BRANCH 6) ; not-duration
(DEC :ZERO-PAGE) (:ZERO-PAGE 1)
(BNE) (:BRANCH -6) ; spin
(RTS)
; not-duration
(DEX)
(BEQ) (:BRANCH -15) ; reload-pitch
; goto-spin
(JMP :ABSOLUTE) (:LONG-BRANCH -12)
We see this simple change affected the DEY test, but also affected the DEX test, biasing it to be taken less frequently. Perhaps an annotation could added to the COMFY language to indicate "please prefer (or expect?) this test to win/lose." This would disrupt the tight recursive elegance of the GENBRC routine, but it might prove handy in modern architectures with their branch-prediction tendencies. It is also not so simple to change the test after DEC (:ZERO-PAGE 1) to BEQ to the RTS. That would require moving the RTS to be emitted at the end, i.e. compiled by COMFY first. I'm more interested in implementing the loop-tightening logic than attacking that brain-teaser right now.The moral may be that COMFY (plus yet-to-be-implemented optimization passes) can emit code very much in the ballpark of humans, at least for these toy examples. For cycle-level timing, however, I might follow the lead of Sassy and allow for the human to code branches and jumps explicitly. (Sassy's example "boot sector" code actually avoids pretty much all the COMFY primitives; however, the documentation also seems to suggest that loop tightening and similar optimizations have been implemented there.) And I have yet to successfully translate Woz's code into COMFY, probably because I have to understand it first.(Note to self: AVOID the tempting WordPress "code" button. Use the sourcecode tag, although it supports neither assembler nor Lisp.)

No comments:

Post a Comment