Monday, July 31, 2006

More adventures in multithreading

Just wanted to capture a few more thoughts. Over the weekend, I experienced what I think was my first case of deadlock. My chaosnet-on-OpenMCL implementation had worked just fine when the two sides of a connection were careful to exchange only one packet at a time, taking turns. When I tried to listen to a SEND (the Lisp machine analogue to an instant message), the CADR emulator sent two packets in quick succession, and my OpenMCL process would freeze trying to receive them.

As an aside, I get the feeling that I don't really know how to debug this kind of thing in OpenMCL: backtraces do not always make it obvious which semaphore or lock each thread is waiting for, and I don't think it is possible to inspect a semaphore to see which thread(s) are waiting on it.

In any case, I concluded that my somewhat haphazard use of a "count lock" to protect access to packet number information in a connection was either being taken recursively, or violating a lock hierarchy. So, after some time trying to apply my powers of deduction, I decided to look at the Chaosnet problem from a high level to determine what locking is actually necessary.

The main complexity has to do with the transmission window. In theory, it seems like an ideal case for a counting semaphore: the send window is a limited resource, and sending processes must wait until a spot within the window is available. (Generally, one would wish to preserve FIFO order, but strictly speaking, multiple threads trying to transmit on the same channel might not deserve any guarantees that their packets will happen in any particular order.)
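As a sketch of the counting-semaphore view (in Python rather than Lisp; WINDOW_SIZE and both function names are made up for illustration):

```python
import threading

WINDOW_SIZE = 5  # hypothetical window size negotiated at connection time

# Each free slot in the send window is one unit of the semaphore;
# a sender blocks until acknowledgements free up a slot.  Note that
# this gives waiting senders no FIFO guarantee, matching the caveat
# about ordering among multiple threads sharing one connection.
window = threading.BoundedSemaphore(WINDOW_SIZE)

def send_packet(pkt, transmit):
    window.acquire()   # wait for a free slot in the window
    transmit(pkt)      # actually put the packet on the wire

def handle_acknowledgement(n_acked):
    # Called from the thread reading incoming packets: each newly
    # acknowledged packet frees one slot for waiting senders.
    for _ in range(n_acked):
        window.release()
```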

However, there are several unusual things about the window: the thread handling incoming packets can get acknowledgements both in ordinary packets, as the receiving user process consumes packets from the window, and in STS packets that could, at any time, resize the window. In the usual semaphore model, it is not easy to deterministically reduce the maximum level of the semaphore. I coded an unsatisfactory approach that does a trial wait on the semaphore for the number of window slots that have been taken away. But those trial waits might not succeed if the window is filled above the new capacity. In that case, the window must presumably be allowed to drain below the new level before any waiting process is allowed to put packets in. It might be possible to have a separate integer variable tracking the "excess window consumption," and for processes to loop back and wait again on the semaphore if the window has been "tightened" since they first waited. I don't know whether threads can maintain their original FIFO order in this case, or whether it matters. (I'm not sure it is reasonable for Chaos processes to reduce the window, once it is declared in the initial STS negotiating the connection, as opposed to widening it to allow for better throughput. The AIM memo is silent on this.)
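One way around the trial-wait awkwardness is to drop the semaphore entirely and keep the capacity and occupancy as plain integers guarded by a condition variable, with waiters looping back to re-test after every change — essentially the "loop back and wait again" idea above. A Python sketch (all the names here are mine, not from any Chaosnet code):

```python
import threading

class ResizableWindow:
    """Sketch of the 'excess window consumption' idea: a window whose
    capacity can shrink below its current occupancy."""

    def __init__(self, capacity):
        self.capacity = capacity   # current maximum window size
        self.in_flight = 0         # packets sent but not yet acknowledged
        self.cond = threading.Condition()

    def acquire_slot(self):
        with self.cond:
            # Loop back and re-test: the window may have been
            # "tightened" since we first started waiting.
            while self.in_flight >= self.capacity:
                self.cond.wait()
            self.in_flight += 1

    def acknowledge(self, n=1):
        with self.cond:
            self.in_flight -= n
            self.cond.notify_all()

    def resize(self, new_capacity):
        with self.cond:
            # If in_flight > new_capacity, the window is simply
            # over-full; it must drain before acquire_slot succeeds.
            self.capacity = new_capacity
            self.cond.notify_all()
```

Note that condition-variable waiters get no FIFO guarantee either, which matches the earlier caveat about ordering.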

Interestingly, the draft chaosnet memo does discuss enlarging and shrinking the window, and mentions WIN and WTS packets.

In terms of avoiding deadlock, there are many participants: the user reading process looks at the window to see if the connection might be getting "stymied" and needs an STS to be transmitted. When STSes or acknowledgements come in, retransmission of the receipted or acknowledged packets should be suppressed, and the acknowledgements (but not the receipts) open up space in the window. In the Lisp Machine design, receive-sts et al. use the RECEIPT function to trim the send-pkts list. I felt that contending with the sending/retransmitting process was asking for trouble. Therefore, I chose to have the retransmission process itself look at the receipt and acknowledgement status and clean up send-pkts, since it needs exclusive access to the send-pkts list anyway: it needs the list to retransmit, and the counters to do the cleanup. Furthermore, updates of the acknowledgement and receipt states should ratchet only forward, i.e., a stale packet coming through the pipe should not erase the acknowledgement or receipting indicated by the logically latest packet received so far. That requires locking to protect the read-modify-write update of those status numbers.
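The forward-only ratchet amounts to a comparison before each store, all under the lock. A Python sketch, assuming Chaosnet's 16-bit modular packet-number comparison (my reading of PKTNUM-<; the class and names are mine):

```python
import threading

def pktnum_less(a, b):
    # Chaosnet packet numbers are 16 bits and wrap around, so
    # "less than" is modular: b is ahead of a if the forward
    # distance from a to b is under half the number space.
    # (This mirrors the Lisp Machine's PKTNUM-<, as I understand it.)
    return 0 < ((b - a) & 0xFFFF) < 0x8000

class ConnState:
    def __init__(self):
        self.lock = threading.Lock()
        self.ack_num = 0       # highest acknowledgement seen so far
        self.receipt_num = 0   # highest receipt seen so far

    def note_packet(self, ack, receipt):
        # Read-modify-write under the lock, and ratchet forward only:
        # a stale packet arriving late must not roll these back.
        with self.lock:
            if pktnum_less(self.ack_num, ack):
                self.ack_num = ack
            if pktnum_less(self.receipt_num, receipt):
                self.receipt_num = receipt
```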

One interesting gap in the ChaosNET documentation is whether retransmitted packets can or should update their acknowledgement and receipt information to reflect what has happened to the connection in the meantime. The Lisp Machine implementation seems to retransmit the packet byte-for-byte. An alternative would be to update those packets. In practice, I suspect retransmission should be conservative; it is necessary when the receiving process has fallen behind, or the channel is experiencing loss, and the bias should be toward more retransmission than strictly necessary, instead of hoping to "make up for lost time" by advancing the state of the return channel.

I have created enough of a system to work with the SEND transaction, except for one detail: in a Chaos "stream", my functions indicate an EOF condition only when the request for a byte has gone "off the end." I felt the EOF detection should predict whether the next byte request will succeed, but I could not remember how this works elsewhere.

Tuesday, July 25, 2006

Apple II Disk Transfer

It seems that others have also had the desire to upgrade the ADT (Apple Disk Transfer) tool to work with disks other than the Disk II 5-1/4 inch floppy format under DOS 3.3.

ADTPro is a ProDOS-based version of ADT. It should presumably work with all block devices recognized by ProDOS. I e-mailed some code to David Schmidt that allowed ADT to work with the ordinary IIgs serial ports. Direct access to the Z8530 chip might allow for faster transfers; that's a potential project.

Friday, July 21, 2006

Wednesday, July 19, 2006

More thinking about multithreaded queueing

Looking back on my previous post about multithreaded queues, based on a little hacking I was able to do last night on my multithreaded-Lisp-user-mode-Chaos-client, and having read a bit more in the Little Book of Semaphores (must stop typing 'sempahores'!), in particular the turnstile pattern and the barbershop problem, there are a few other ways to slice the onion.

  1. A "waiting room", i.e., the privilege of blocking on the queue is exclusive. That actually makes a certain amount of sense when, for instance, Chaos packets are already sorted by destination connection. Does it really make sense for multiple consumers to be concurrently transacting on a single connection? What would make sense, in a C10k kind of way, is rapidly generating server connections in response to RFCs and handing these connections to waiting server threads. Or, if a connection is known to have independent incoming packets, or is deliberately recycled, consumers can queue through a turnstile, each taking one incoming packet before going off to chew on it. I think the mechanism could be a simple lock.

  2. "Signal enough" can be expanded into a "turnstile" approach, where a disappointed consumer signals the semaphore on his way "out" so that the next waiting consumer can be disappointed as well. Making the reason for the disappointment visible to incoming consumers before they block is really optional. The main reason to prefer it is that, if we believe the "disappointment" is permanent, we will eventually want to discard the queue, but we presumably cannot do so if consumers can continue to block, that is, if the disposal process must race against the arrival of new consumers. Discarding a queue when the semaphore still indicates the presence of (non-existent) packets and no threads are waiting is OK. Discarding a queue when the semaphore indicates no packets, but threads are blocking, requires a well-defined mechanism for blocked threads to cleanly deal with a "killed" semaphore, i.e., the "abort" mechanism.
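A sketch of the "disappointed consumer" turnstile from item 2, in Python (the ClosableQueue name and the None-as-abort convention are mine):

```python
import threading
from collections import deque

class ClosableQueue:
    """Sketch of the 'signal enough' turnstile: once the queue is
    closed and empty, each waking consumer re-signals the semaphore
    on its way out, so every blocked consumer eventually learns of
    the closure instead of blocking forever."""

    def __init__(self):
        self.items = deque()
        self.sem = threading.Semaphore(0)
        self.lock = threading.Lock()
        self.closed = False

    def put(self, item):
        with self.lock:
            self.items.append(item)
        self.sem.release()

    def close(self):
        with self.lock:
            self.closed = True
        self.sem.release()   # wake one waiter; it will wake the next

    def get(self):
        self.sem.acquire()
        with self.lock:
            if self.items:
                return self.items.popleft()
            # Disappointed: the queue is closed and empty.  Signal the
            # semaphore on the way out so the next waiter can be
            # disappointed too, then report the abort.
            self.sem.release()
            return None
```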

I really like the Little Book of Semaphores. However, I find it hard to be sure that I truly grasp the concepts. I can read through the solutions to the problems, and understand the discussion, but that doesn't give me a huge amount of confidence in my own solutions to slight variations on the problem. Maybe this kind of programming is simply hard.

I've tried casting some of the solutions in OpenMCL Common Lisp. It's hard to verify that things aren't merely working by accident. One thing that wasn't particularly clear to me is whether it is kosher to do things like

(let ((semaphore (ccl:make-semaphore)))
  (labels ((client-function () #|...|#)
           (server-function () #|...|#))
    (let ((thread1 (ccl:process-run-function "client" #'client-function))
          (thread2 (ccl:process-run-function "server" #'server-function)))
      #|...some way to wait for both threads to terminate..|#)))

It seems to work, except I haven't yet tried to make a clean "wait for both threads to terminate", and have just waited for the length of a global list to reach a certain point, or similar kludges. The idea, however, that the let can create variables shared across threads while each lambda's own locals remain "thread-local", seems not to be formally specified.
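For what it's worth, the analogous construction in Python is well defined: bindings closed over by both functions are shared between threads, each call's own locals live on that thread's stack, and join gives the clean wait for termination. A sketch (all names are mine):

```python
import threading

results = []   # bound once, shared by both closures

def make_workers(semaphore):
    # 'semaphore' and 'results' are closed over and shared between
    # the threads; each function's own locals are per-thread.
    def client():
        semaphore.release()
        results.append("client done")
    def server():
        semaphore.acquire()
        results.append("server done")
    return client, server

client, server = make_workers(threading.Semaphore(0))
t1 = threading.Thread(target=client)
t2 = threading.Thread(target=server)
t1.start(); t2.start()
t1.join(); t2.join()   # the clean "wait for both threads to terminate"
```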