UPTIME leaves stopped MLDEV jobs around
I have UPTIME running hourly on ES. Frequently, I log in to find one or more dead (stopped) MLDEV jobs lying around:
14 PFTHMG JOB.05 SYS _10!0 ? DSN 3 3 0% PCLSR .VALUE 20
16 PFTHMG JOB.04 SYS 10!0 ? DSN 3 1 0% PCLSR .VALUE 22
The PC is 2067 (LOSE2:), and the instruction there is a .VALUE.
The instruction that precedes it (2066, LOSE1:) is also a .VALUE. Since the stopped job's PC points past the instruction that trapped, the actual reason for stopping was the .VALUE at LOSE1. The only way MLDEV transfers to LOSE1 is by doing a "JRST @CMDTB(A)" where A is 0. Inspecting the A register at the time of the fault shows that it is, indeed, 0. The command table is laid out thus:
;SLAVE REPLY ROUTINES DISPATCHED TO WITH LH(B) = - # ARGS MUST BE READ FROM NET
CMDTB: LOSE1
NTDI
OPNSI
OPNSO
EOF
FDELST
XNOOP
XACC
XCALL
XICLOS
XOCLOS
XIOC
This suggests that the slave (MLSLV) responded with a reply of 0, which the table layout shows is invalid. The jump through the command table is preceded by:
NTINT:	PUSHJ P,NTCHK	;Any input waiting on the network channel?
	JRST GOLOOP	;No -- back to the main loop
	PUSHJ P,REPLY1	;Yes -- read the reply code into A
	CAIL A,RMAX	;Reply code below RMAX?
	.VALUE		;No -- stop
	MOVEM B,LREPLY'	;Save the last reply
	JRST @CMDTB(A)	;Dispatch on the reply code; 0 lands on LOSE1
The call to NTCHK (per the comments) tests the input network channel status; if there is input, NTCHK does a skip return. Since we reach the JRST @CMDTB(A), we must have had input and skip-returned. We then call REPLY1, which processes a reply from the server (MLSLV on the destination host): it does an NTIIOT, moves the result into the A register, checks for an RLOGIN reply (without skipping), and returns the result in A.
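To make the mechanism concrete, here is a rough C sketch of the reply loop. All the names, the RMAX value, and the handler bodies are invented for illustration; this is not the actual MLDEV structure. Note the hole it shares with the MIDAS above: a reply of 0 passes the range check and dispatches straight to the trap entry in slot 0.

#include <stdio.h>
#include <stdlib.h>

#define RMAX 12                       /* one past the highest valid reply; value hypothetical */

static void lose1(void) {             /* analogue of LOSE1: a reply of 0 is invalid */
    fprintf(stderr, "invalid reply code 0 from slave\n");
    abort();                          /* analogue of .VALUE: stop the job */
}
static void ntdi(void)  { /* ... */ }
static void opnsi(void) { /* ... */ }
/* ... the remaining reply handlers ... */

static void (*cmdtb[RMAX])(void) = { lose1, ntdi, opnsi /* , ... */ };

static void ntint(int reply) {
    if (reply >= RMAX)                /* CAIL A,RMAX / .VALUE: out-of-range replies die */
        abort();
    cmdtb[reply]();                   /* JRST @CMDTB(A): a 0 reply indexes the trap */
}

int main(void) {
    ntint(0);                         /* reproduces the stop: lands in lose1() */
    return 0;
}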
I was curious to see which host MLDEV was talking to. I thought the device should be found at the JBCDEV address; that value was $1'DSK'. Maybe it is in FDEVN -- that value was also $1'DSK'. I expected to find a foreign host here. My UPTIME DATA file has entries for DB, ES, NO, and UP. I wonder if this is happening due to the entry for ES, and it is the attempt to read M.F.D. (FILE) on DSK: that is dying. I checked both of the dead jobs I saw in PEEK, and they both have $1'DSK' as the device.
Does anyone know what happens when you attempt to access a file on the ES: device from ES itself -- that is, rather than using DSK:, you use the machine name as the device? JOBDEV ES is linked to ATSIGN MLDEV. But is there any short-circuiting done because the device (ES:) specifies the localhost?
I'm going to patch my UPTIME DATA to not use ES, but rather some non-existent host, like MC. I will look to see if this cures my problem of dead MLDEV jobs.
I'm wondering, since all the dead MLDEV jobs are for the device DSK, whether these only fail when the host is my localhost.
I just confirmed that MLDEV replaces the contents of JBCDEV and FDEVN with $1' DSK' in some situations. I expected it to do this if the host specified in these locations originally matched that of the local host, but I don't see how it would. Here is the relevant code:
	MOVEI A,BUF
	PUSHJ P,NETWRK"HSTLOOK
	.VALUE			;Host not in host table?
	MOVEM A,HOSTN'		;Host number
	MOVEM TT,NETWRK'	;Network number
	PUSHJ P,NETWRK"HSTUNMAP	;Don't need host table any more
	.VALUE
	LDB A,[3000,,JBCDEV]	;Right-hand four characters
	JUMPN A,.+2		;Skip the default if those characters are nonzero
	MOVE A,[SIXBIT / DSK/]	;All spaces (i.e. 0): no device given, default to DSK
	LSH A,14		;Shift left two characters
	MOVEM A,JBCDEV
	MOVEM A,FDEVN
Note the "MOVE A,[SIXBIT / DSK/]" instruction. It is executed when the contents of A is not equal to 0. The instruction that setup A was "LDB A,[3000,,JBCDEV]" and the comment is right-hand four characters. The original contents of JBCDEV was $1' ES', so the right-hand four characters would be $' ES', which would never be equal to 0. So it appears it replaces these with $' DSK' in all cases (unless, I guess, the value of JBCDEV was 0).
So the fact that I found $1' DSK' in these locations in my post-mortem doesn't prove that we were trying to perform file-system I/O on ES. I would have had to look at the HOSTN location, where the host number is stored after the host lookup (and before JBCDEV and FDEVN are patched). Next time I see one of these jobs dead, I'll look in HOSTN and NETWRK (the latter holds the network number).
And it appears no short-circuiting is happening: even when the target host is the same as the local host, we still attempt a Chaosnet connection to the localhost. And when I step through MLDEV (debugging using the OBJ device), it works every time -- no JRST to LOSE1, no .VALUE, and I see the resulting DSK:M.F.D. (FILE) output on the console of the requesting job.
So I'm no closer to understanding why this sometimes fails.
I've also confirmed that in this "local" case (when the MLDEV device is ES: and that is the machine we're on), ITS runs an MLSLV handler, with which MLDEV communicates. So it must be MLSLV that, in some situations, returns a bogus reply that causes MLDEV to .VALUE.
I wonder how you debug one of these handlers (MLSLV). They may well be launched with the JOB device, or maybe some other mechanism in ITS, but I don't think they are subject to the same translation hack I used for OBJ:, because they are not run as jobs under a DDT, and thus I don't know how to make translations apply to them.
It's easy to debug MLSLV. Just run it under DDT! It detects that it is being run this way and listens for connections over Chaosnet (if its jname is MLDEV) or TCP (if its jname is TCP). I just set up a debugging MLDEV in one HACTRN, an ES: M.F.D. (FILE) requesting job in another HACTRN, and an MLSLV in a third. I'm able to debug both ends of the MLDEV <-> MLSLV connection that way.
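The jname check is essentially the old invocation-name dispatch trick. A loose C analogue -- none of this is actual MLSLV code -- would be:

#include <stdio.h>
#include <string.h>

/* Pick a transport based on the name we were invoked under,
   roughly as MLSLV picks one based on its jname. */
int main(int argc, char **argv) {
    (void)argc;
    const char *name = strrchr(argv[0], '/');
    name = name ? name + 1 : argv[0];
    if (strcmp(name, "MLDEV") == 0)
        puts("jname MLDEV: listen for Chaosnet connections");
    else if (strcmp(name, "TCP") == 0)
        puts("jname TCP: listen for TCP connections");
    else
        puts("normal case: run as a device handler");
    return 0;
}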
But of course, I can't reproduce the .VALUE problem described above, and a perusal of the MLSLV code shows no obvious reason it would reply with a 0 rather than one of the defined reply codes, all of which are nonzero.
Well, it turns out it is not an issue with a local host/device. I got the same error with an MLDEV instance talking to NO over Chaosnet. I found a crashed MLDEV whose PC was LOSE1; we get there when the response from MLSLV is 0 (not a valid response). I verified that the host name that was resolved was "NO" and that the resolved host number was 40700003150, which is NO's Chaos address. So we either have a bug in MLSLV, or data is getting garbaged somehow.
I did notice a couple of times, when I did a NO^F from ES, that the directory listing had garbage in the middle of it. So I suspect the issue is not with MLSLV, but with the robustness of the Chaosnet-over-UDP transport we're using with the emulator.
I've not dug into this, but I've had occasional garbage in NO files accessed from UP. It seems very odd, since I'd expect the UDP checksums to catch this -- unless the data is already garbage when it is packed into UDP packets. I've only seen this from NO, never from other machines.
If it is indeed a UDP-related problem, it should be solved by using the new Chaos-over-TLS option instead. My experience (using it from home to MX-11, and the rest of the net through it) is that it's very often quite a bit faster than the UDP option, although I don't really see why. I'll send you instructions for setting it up, separately.
Try removing NO from the data file for, say, a week.
In trying to debug why my new EX ITS can't talk to my ES ITS, I noticed that when EX (or ES) sends Chaosnet packets through no.nocrew.org, I get very, very frequent checksum errors in the UDP packets:
01:38:31.609220 IP (tos 0x0, ttl 64, id 62274, offset 0, flags [DF], proto UDP (17), length 92)
ip-10-0-0-55.ec2.internal.42043 > static.74.191.99.88.clients.your-server.de.42042: [bad udp cksum 0x223e -> 0x9402!] UDP, length 64
0x0000: 4500 005c f342 4000 4011 256a 0a00 0037 E..\.B@.@.%j...7
0x0010: 5863 bf4a a43b a43a 0048 223e 0101 0000 Xc.J.;.:.H">....
0x0020: 0900 0025 0668 cc21 0671 8827 0004 0004 ...%.h.!.q.'....
0x0030: 6f43 6e6e 6365 6974 6e6f 6420 656f 2073 oCnnceitnod.eo.s
0x0040: 6f6e 2074 7865 7369 2074 7461 7420 6968 on.txesi.ttat.ih
0x0050: 2073 6e65 e064 0668 0671 288c .sne.d.h.q(.
I don't see any of these when sending/receiving packets through up.update.uu.se.
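Incidentally, the payload above decodes cleanly as CHUDP. As I understand the framing from KLH10's dpchaos (an assumption on my part -- check dpchaos.h for the authoritative layout), the first four bytes 01 01 00 00 are the CHUDP header, and the Chaos packet that follows carries its 16-bit words big-endian, which is why the ASCII column reads pairwise-swapped ("oCnnceitnod.eo.s" is byte-pair-swapped "Connection does..."):

#include <stdint.h>

/* CHUDP framing as I read it from KLH10's dpchaos -- hedged,
   not a quote of dpchaos.h. */
struct chudp_header {
    uint8_t version;    /* 0x01 in the dump above */
    uint8_t function;   /* 0x01 = data packet */
    uint8_t arg1, arg2; /* 0x00 0x00 here */
};
/* The Chaosnet packet follows, its 16-bit words transmitted big-endian;
   ASCII text therefore shows up with byte pairs swapped in tcpdump. */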
Spoke too soon. Now I'm seeing them through up.update.uu.se too. Oh well. Thought I'd found the source of the issues with NO.
It has seemed to me that NO talking Chaosnet is particularly slow. If the checksum errors are more frequent to/from NO than to other hosts, that may explain the slowness.
Just had another dead MLDEV -- same error -- bad response from MLSLV. And the target host was NO, again. When these die, I load symbols and check the HOSTN contents:
*hostn/'BOJ◊: .IOT IOP,PKTBUF+12 =40700003150
As I think I mentioned earlier, MLDEV changes the JBCDEV value from 'NO to 'DSK prior to opening the connection to the remote host, so HOSTN is the place to look to see which host it was talking to.
Googling the UDP checksum problem, it seems it might be related to "hardware offloading of checksums", which is commonly used in virtual environments. (One caveat: with transmit offloading enabled, tcpdump on the sending host captures outgoing packets before the NIC has filled in the checksum, so they can show up locally as "bad udp cksum" even when they leave the wire intact; capturing on the receiving side is more conclusive.) Offloading can be turned off with ethtool, perhaps using
ethtool --offload eth0 rx off tx off
but YMMV, of course. Maybe you should google it yourselves first. :-)
Wanna try that on NO, Lars? I can report on whether the errors go away.
Ok, tx checksum offloading is off. Couldn't change rx.
Eric, did you see any differences at your end after the offloading change?
I haven't seen any MLDEV hangs from NO since your change, and UPTIME is always showing up-to-date uptimes for NO.
Bummer. I got a dead MLDEV job on ES. Its PC was the same .VALUE (LOSE1), and HOSTN indicated it was NO that caused the issue. So I guess the tx checksum offloading fix didn't cure this.
While I haven't yet seen any MLDEV jobs die on ES or EX as a result of bad data being returned by MLSLV on one machine or the other, I have seen some data corruption when I've tried to list directories or retrieve files between the two machines. This is the same corruption that I've seen when using Chaosnet-over-UDP with NO. I think the Chaosnet-over-UDP implementation in KLH10 is just not robust in the face of errors. When I ran ES, EX, and the Chaosnet bridge on the same Linux VM, I had no issues. When (due to memory issues on my instance) I moved EX to another instance, and thus had to go over the Internet for my Chaosnet traffic between ES and EX, I started seeing corruption -- not always, of course, but periodically.
For example, here is a fragment of a listing, taken from EX, of a directory on ES:
0 TS BKG 21 3/22/1978 04:40:04
0 TS CLD 1 3/9/1979 19:53:03
0 TS D ≥·@@EE<j^IADp@EMhdfiMd≠∀↓↓@@@αRL@@↓↓↓∀@↓↓·@@E↓·@h=I\^be]`@bIiLrtIL4∀@↓A··· TS EG 6 1/16/1979 17:01:17
0 TS EL 1 1/27/1978 07:14:13
0 TS EM 6 1/16/1979 17:10:43
Interesting. At least there's a clear indication something's wrong.
Could you post your "devdef chaos" and "link chudp" lines from your configs? It's really strange that none of the checksums (both UDP and CHUDP use them) catch this. Did you double-check the checksum verification in your version of dpchaos.c (it should be in chaostohost_chudp(); search for ch_checksum)?
If the checksums work, the error would be somewhere before sending the packet, or after receiving it. Concurrency bug?
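For anyone who wants to re-verify captured packets by hand: the Chaosnet checksum is, per AIM-628, the same 16-bit ones'-complement sum that IP and TCP use. Here is a generic sketch (not the actual ch_checksum code in dpchaos.c):

#include <stddef.h>
#include <stdint.h>

/* Ones'-complement checksum over big-endian 16-bit words, as used by
   IP/TCP and by Chaosnet.  Generic sketch, not dpchaos.c's ch_checksum. */
static uint16_t cksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    while (len > 1) {                     /* sum the 16-bit words */
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len)                              /* pad an odd trailing byte with zeros */
        sum += (uint32_t)data[0] << 8;
    while (sum >> 16)                     /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;                /* complement is the checksum */
}

As with IP, summing over the whole Chaos packet with the stored checksum included should come out as all ones (so cksum() returns 0); if a captured CHUDP packet fails that test while its UDP checksum was good, the data was already bad before UDP ever saw it.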