Issues on copying files on PC-98 with LOADALL?
This is a sub-issue from https://github.com/ghaerr/elks/issues/2389#issuecomment-3393394761
I booted from a 1232k floppy and confirmed that
mount /dev/hda1 /mnt
cp /bin/* /mnt/tmp
gets stuck in the same way with hma=kernel xms=on #xms=int15
It succeeded with #hma=kernel #xms=on #xms=int15, with hma=kernel #xms=on #xms=int15, and with #hma=kernel xms=on xms=int15
So it looks like a LOADALL problem.
Thank you.
I have tried one more variation, using LOADALL
#hma=kernel xms=on #xms=int15
Unexpectedly this succeeded.
It might only occur when hma=kernel and LOADALL are both enabled.
Hmm, something might still be wrong with the above case too. I couldn't remove some files properly. (EDIT: some wrong files have been created)
Perhaps try copying floppy to floppy (instead of hard disk) or to same floppy to make sure problem isn't the hard disk driver or differing sector sizes.
Hello @ghaerr ,
I need to prepare another floppy disk, so I will try that later, but is there a difference in the driver between using int1F and LOADALL? (It looks like int1F is working.)
Thank you.
Hello @tyama501,
There shouldn't be any explicit difference in the driver between INT 1F and LOADALL, they are both handled underneath the xms_fmemcpy function call. We can take a deeper look at that when a more definitive error case is found.
From your screenshot above, it looks like the /mnt/tmp directory entries have been written with garbage. You might try ls /mnt/tmp to get a better look.
Whether this is related to INT 1F, LOADALL or possibly sector size differences between FD1232k and HD is hard to tell at this point. LOADALL is a non-BIOS rewrite of the BIOS INT 1F call. I had thought we tested LOADALL pretty heavily, but it's always possible a register was not saved or perhaps another area of memory being overwritten (IIRC the LOADALL CPU buffer is directly next to DMA buffer?) Also could be related to HMA as you have seen. At this point, reducing variables and keeping error reproduction as simple as possible will help.
Does this occur only on real hardware, or can you duplicate on Neko or DosBox-X?
Thank you!
Neko looks ok. It is using UNREAL mode.
EDIT:
fsck-dos.os2 after copying files from fd1232 to hda1 on Neko.
Thank you.
Hello @ghaerr, hello @tyama501. I wrote the LOADALL code before, and I found a bug in loadall.S.
Line 174: mov %bx,%es:(0x30) #BX is block operation code
830h is not the BX register slot.
Line 176: //mov %dx,%es:(0x2e) #the following not required to be preserved
82Eh is not the DX register slot either.
These are reversed. Thank you!
Hello @drachen6jp,
Wow! Very nice catch on incorrect registers being setup in loadall routine!!! What is most surprising to me is that this ever worked in the first place, since the value of DX (instead of BX) was being used to determine the BX operations code (0=copy, 1=clear).
Hello @tyama501, I have made the following patch for your testing and will commit it shortly:
diff --git a/elks/arch/i86/lib/loadall.S b/elks/arch/i86/lib/loadall.S
index cce9a833..a8c32335 100644
--- a/elks/arch/i86/lib/loadall.S
+++ b/elks/arch/i86/lib/loadall.S
@@ -171,9 +171,9 @@ loadall_block_op:
     mov %sp,%es:(0x2c)
     //mov %di,%es:(0x26) #comment out sets SI, DI zero for REP MOVS
     //mov %si,%es:(0x28)
-    mov %bx,%es:(0x30) #BX is block operation code
+    mov %bx,%es:(0x2e) #BX is block operation code
     mov %cx,%es:(0x32) #CX is byte count for block operation
-    //mov %dx,%es:(0x2e) #the following not required to be preserved
+    //mov %dx,%es:(0x30) #the following not required to be preserved
     //mov %ax,%es:(0x34)
     .byte 0x0F,0x05 #loadall opcode - execution continues at new IP
     # not reached
Hopefully this fixes the problem you are seeing with the copy.
Thank you!
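For reference, the register slots touched by this patch can be written out as a C struct. This is only a sketch based on the commonly documented 80286 LOADALL image at physical 800h-865h; the struct, field and helper names are illustrative and are not taken from loadall.S, and it assumes ES points at the start of the image (segment 0x80). The general-register offsets agree with the ones used in the patch (SP 0x2c, BX 0x2e, DX 0x30, CX 0x32, AX 0x34).

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the 80286 LOADALL register image (names are illustrative). */
struct loadall286 {
    uint16_t unused0[3];        /* 0x00-0x05 */
    uint16_t msw;               /* 0x06 */
    uint16_t unused1[7];        /* 0x08-0x15 */
    uint16_t tr;                /* 0x16 */
    uint16_t flags;             /* 0x18 */
    uint16_t ip;                /* 0x1a */
    uint16_t ldt;               /* 0x1c */
    uint16_t ds;                /* 0x1e */
    uint16_t ss;                /* 0x20 */
    uint16_t cs;                /* 0x22 */
    uint16_t es;                /* 0x24 */
    uint16_t di;                /* 0x26 */
    uint16_t si;                /* 0x28 */
    uint16_t bp;                /* 0x2a */
    uint16_t sp;                /* 0x2c */
    uint16_t bx;                /* 0x2e: block operation code in this routine */
    uint16_t dx;                /* 0x30 */
    uint16_t cx;                /* 0x32: byte count in this routine */
    uint16_t ax;                /* 0x34 */
    uint8_t  desc_caches[0x30]; /* 0x36-0x65: descriptor caches and table registers */
};

/* Physical address of a slot, assuming the image starts at 0x800. */
static uint32_t loadall_slot_phys(size_t slot_offset)
{
    return 0x800u + (uint32_t)slot_offset;
}
```

With this layout, offsetof(struct loadall286, bx) is 0x2e (physical 82Eh) and offsetof(struct loadall286, dx) is 0x30 (physical 830h), which matches the corrected lines.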
Hello @drachen6jp,
What is most surprising to me is that this ever worked in the first place
After studying a bit more, I think the reason it worked despite the wrong programming is that the DX value was never saved (for speed reasons, since it is allowed to be clobbered), so the BX slot at 082E was never written and BX was always loaded as 0. This caused the operation to always be COPY. When the filesystem sometimes issued a CLEAR command, this would either corrupt the buffer by not zeroing it or copy to invalid locations in the passed GDT, crashing the system.
Thank you!
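The "always COPY" behaviour can be modelled on a host machine. The sketch below is not the kernel code: it assumes the image is cleared before each use and that the CPU loads BX from offset 0x2e of the image, and every name in it is hypothetical.

```c
#include <stdint.h>
#include <string.h>

enum { SLOT_BX = 0x2e, SLOT_DX = 0x30, IMAGE_SIZE = 0x66 };
enum { OP_COPY = 0, OP_CLEAR = 1 };

/* Model one call: clear the image, store the requested operation code at
 * store_offset, then return what the CPU would load into BX. */
static uint16_t bx_seen_by_cpu(uint16_t requested_op, int store_offset)
{
    uint8_t image[IMAGE_SIZE];
    uint16_t bx;

    memset(image, 0, sizeof image);
    memcpy(image + store_offset, &requested_op, sizeof requested_op);
    memcpy(&bx, image + SLOT_BX, sizeof bx);   /* CPU reads BX from its own slot */
    return bx;
}
```

Storing the code in the DX slot, as the old line did, makes bx_seen_by_cpu(OP_CLEAR, SLOT_DX) return OP_COPY; storing it in the BX slot, as the patch does, returns OP_CLEAR.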
fsck-dos.os2 after copying files from fd1232 to hda1 on Neko.
It would be interesting to see the results of fsck-dos.os2 on hda1 damaged by LOADALL problem on real hardware or Neko running LOADALL mode. This would tell us more about how good a disk checker fsck-dos really is.
Hello @ghaerr and @drachen6jp ,
I have some good news and some bad news.
First of all, here is the fsck-dos.os2 result requested by @ghaerr of the broken files.
Does the fsck modify the boot block signature? It says so after running it once although it can still boot.
[good news] I could install and copy files using LOADALL when #hma=kernel xms=on #xms=int15 Thank you!
[bad news] It still gets stuck if I enable hma=kernel when copying.
I tried cp -v /bin/* /mnt/tmp/ a couple of times
and once I got an A20 LINE ERROR (hardware message from PC-98).
Other times it just got stuck here.
[good news] I could install and copy files using LOADALL when #hma=kernel xms=on
That is good news, so @drachen6jp's LOADALL fix is working.
[bad news] It still gets stuck if I enable hma=kernel when copying. once I got an A20 LINE ERROR (hardware message from PC-98).
I am guessing this is a separate issue from LOADALL. This could easily be caused by the PC98 BIOS if it switches A20 line for any reason while ELKS is running.
Did PC-98 work well before using xms=on (not xms=int15)?
Next step would be testing without XMS but with HMA:
hma=kernel
#xms=on
#xms=int15
I would also guess that hma=kernel will work fine with xms=int15, but that should also be tested. If the BIOS ever switches A20, then running the kernel in HMA won't work, as the timer interrupt could vector into the HMA area when A20 is OFF.
First of all, here is the fsck-dos.os2 result requested by @ghaerr of the broken files. Does the fsck modify the boot block signature? It says so after running it once although it can still boot.
That is interesting, we are just starting to learn about fsck-dos, since it hasn't been used much before on corrupted filesystems, it seems. I was unaware it tries to rewrite boot block, but it seems to do so.
I can see how fsck might complain on 1232k floppies with 1024 byte sectors, since AA55 would be in wrong place (although I think PC98 boot still has additional boot marker both at 510 and 1022, right?). But this seems to be on hard disk with 512 byte sectors, correct?
A quick test would be to use hd /dev/hda1 | more to inspect the first 512 bytes of image when corrupted. Then look to see what bytes 510-511 look like before and after corruption/fsck.
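The check on bytes 510-511 can also be done programmatically. A minimal sketch, assuming the first sector has already been read into a buffer; the function name is made up for illustration. The AA55 word is stored little-endian, so the byte at offset 510 is 0x55 and the byte at 511 is 0xAA (or 1022 and 1023 for a 1024-byte sector).

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the last two bytes of the first sector are 0x55, 0xAA. */
static int has_boot_signature(const uint8_t *buf, size_t len, size_t sector_size)
{
    if (sector_size < 2 || len < sector_size)
        return 0;
    return buf[sector_size - 2] == 0x55 && buf[sector_size - 1] == 0xAA;
}
```

Calling it with sector_size 512 and then 1024 on the same buffer shows which of the two positions holds the marker.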
once I have gotten A20 LINE ERROR (hardware message from PC-98).
It is curious that the PC98 BIOS has an error message for an A20 line error: this suggests to me that the BIOS cares about the state of the A20 line, which likely also means it requires A20 to be in different states? If so, this would explain why hma=kernel does not and won't work on PC98. Is there any BIOS documentation about A20 use that you have seen?
Hello @ghaerr , I am a little bit busy and sorry for the late reply.
Did PC-98 work well before using xms=on (not xms=int15?).
I think so, as I wrote https://github.com/ghaerr/elks/issues/2398#issue-3505937003
I would also guess that hma=kernel will work fine with xms=int15
It seems that elks automatically disables xms buffer when hma and int15 are both enabled.
Is there any BIOS documentation about A20 use that you have seen?
I don't have the BIOS documentation for the errors and I don't know how the PC-98 detects an A20 error, but it seems that this error message is well known to PC-98 users and appears when an A20 I/O operation fails.
I am thinking of adding some debug code to the cp command to check A20.
Thank you.
Did PC-98 work well before using xms=on (not xms=int15?). I think so
Ok, so it seems that the LOADALL functionality is working without failure, except when the kernel is in HMA, which is consistent with what we were previously thinking.
It seems that elks automatically disables xms buffer when hma and int15 are both enabled.
Yes, since most (but not necessarily all) INT 15/1F BIOS code depends on A20 state, ELKS currently auto-disables XMS with the following code and message:
#define AUTODISABLE 1   /* =1 to disable XMS w/HMA if BIOS INT 15 disables A20 */
...
#if AUTODISABLE
    if (kernel_cs == 0xffff) {
        /* BIOS INT 15/1F block_move disables A20 on most systems! */
        printk("disabled w/kernel HMA and int 15/1F\n");
        return XMS_DISABLED;
    }
#endif
I have not seen this message in any of your screenshots - is this being displayed or not? If it is not displayed, this may be the reason for the failure: that both INT 1F and HMA are in use/enabled simultaneously.
it seems that this error message is well known for PC-98 users and appears when A20 I/O operation failed
I see. That would seem to indicate that INT 1F was enabled. We will know more when it is verified whether HMA and INT 1F are in fact both enabled also or not.
I am thinking to add some debug codes in cp command to check A20.
It might be better to add debug output in the kernel BIOSHD driver just before BIOS I/O. This might not be necessary if we find that for some reason HMA and INT 1F are both enabled (e.g. if AUTODISABLE is off for some reason).
I have confirmed int1F closes A20, so AUTODISABLE is also necessary for PC-98 (and yes it is already AUTODISABLE)
So, the current issue for my hardware is just the LOADALL case. I will check my hardware failure using LOADALL with this debug cp when I have time. (I don't think LOADALL closes A20 but anyway...)
Thank you.
I have confirmed int1F closes A20, so AUTODISABLE is also necessary for PC-98 (and yes it is already AUTODISABLE)
Can you check the boot screen for the "xms: disabled w/kernel HMA and int 15/1F" message?
int1F closes A20
Does "close" mean "off"? That is, A20 is disabled/off? In that case, ELKS should crash immediately when cp makes a system call, as that would vector to the kernel in HMA. When the vector goes to FFFF:XXXX in HMA and A20 is off, the system will execute code in low memory at 0000:XXXX instead of FFFF:XXXX.
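The wraparound can be worked through with real-mode address arithmetic. A small sketch (the function name is made up): the linear address is segment * 16 + offset, and with A20 off bit 20 is forced to zero, so addresses at or above 1 MB wrap into low memory.

```c
#include <stdint.h>

/* Physical address reached by seg:off in real mode, with A20 on or off. */
static uint32_t phys_addr(uint16_t seg, uint16_t off, int a20_on)
{
    uint32_t linear = ((uint32_t)seg << 4) + off;   /* at most 0x10FFEF */

    return a20_on ? linear : (linear & 0xFFFFFUL);  /* A20 off masks bit 20 */
}
```

For example, FFFF:0010 is physical 100000h with A20 on, but physical 00000h with A20 off. Strictly, FFFF:XXXX lands 10h bytes below 0000:XXXX, but in either case the CPU ends up in low memory rather than in the kernel code.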
current issue for my hardware is just for LOADALL case.
All LOADALL code should be disabled (automatically) when kernel is HMA. This would mean no XMS, just HMA. If this is all true, then problem may be that PC98 BIOS still enables/disables A20 during certain disk I/O or other calls, which would still crash kernel as described above.
So it seems we don't yet know whether issue is XMS code related, or PC98 BIOS, or kernel HMA.
Thank you.
EDIT: I couldn't take screen shot but the above emulator case is #hma=kernel xms=on xms=int15 and yes there was a message "xms: disabled w/kernel HMA and int 15/1F" if hma=kernel
I have added code to cp, similar to a20_pc98.inc in arch/i86/lib but written in C, to check A20. The above case is not in HMA, so it does not crash.
It says A20 open if LOADALL is used in the emulator.
I would like to try this with and without hma in the real hardware.
BTW, it seems that @drachen6jp can't reproduce this issue with his hardware, so my hardware might be a special case.
Sorry, I edited some messages above.
Hello @ghaerr ,
I got some additional information. Here are screenshots from my PC-9801RX with HMA and the LOADALL buffer in use, copying files from fd1232 to hdd.
hma=kernel xms=on #xms=int15
I couldn't get an A20 close message, but
it seems that every time I do cp_debug -v /bin/* /mnt/tmp
it gets stuck when copying fm.
Once I got bioshd: I/O read error
If I reboot the PC and check /mnt/tmp
then there are no files at all. (Possibly a buffer issue?)
EDIT:
If I copy just the one file fm then it seems to succeed. (I did the full copy again after this and I got the ADRESS 20 LINE ERROR hardware message...)
If hma is disabled no issues are found.
So, yes, as you mentioned, we don't yet know whether the issue is XMS code related, or PC98 BIOS, or kernel HMA.
Thank you.
In general, trying to figure out what is happening at the kernel buffer layer by watching cp results and A20 will be very complicated, since most application I/O is performed within buffers and no physical I/O performed until no more buffers available. This is likely why system is crashing always with fm, and also BIOS I/O error in same spot - since no more buffers are available at that point. Better would be to insert sync() call within cp.c after each copy. But still not the easiest way to debug kernel issue.
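A sketch of the sync()-after-each-copy idea, written as a standalone helper rather than as a patch to the actual cp.c (the function name and buffer size are assumptions). Flushing after every file should force the physical I/O to happen right after that file, so a hang can be tied to a specific copy instead of to whenever the buffers happen to run out.

```c
#include <fcntl.h>
#include <unistd.h>

/* Copy src to dst, then flush kernel buffers so physical I/O happens now. */
static int copy_and_sync(const char *src, const char *dst)
{
    char buf[1024];
    ssize_t n;
    int in, out;

    in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }
    while ((n = read(in, buf, sizeof buf)) > 0) {
        if (write(out, buf, (size_t)n) != n) {
            n = -1;             /* short write or error */
            break;
        }
    }
    close(in);
    close(out);
    sync();                     /* write dirty buffers out before the next file */
    return n < 0 ? -1 : 0;
}
```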
Did HMA ever work with PC-98? This is a tricky question since HMA could have appeared to work until buffers were full and I/O occurred that changed A20 line as a result of BIOS call. I am thinking perhaps HMA never worked properly on PC-98 unless perhaps this is an issue only with your PC98 real hardware (unlikely) although @drachen6jp cannot duplicate any errors, correct?
How about with
hma=kernel #xms=on #xms=int15
Does everything work without error? This would indicate that LOADALL XMS code may still be faulty in peculiar way.
Yes, last time I tried, the above combination of hma and conventional memory buffers seemed to work well. I will check again when I have time.
Hello @ghaerr ,
I had some time, so I had another look. With hma and conventional memory buffers, it works great.
I also enabled the strace function with hma and XMS memory.
It hangs after read (and there is not much additional information).
I found that the '|' in the right corner (not in the picture) keeps rotating, so timer_tick is still working.
Is the timer_tick code in lower memory?
I have also tried sync=30, which was introduced here, but that does not improve it. https://github.com/ghaerr/elks/pull/1147
This might be related, but I don't know. https://github.com/ghaerr/elks/issues/1367
Thank you.
Hello @tyama501,
I have also tried sync=30 that introduced here, but that does not improve it. https://github.com/ghaerr/elks/pull/1147 This might be related, but I don't know. https://github.com/ghaerr/elks/issues/1367
I will look at those to see if they might be related.
Is the timer_tick code in lower memory?
No - when HMA on, all of the kernel near code segment is at FFFF, in HMA. Only the far kernel code segments (FARPROC, FATPROC, GENPROC, BFPROC, DFPROC) (e.g. FAT filesystem, gen HD I/O, BIOSHD, direct floppy drivers) are in low memory.
I would say at this point we need to learn whether the crash is the result of A20 being changed by the BIOS/timer/etc somehow, which would cause a crash whenever a non-FARPROC kernel routine is called if A20 is off, OR the possibility that our LOADALL routine is causing the problem.
If the PC98 BIOS depends on the contents of the fixed LOADALL addresses at 800-865H to not be changed, that could cause a problem in the BIOS. Our LOADALL routine clears all the values 800-865H every time it is executed. In order to test this, setting HMA to OFF (e.g. not using hma=kernel) could be used.
Does the PC-98 run perfectly when xms=on but no HMA? If so, then this would seemingly rule out LOADALL as the problem.
Next perhaps we consider whether BIOS entry is causing issues with kernel in HMA. We could put a call or wrapper around all BIOS calls (INT 1B in bios-1B.pc98.S) and possibly INT 1F in bios1F-pc98.S. INT 1F should NOT be called when LOADALL is being used. In order to do this, we could rename the lower level routines to __call_bios and __block_move respectively, and then write C routines with the original name that call printk before and after the call to the lower level routines.
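The wrapper idea might look roughly like the following. This is a host-side sketch only: the argument struct, the wrapper name call_bios, and the stub bodies are assumptions made so that it compiles outside the kernel. In ELKS, printk is the kernel's own routine and the renamed __call_bios would be the existing assembly entry point with whatever arguments it really takes.

```c
#include <stdio.h>

/* Host stand-ins (assumptions): the kernel provides the real printk, and the
 * real low-level routine is assembly with its own argument type. */
#define printk printf
struct bios_args { int op; };                 /* hypothetical placeholder */
static int bios_calls;                        /* counts low-level entries */

static int __call_bios(struct bios_args *a)   /* renamed low-level routine (stub) */
{
    bios_calls++;
    return a->op;
}

/* Wrapper that keeps the original name and traces entry and exit. */
int call_bios(struct bios_args *a)
{
    int ret;

    printk("BIOS enter op=%d\n", a->op);
    ret = __call_bios(a);
    printk("BIOS exit ret=%d\n", ret);
    return ret;
}
```

The same pattern would apply to __block_move. If an "enter" line appears with no matching "exit" line, the hang happened inside that BIOS call.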
If an INT 1F routine is being called when LOADALL is in effect, that's a bug. Otherwise we may be dealing with a BIOS issue. To test that, can you test xms=int15, hma=kernel? That would replace LOADALL with INT1F, but still have HMA on. I assume that the PC98 works perfectly with xms=int15 but HMA off, right?
This sounds complicated, but it isn't really, we just will need discipline in eliminating the variables to determine which part of the configuration is causing failure.
Thank you!
Hello @ghaerr
Thank you. I tested more tonight, and things are getting more and more complicated...(I am getting tired...)
It seems that the latest image and the driver are very unstable on my system even when hma and xms are off. (This may be a different issue.) The drives sound abnormal when copying (writing) files or doing dd. (Too much seeking?)
Here is a video of dd to this drive, but it also behaves like this when copying files.
https://github.com/user-attachments/assets/c33bb98a-a3f7-4bb6-a4e8-9259ee21acfb
Copying takes a very long time, with this heavy sound. I tried a different drive and disk but it is the same.
I don't know whether it is an effect of the recent updates.
Is the timer_tick code in lower memory?
I see: since the timer spin is still being shown on the display, that seems to show that the A20 gate is ON, which is required for HMA.
Rereading this entire PR, it seems that a crash only occurs when LOADALL is on (that is, xms=on hma=kernel) and never when xms=int15 hma=kernel, right? Also never when HMA is not being used. This seems to point to a conflict between the way the A20 gate is managed (i.e. set only once to ON at kernel startup) and our LOADALL routine, as opposed to xms=int15, which uses the internal PC98 BIOS INT 1F routine for block move.
It occurs to me that, since you have screen display of ADDRESS LINE A20 ERROR, that perhaps when INT 1F is used, A20 is always set back to ON, whereas when our LOADALL is used, A20 is never touched. This could mean that another part of PC98 BIOS may be fiddling with A20 line we don't know about: does NMI ever get called on PC98?
You could possibly try calling enable_a20_gate() before every call to loadall_block_op() in xms.c to see if that helps. This would set A20 ON just like normal bios_block_move() does when using INT 1F.
It seems that the latest image and the driver is very unstable for my system even hma and xms are off. (may be different issue) The drives sound abnormal when copying (writing) files or doing dd. (Too much seek?)
The reason for that sound is likely the BIOS retries that are occurring, as also shown on screen. Is the floppy OK? Perhaps try with a different floppy.
There have not been other major or even minor changes to PC98 BIOS driver lately so this is likely a different issue, bad media.