copy to a phantom B: drive soft-fails sometimes (int 24h)
boot the attached floppy image with:
qemu-system-i386 -fda floppy.img -vga cirrus -m 4M -cpu 486 -machine isapc -no-hpet
Then, in the DOS prompt:
COPY BIG.A B:\BIG.B
Then press any key 8 times (no need to actually swap any diskette). Result:
Interestingly, if one chooses to (R)etry, then the operation eventually finishes and the copy is successful (confirmed by pkg crc32 big.a).
Now do a COPY KERNEL.FD KERNEL.SYS and restart the VM. Redo the test: it works fine and a proper copy is performed.
Secondary issue
If you look at the int 24h message, it says a write error occurred on drive A:. This is confusing because the destination was B:, not A:. I suspect that when the kernel issues the int 24h call, it forgets to translate the drive letter and keeps the BIOS id.
PS. also tested with VBox, same results
If you look at the int 24h message, it says a write error occurred on drive A:. This is confusing because the destination was B:, not A:. I suspect that when the kernel issues the int 24h call, it forgets to translate the drive letter and keeps the BIOS id.
Since this happens after having "inserted" a diskette into A, it is more likely that the drive is correct, but the operation string is wrong (should not be "write" but "read"). The message itself comes from the SvarCOM int 24h handler and is based on the lowest bit of AH, as set by the kernel.
PS2. also tested with MS-DOS 6 - works fine. So MSDOS=works, FREEDOS=works, EDR=fails.
Confirmed. Though it does work with EDR COMMAND.COM.
how many times did you have to swap the floppy with EDR COMMAND.COM?
It looks like the issue is related to the number of swaps, not the amount of data. Seems to happen always after the 8th swap, and then reoccurs in a pattern afer 4- or 2- swaps
Ten times.
But perhaps EDR COMMAND.COM only masks a kernel problem. We should not deal that much with it, but instead try to find out why the SvarCOM / EDR combination does not work. I am afraid this will be a difficult one...
Quickly created a copy test for checking this. Was not yet able to re-produce the error with it. BUT I noticed that the number the floppy has to be swapped is doubled if the read operation is not a multiple of 512. This is probably the result of the kernel doing deblocking.
So as an optimization the buffer size SvarCOM uses should be rounded down to the next multiple of 512 bytes.
The code I used:
#include <stdio.h>
#include <stdlib.h>
#include <dos.h>
#include <fcntl.h>
static char buf[8192];
int main( int argc, char *argv[] )
{
int hndla, hndlb, i = 1;
unsigned err;
unsigned n_read, n_written;
unsigned long offset = 0;
if ( argc != 3 ) {
printf( "Usage: PHANTOM fileA fileB\n" );
return 0x1f;
}
printf( "INT24H CRITICAL HANDLER @ %Fp\n", _dos_getvect( 0x24 ) );
if ( ( err = _dos_open( argv[1], O_RDONLY, &hndla ) ) != 0 ) {
printf( "error %u!\n", err );
return err;
}
if ( ( err = _dos_creat( argv[2], 0, &hndlb ) ) != 0 ) {
_dos_close( hndla );
return err;
}
while ( 1 ) {
printf( "iteration %d, offset %lu\n", i, offset );
if ( ( err =_dos_read( hndla, buf, sizeof(buf), &n_read ) ) != 0 ) {
printf( "read error %u!\n", err );
goto ret;
}
if ( n_read == 0 ) {
break;
}
if ( ( err = _dos_write( hndlb, buf, n_read, &n_written ) ) != 0
|| ( n_written != n_read ) ) {
printf( "write error %u!\n", err );
goto ret;
}
i++;
offset += n_read;
}
ret:
_dos_close( hndla );
_dos_close( hndlb );
return err;
}
Technically I see no good reason why reading not a multiple of 512 bytes would make more disk swaps neccessary. Even if the deblocking buffer is involved, that should not result in more disk swaps. Likely explanation would be that in order to invalidate the deblocking buffer, the disk must be there. I will first investigate this before looking at the original issue...
as an optimization the buffer size SvarCOM uses should be rounded down to the next multiple of 512 bytes.
Thanks for the suggestion. SvarCOM's COPY uses a buffer that is always a multiple of 512 now: https://github.com/SvarDOS/core/commit/4525c40c2660c9528538a7bbb500e01abd316c73
With this new buffer size the int24h bug is no longer triggered. Maybe that's why EDR's COMMAND.COM works fine. And maybe it's a hint that the issue is somewhere in the "deblocking buffer" part.
Quickly created a copy test for checking this. Was not yet able to re-produce the error with it.
My copying code is this: https://github.com/SvarDOS/core/blob/master/svarcom/trunk/cmd/copy.c#L106
But as already stated, it's the size of the buffer that seems to matter.
With this new buffer size the int24h bug is no longer triggered. Maybe that's why EDR's COMMAND.COM works fine. And maybe it's a hint that the issue is somewhere in the "deblocking buffer" part.
Great to hear. Makes this a little less urgent. Nevertheless we should try to find the root cause of this. It is probable that it has to do with the deblocking buffer.
Regarding the increased disk change count: If I understand the code correctly then before flushing the buffers the system makes sure that the inserted disk has not changed, and therefore has to access the BPB in case of the floppy is not accessed for longer than ~3 seconds, making additional drive changes neccessary. But will have a second look...
I think this is somewhat inefficient. Because if there are no dirty buffers I do not see a reason why the test for disk change is neccessary. But as I am not sure, for now I will go the "if it isn't broken, leave it as is" route.
Call stack of the disk flush routine:
flush_buffer (buffers.asm)
select_adrive (bdevio.asm)
check_media (bdevio.asm)
dd_medchk (disk.asm)
ask_for_disk (disk.asm)
In the below screenshot there is some diagnostic info shown which is output in the routine blockif (bdevio.asm). This routine is responsible for calling the actual read / write routine of the device driver.
The diagnostic info is shown in quadruples: first word is disk (high) and access mode (low), second and third words are 32-bit start sector, fourth value is sector count.
The access mode (first word low) is: bit 0 (0=read, 1=write), bit 2,1 (01=FAT, 10=DIR, 11=DATA).
So, interpreting the values we can deduce that accessing sector FFFFFFFF of drive A: failed. The operation is a write to what the kernel thinks to be a directory sector.
The variable containing the illegal sector value is pblock, which by itself is set from fdrwsec. Not yet sure how FFFFFFFF lands in pblock. Neither pblock or fdrwsec are explicitly set to FFFFFFFF anywhere in the source. I have a guess that this may be the result of a calculation error.
@mateuszviste could you please test SvarCOM again with the rounding to 512 bytes turned off but with less bytes per write, something smaller than 32k? Or did you test this already? If the error does not occur with smaller write sizes this is likely to be some number overflow issue.
@mateuszviste could you please test SvarCOM again with the rounding to 512 bytes turned off but with less bytes per write, something smaller than 32k?
That's what the version attached to this issue does. It uses a copy buffer of about 3K (and not a multiple of 512).
That's what the version attached to this issue does. It uses a copy buffer of about 3K (and not a multiple of 512).
Not true, oops. It seems that I created this issue after implementing the big buffer logic in SvarCOM. I will compile the small buffer r2162 version tomorrow and test it.
interpreting the values we can deduce that accessing sector FFFFFFFF of drive A: failed. The operation is a write to what the kernel thinks to be a directory sector.
So it is truly a WRITE to A: ? Then this yields another question: why is A: written to in the first place? It is supposed to be the source.
Seems that flush_buffer is called recursively via its select_adrive calls (indirectly via media_check). But flush_buffer does not seem to be reentrant, because the buffer to be flushed is not re-tested after a potantial recursive flush. This results in the BCB fields containing unusable values for the flush.
I think 5b1835d8b66363e15813f194e86e8e29b7ee7bee + a027cb89e00acc30cddfd35509c4453f4d9f6a58 fixed the issue. Problem was that flush_buffer was called recursively twice, with the second recursion operating on the buffer of the initial flush_buffer call (same drive). Initial flush_buffer call then continued to operate on an already flushed buffer.
Picked into https://hg.pushbx.org/ecm/edrdos/rev/820c1299fa5c
@mateuszviste could you please test SvarCOM again with the rounding to 512 bytes turned off but with less bytes per write, something smaller than 32k? Or did you test this already? If the error does not occur with smaller write sizes this is likely to be some number overflow issue.
Tested now with a buffer 3808 bytes (and old kernel) - problem confirmed to occur.
Upgraded EDR kernel to a027cb8: issue solved. Awesome, thank you! :-)