edrdos icon indicating copy to clipboard operation
edrdos copied to clipboard

copy to a phantom B: drive soft-fails sometimes (int 24h)

Open mateuszviste opened this issue 1 year ago • 6 comments

boot the attached floppy image with:

qemu-system-i386 -fda floppy.img -vga cirrus -m 4M -cpu 486 -machine isapc -no-hpet

Then, in the DOS prompt:

COPY BIG.A B:\BIG.B

Then press any key 8 times (no need to actually swap any diskette). Result:

image

Interestingly, if one chooses to (R)etry, then the operation eventually finishes and the copy is successful (confirmed by pkg crc32 big.a).

Now do a COPY KERNEL.FD KERNEL.SYS and restart the VM. Redo the test: it works fine and a proper copy is performed.

Secondary issue

If you look at the int 24h message, it says a write error occurred on drive A:. This is confusing because the destination was B:, not A:. I suspect that when the kernel issues the int 24h call, it forgets to translate the drive letter and keeps the BIOS id.

PS. also tested with VBox, same results

mateuszviste avatar Oct 14 '24 15:10 mateuszviste

If you look at the int 24h message, it says a write error occurred on drive A:. This is confusing because the destination was B:, not A:. I suspect that when the kernel issues the int 24h call, it forgets to translate the drive letter and keeps the BIOS id.

Since this happens after having "inserted" a diskette into A, it is more likely that the drive is correct, but the operation string is wrong (should not be "write" but "read"). The message itself comes from the SvarCOM int 24h handler and is based on the lowest bit of AH, as set by the kernel.

mateuszviste avatar Oct 14 '24 15:10 mateuszviste

PS2. also tested with MS-DOS 6 - works fine. So MSDOS=works, FREEDOS=works, EDR=fails.

mateuszviste avatar Oct 14 '24 15:10 mateuszviste

Confirmed. Though it does work with EDR COMMAND.COM.

boeckmann avatar Oct 14 '24 16:10 boeckmann

how many times did you have to swap the floppy with EDR COMMAND.COM?

It looks like the issue is related to the number of swaps, not the amount of data. Seems to happen always after the 8th swap, and then reoccurs in a pattern afer 4- or 2- swaps

mateuszviste avatar Oct 14 '24 16:10 mateuszviste

Ten times.

boeckmann avatar Oct 14 '24 16:10 boeckmann

But perhaps EDR COMMAND.COM only masks a kernel problem. We should not deal that much with it, but instead try to find out why the SvarCOM / EDR combination does not work. I am afraid this will be a difficult one...

boeckmann avatar Oct 14 '24 16:10 boeckmann

Quickly created a copy test for checking this. Was not yet able to re-produce the error with it. BUT I noticed that the number the floppy has to be swapped is doubled if the read operation is not a multiple of 512. This is probably the result of the kernel doing deblocking.

So as an optimization the buffer size SvarCOM uses should be rounded down to the next multiple of 512 bytes.

The code I used:

#include <stdio.h>
#include <stdlib.h>
#include <dos.h>
#include <fcntl.h>

static char buf[8192];

int main( int argc, char *argv[] )
{
	int hndla, hndlb, i = 1;
	unsigned err;
	unsigned n_read, n_written;
	unsigned long offset = 0;

	if ( argc != 3 ) {
		printf( "Usage: PHANTOM fileA fileB\n" );
		return 0x1f;
	}

	printf( "INT24H CRITICAL HANDLER @ %Fp\n", _dos_getvect( 0x24 ) );
	if ( ( err = _dos_open( argv[1], O_RDONLY, &hndla ) ) != 0 ) {
		printf( "error %u!\n", err );
		return err;
	}

	if ( ( err = _dos_creat( argv[2], 0, &hndlb ) ) != 0 ) {
		_dos_close( hndla );
		return err;
	}

	while ( 1 ) {
		printf( "iteration %d, offset %lu\n", i, offset );
		if ( ( err =_dos_read( hndla, buf, sizeof(buf), &n_read ) ) != 0 ) {
			printf( "read error %u!\n", err );
			goto ret;
		}

		if ( n_read == 0 ) {
			break;
		}

		if ( ( err = _dos_write( hndlb, buf, n_read, &n_written ) ) != 0
			 || ( n_written != n_read ) ) {
			printf( "write error %u!\n", err );
			goto ret;
		}

		i++;
		offset += n_read;
	}

ret:
	_dos_close( hndla );
	_dos_close( hndlb );

	return err;
}

boeckmann avatar Oct 17 '24 20:10 boeckmann

Technically I see no good reason why reading not a multiple of 512 bytes would make more disk swaps neccessary. Even if the deblocking buffer is involved, that should not result in more disk swaps. Likely explanation would be that in order to invalidate the deblocking buffer, the disk must be there. I will first investigate this before looking at the original issue...

boeckmann avatar Oct 17 '24 20:10 boeckmann

as an optimization the buffer size SvarCOM uses should be rounded down to the next multiple of 512 bytes.

Thanks for the suggestion. SvarCOM's COPY uses a buffer that is always a multiple of 512 now: https://github.com/SvarDOS/core/commit/4525c40c2660c9528538a7bbb500e01abd316c73

With this new buffer size the int24h bug is no longer triggered. Maybe that's why EDR's COMMAND.COM works fine. And maybe it's a hint that the issue is somewhere in the "deblocking buffer" part.

Quickly created a copy test for checking this. Was not yet able to re-produce the error with it.

My copying code is this: https://github.com/SvarDOS/core/blob/master/svarcom/trunk/cmd/copy.c#L106

But as already stated, it's the size of the buffer that seems to matter.

mateuszviste avatar Oct 18 '24 08:10 mateuszviste

With this new buffer size the int24h bug is no longer triggered. Maybe that's why EDR's COMMAND.COM works fine. And maybe it's a hint that the issue is somewhere in the "deblocking buffer" part.

Great to hear. Makes this a little less urgent. Nevertheless we should try to find the root cause of this. It is probable that it has to do with the deblocking buffer.

Regarding the increased disk change count: If I understand the code correctly then before flushing the buffers the system makes sure that the inserted disk has not changed, and therefore has to access the BPB in case of the floppy is not accessed for longer than ~3 seconds, making additional drive changes neccessary. But will have a second look...

I think this is somewhat inefficient. Because if there are no dirty buffers I do not see a reason why the test for disk change is neccessary. But as I am not sure, for now I will go the "if it isn't broken, leave it as is" route.

boeckmann avatar Oct 18 '24 12:10 boeckmann

Call stack of the disk flush routine:

flush_buffer (buffers.asm)
  select_adrive (bdevio.asm)
   check_media (bdevio.asm)
     dd_medchk (disk.asm)
       ask_for_disk (disk.asm)

boeckmann avatar Oct 18 '24 12:10 boeckmann

In the below screenshot there is some diagnostic info shown which is output in the routine blockif (bdevio.asm). This routine is responsible for calling the actual read / write routine of the device driver.

The diagnostic info is shown in quadruples: first word is disk (high) and access mode (low), second and third words are 32-bit start sector, fourth value is sector count.

The access mode (first word low) is: bit 0 (0=read, 1=write), bit 2,1 (01=FAT, 10=DIR, 11=DATA).

So, interpreting the values we can deduce that accessing sector FFFFFFFF of drive A: failed. The operation is a write to what the kernel thinks to be a directory sector.

Bildschirmfoto 2024-10-19 um 21 32 56

The variable containing the illegal sector value is pblock, which by itself is set from fdrwsec. Not yet sure how FFFFFFFF lands in pblock. Neither pblock or fdrwsec are explicitly set to FFFFFFFF anywhere in the source. I have a guess that this may be the result of a calculation error.

@mateuszviste could you please test SvarCOM again with the rounding to 512 bytes turned off but with less bytes per write, something smaller than 32k? Or did you test this already? If the error does not occur with smaller write sizes this is likely to be some number overflow issue.

boeckmann avatar Oct 19 '24 19:10 boeckmann

@mateuszviste could you please test SvarCOM again with the rounding to 512 bytes turned off but with less bytes per write, something smaller than 32k?

That's what the version attached to this issue does. It uses a copy buffer of about 3K (and not a multiple of 512).

mateuszviste avatar Oct 19 '24 20:10 mateuszviste

That's what the version attached to this issue does. It uses a copy buffer of about 3K (and not a multiple of 512).

Not true, oops. It seems that I created this issue after implementing the big buffer logic in SvarCOM. I will compile the small buffer r2162 version tomorrow and test it.

mateuszviste avatar Oct 19 '24 21:10 mateuszviste

interpreting the values we can deduce that accessing sector FFFFFFFF of drive A: failed. The operation is a write to what the kernel thinks to be a directory sector.

So it is truly a WRITE to A: ? Then this yields another question: why is A: written to in the first place? It is supposed to be the source.

mateuszviste avatar Oct 19 '24 21:10 mateuszviste

Seems that flush_buffer is called recursively via its select_adrive calls (indirectly via media_check). But flush_buffer does not seem to be reentrant, because the buffer to be flushed is not re-tested after a potantial recursive flush. This results in the BCB fields containing unusable values for the flush.

boeckmann avatar Oct 19 '24 21:10 boeckmann

I think 5b1835d8b66363e15813f194e86e8e29b7ee7bee + a027cb89e00acc30cddfd35509c4453f4d9f6a58 fixed the issue. Problem was that flush_buffer was called recursively twice, with the second recursion operating on the buffer of the initial flush_buffer call (same drive). Initial flush_buffer call then continued to operate on an already flushed buffer.

boeckmann avatar Oct 19 '24 22:10 boeckmann

Picked into https://hg.pushbx.org/ecm/edrdos/rev/820c1299fa5c

ecm-pushbx avatar Oct 20 '24 08:10 ecm-pushbx

@mateuszviste could you please test SvarCOM again with the rounding to 512 bytes turned off but with less bytes per write, something smaller than 32k? Or did you test this already? If the error does not occur with smaller write sizes this is likely to be some number overflow issue.

Tested now with a buffer 3808 bytes (and old kernel) - problem confirmed to occur.

Upgraded EDR kernel to a027cb8: issue solved. Awesome, thank you! :-)

mateuszviste avatar Oct 20 '24 12:10 mateuszviste