Windows IO subsystem drops out, leaving system running but deadlocked

Open smopucilowski opened this issue 2 years ago • 52 comments

Over the last year, using winbtrfs versions 1.7.3 through to current 1.7.9 on laptops with operating system on ntfs and data on btrfs, we have been experiencing soft hangs. We have observed this across multiple machines with varying hardware.

Specifically, Windows 10/11 operates normally with light btrfs use with subvolumes, until some point suddenly the entire IO storage stack stops processing requests. Programs running in memory, e.g. internet chat or music streaming, continue to function until they make their own storage IO request. They then soft hang waiting for their own IO requests to complete. Since the IO subsystem is not responding, it's not possible to interact with anything requiring further storage IO, i.e. bringing up a task manager or explorer. Windows system logs don't report anything when the system hangs. It's as if the entire IO subsystem is stuck in some deadlock. I would expect after 60 seconds for everything to crash out saying disk timeouts, however, IO stalls don't time out even after leaving the machines for hours.

The only solution is to hard reset the laptops.

Possibly linked to laptop use with varying sleep states?

How may I go about assisting debugging this? Is there a remote debug facility in windows that can dump stacks/crashes over the network?

smopucilowski avatar Nov 12 '21 08:11 smopucilowski

I can confirm this is happening exactly as you described. On a Windows 10 PC, so I don't think it is linked to laptops or sleep states.

Unfortunately, I don't have any logs or steps to reproduce. It just happens sometimes.

kwencel avatar Nov 12 '21 13:11 kwencel

This also happens quite often to me, usually minutes after doing subvolume operations or writing to multiple subvolumes in quick succession. I just can't tell the exact trigger and have to force reset the machine. This is on Windows 10 and has been happening since I started using WinBtrfs. I just never created an issue since I don't know where to start debugging this.

ticpu avatar Nov 12 '21 18:11 ticpu

Started happening here recently on 2 different computers as well. One is a regular partition, the other is a striped partition across 2 SSDs. Sometimes it comes back on its own after a while, sometimes I have to hard reset. Seems to be triggered by heavy writes (usually happens while downloading games on steam for instance). It did not happen a couple versions ago so it looks like a new regression.

alucryd avatar Nov 14 '21 17:11 alucryd

Can also confirm having similar problems all the time, no idea how to even start to debug this.

PJB3005 avatar Nov 21 '21 18:11 PJB3005

Whatever it is, it's not something obvious, and quite possibly something introduced with later versions of Windows. I'm currently working on a comprehensive test suite which will hopefully help in diagnosing this.

maharmstone avatar Nov 21 '21 19:11 maharmstone

This problem can be reliably reproduced by playing Forza Horizon 5. Can I do something to help debug this?

mhetzi avatar Dec 04 '21 09:12 mhetzi

In my experience running steam games off my BTRFS partition, it seems really consistent once it starts happening for a specific game. Like, Steam consistently locking up while updating, or the game consistently locking up while starting. So I can also reliably reproduce this if I have some debugging instructions.

I tried using the kernel event logging thing mentioned in the readme but couldn't get it to show anything.

PJB3005 avatar Dec 08 '21 20:12 PJB3005

FWIW, it seems I also got it deadlocked with the latest WinBtrfs (v1.7.9, with zstd compression enabled) just by using robocopy with the /MIR option to copy a large folder from an NTFS drive to a BTRFS drive. Almost immediately the BTRFS drive was not accessible elsewhere, e.g. opening the drive's properties in File Explorer showed the error "The process cannot access the file because it is being used by another process". Robocopy kept running for a bit, but after several minutes it started to stutter with "ERROR 32 (0x00000020) Accessing Destination Directory". In the end the drive was completely locked, not accessible from any process, until I restarted. I was able to reproduce the same behavior a few times.

sebres avatar Dec 10 '21 23:12 sebres

I don't know if it's related or if I need to open a new ticket, but I've been noticing that when I update large games on Steam, my btrfs partition ceases to be mounted or to exist until I select it and refresh to bring it back to life.

Could this be a similar IO failure? I AM using disk compression with zstd:3 so maybe that is related.

So far just using the SSD/NVMe drives normally is fine. It's only when Steam does its ultra-intensive IO operations to update or scan something that the partition shows no size value and appears to stop working while Steam stalls (until I select the drive in Explorer and refresh it a few times).

jarrard avatar Dec 19 '21 03:12 jarrard

Same thing happened to me. BTRFS on a 2TB disk, single partition, with a Steam library on it. After downloading Killing Floor 2 and trying to install it, it froze. The game is about 50 GB.

Once it froze, I couldn't do anything. Steam crashed. Explorer was stuck, and restarting was stuck... I had to power off while "Restarting ..." was displayed.

Later I did a "scrub" from Windows and it found no errors... but right after that the volume went read-only, with a cyclic redundancy error. Then I went to Linux, where fstab crashed because it couldn't mount the drive any more. I had to comment out my fstab entry in rescue mode.

The error:

Opening filesystem to check...
checksum verify failed on 1223022346240 wanted 0x00000000 found 0xb6bde3e4
checksum verify failed on 1223022346240 wanted 0x00000000 found 0xb6bde3e4
checksum verify failed on 1223022346240 wanted 0x00000000 found 0xb6bde3e4
bad tree block 1223022346240, bytenr mismatch, want=1223022346240, have=0
ERROR: cannot read chunk root
ERROR: cannot open file system

These errors are probably due to me power restarting. The actual issue is the fact that it deadlocked. Not sure if I can help with providing more info since nothing made any obvious "crash" . But I'm a dev. Not a C dev though. I have VS installed maybe I can debug some stuff, but I need to be guided.

I'm in the process of doing a chunk-recover. I don't care if my data is lost, I'm trying out WinBTRFS... But wow, that's very unstable.

RedKage avatar Dec 31 '21 15:12 RedKage

I've been testing my BTRFS partitions under Linux for a while now without too many issues, but losing power is a pretty nasty thing to happen to solid state drives and most times results in some files being messed up. It does seem like the drives are a little more sensitive under Windows than under Linux.

Fortunately I don't get too many of these IO errors and if I do typically clicking on the stalled btrfs drive in windows explorer eventually brings it back just fine and steam resumes its operations.

There are more detailed logs in the Windows Event Viewer which might be more useful for figuring out what's happening. I forget exactly where, however. They produce .evtx files.

jarrard avatar Jan 01 '22 01:01 jarrard

FWIW, it seems I also got it deadlocked with the latest WinBtrfs (v1.7.9, with zstd compression enabled) just by using robocopy with the /MIR option to copy a large folder from an NTFS drive to a BTRFS drive. Almost immediately the BTRFS drive was not accessible elsewhere, e.g. opening the drive's properties in File Explorer showed the error "The process cannot access the file because it is being used by another process". Robocopy kept running for a bit, but after several minutes it started to stutter with "ERROR 32 (0x00000020) Accessing Destination Directory". In the end the drive was completely locked, not accessible from any process, until I restarted. I was able to reproduce the same behavior a few times.

The same thing happens here when using robocopy to move 70 GB of files from an NTFS disk to a mounted btrfs VHD, except that robocopy runs indefinitely without showing any error and without changing any files. Any process that tries to access the mounted partition freezes with no way to close it. I'm using v1.7.9 with zstd:6 and /csum:xxhash. I'm running Windows 7 x64.

EDIT: This also happens when Steam downloads a game or when I copy a file through explorer.

Gummar avatar Jan 06 '22 05:01 Gummar

It happened on my computer again, when Forza Horizon 5 was running. Windows 10 was unable to close the game or steam and could not shut down either (it would just hang with a dark screen). Afterwards on Linux I ran btrfsck:

❯ sudo btrfsck [REDACTED]
Opening filesystem to check...
Checking filesystem on [REDACTED]
UUID: [REDACTED]
[1/7] checking root items
[2/7] checking extents
super bytes used 1686982955008 mismatches actual used 1683220516864
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
free space info recorded 35 extents, counted 34
there is no free space entry for 609915752448-609915764736
cache appears valid but isn't 608842022912
there is no free space entry for 622789128192-622789132288
there is no free space entry for 622789128192-622800666624
cache appears valid but isn't 621726924800
free space info recorded 1 extents, counted 0
there is no free space entry for 698413178880-698413187072
there is no free space entry for 698413178880-699036336128
cache appears valid but isn't 697962594304
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 1683220516864 bytes used, error(s) found
total csum bytes: 1640095780
total tree bytes: 3762438144
total fs tree bytes: 1400471552
total extent tree bytes: 473677824
btree space waste bytes: 581650624
file data blocks allocated: 1710374117376
 referenced 1882496815104
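
One detail in the output above can be checked with plain arithmetic: the gap between the superblock's "bytes used" and the "actual used" figure is exactly the reported "total tree bytes". The sketch below only subtracts numbers copied from the log; whether the match is meaningful or a coincidence is not something this thread establishes.

```python
# Figures copied verbatim from the btrfsck output above.
super_bytes_used = 1686982955008   # "super bytes used"
actual_bytes_used = 1683220516864  # "mismatches actual used" / "found ... bytes used"
total_tree_bytes = 3762438144      # "total tree bytes"

# How far apart are the two "used" counters?
difference = super_bytes_used - actual_bytes_used

print(difference)                      # size of the mismatch in bytes
print(difference == total_tree_bytes)  # does it equal the metadata tree size?
```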

devurandom avatar Jan 29 '22 13:01 devurandom

Possibly related:

  • https://github.com/maharmstone/btrfs/issues/418
  • https://github.com/maharmstone/btrfs/issues/345
  • https://github.com/maharmstone/btrfs/issues/332
  • https://github.com/maharmstone/btrfs/issues/314
  • https://github.com/maharmstone/btrfs/issues/298
  • https://github.com/maharmstone/btrfs/issues/259

devurandom avatar Jan 30 '22 11:01 devurandom

I'm having the same issue on a 2TB striped Btrfs volume. Though the symptoms are the same, I don't think the entire IO subsystem is dropping out.

The issue occurs for me while actively using the drive, so I left a command prompt open while the drive was in use. When my system started to freeze, I checked what was accessible within that command prompt. My system drive (NTFS) and a Btrfs drive I was not actively using were both accessible, and I could read and write to both drives.
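
The manual check described above can be approximated with a small script that is started before the freeze. This is a minimal sketch, not a WinBtrfs tool: `probe_dir` is a hypothetical helper that writes, reads back and deletes a tiny file in a directory from a background thread, and reports "hung" if that does not finish within a timeout. The drive letters in the usage note are placeholders.

```python
import os
import tempfile
import threading


def probe_dir(directory, timeout=5.0):
    """Return 'ok', 'error: ...' or 'hung' for a tiny write/read/delete in directory."""
    result = {}

    def worker():
        try:
            # Create a uniquely named scratch file inside the probed directory.
            fd, path = tempfile.mkstemp(dir=directory, prefix="probe_")
            with os.fdopen(fd, "wb") as f:
                f.write(b"ping")
                f.flush()
                os.fsync(f.fileno())  # push the write down to the storage stack
            with open(path, "rb") as f:
                data = f.read()
            os.remove(path)
            result["status"] = "ok" if data == b"ping" else "error: data mismatch"
        except OSError as exc:
            result["status"] = f"error: {exc}"

    # Daemon thread: a stuck probe does not keep the caller waiting past `timeout`.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    return result.get("status", "hung")
```

Calling `probe_dir("C:\\")` and `probe_dir("D:\\")` in a loop (the paths are only examples) would show which volumes still answer. A thread blocked inside the kernel cannot be cancelled, so the script itself may not exit cleanly once a volume has hung.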

It seems like there's just a file (or maybe disk) operation hanging on the Btrfs drive, taking most of the system down with it because things like Explorer depend on that drive. This suspicion is further supported by Linux: every time this happens, another unrecoverable error gets added to the volume (btrfs scrub), probably indicating an unsuccessful write.

More information about the volume itself:

Device size:		1.8 TB
Device allocated:	614.0 GB
Device unallocated:	1.2 TB
Data ratio:		1.00
Metadata ratio:		2.00
Data, RAID0: size: 608.0 GB, used: 607.4 GB
Disk 1, partition 1	304.0 GB
Disk 2, partition 1	304.0 GB
Metadata, RAID1: size: 3.0 GB, used: 1.4 GB
Disk 1, partition 1	3.0 GB
Disk 2, partition 1	3.0 GB
System, RAID1: size: 8.0 MB, used: 64.0 KB
Disk 1, partition 1	8.0 MB
Disk 2, partition 1	8.0 MB
Unallocated:
Disk 1, partition 1	624.5 GB
Disk 2, partition 1	624.5 GB

AEAEAEAE4343 avatar Feb 13 '22 19:02 AEAEAEAE4343

I'm having the same issue... and I could read and write.

Then you probably had another issue. At least in my case the (robo)copy process and the "freeze" that follows always cause a total deadlock, and the btrfs drive is completely inaccessible, for neither write nor read operations, until the computer is restarted.

sebres avatar Feb 14 '22 04:02 sebres

Read what I wrote again. I have three drives, two of which are Btrfs. When I write to my big Btrfs storage, after a lot of writing it freezes. The big Btrfs drive freezes, but my other partitions work fine. (At least in CMD, explorer freezes regardless of what directory is selected)

AEAEAEAE4343 avatar Feb 14 '22 09:02 AEAEAEAE4343

but my other partitions work fine.

Why they shouldn't? The kernel driver does its work asynchronously via IoQueue and WorkItems, and tasks are isolated from user space per drive/volume. If such a queue or worker gets deadlocked, it would affect only that single volume.
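
The isolation argument above can be illustrated with a toy user-space model. This is only a sketch under stated assumptions: it is not WinBtrfs or Windows kernel code, and `Volume`, `submit` and `stuck` are hypothetical names. Each volume gets its own request queue and worker thread, and one worker is made to block forever to stand in for a deadlock.

```python
import queue
import threading


class Volume:
    """Toy model: one request queue and one worker thread per volume."""

    def __init__(self, name):
        self.name = name
        self.requests = queue.Queue()
        self.stuck = threading.Event()      # set this to simulate a deadlocked worker
        self._release = threading.Event()   # never set, so waiting on it blocks forever
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            payload, done = self.requests.get()
            if self.stuck.is_set():
                self._release.wait()        # models the worker hanging on this request
            done.put(f"{self.name}: {payload}")

    def submit(self, payload, timeout=0.5):
        """Queue a request and wait for completion; None means it never completed."""
        done = queue.Queue(maxsize=1)
        self.requests.put((payload, done))
        try:
            return done.get(timeout=timeout)
        except queue.Empty:
            return None
```

With two instances, setting `stuck` on one makes its `submit` calls return `None` while the other keeps completing requests, which matches the behaviour reported in the thread: the hung volume is unreachable, other volumes are not.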

explorer freezes regardless of what directory is selected

This is happening because of the way Explorer renders (and reads) the items in the tree in the left part of the window, for instance if it has some reference to directory from affected drive in "Quick access" or because it is still accessing the drive info or something similar. Anyway, note that Explorer is normally a single process (no matter how many folder windows you open), at least unless the checkbox "Launch folder windows in a separate process" is checked in "Folder options / View".

sebres avatar Feb 14 '22 11:02 sebres

Why they shouldn't?

I never said it should. It’s just that the original question suggests that:

suddenly the entire IO storage stack stops processing requests. Programs running in memory, e.g. internet chat or music streaming, continue to function until they make their own storage IO request. They then soft hang waiting for their own IO requests to complete. Since the IO subsystem is not responding, it's not possible to interact with anything requiring further storage IO, i.e. bringing up a task manager or explorer.

All I’m saying is that this isn’t the case, only the drive that gets stuck is inaccessible, which makes sense. If the entire IO subsystem were to drop out, I think the system would bluescreen immediately, not partially freeze up.

for instance if it has some reference to directory from affected drive in "Quick access" or because it is still accessing the drive info or something similar

In my case, I have shortcuts to my Btrfs drive on my desktop which freeze explorer. For file explorer itself, it probably tries reading metadata.

AEAEAEAE4343 avatar Feb 14 '22 11:02 AEAEAEAE4343

Is anyone still experiencing these issues on v1.8?

ArinL avatar Mar 17 '22 19:03 ArinL

I've been mostly using Linux so haven't had much time to test. But I also haven't seen any winbtrfs updates for a while.

jarrard avatar Mar 17 '22 21:03 jarrard

I had a btrfs partition corrupt itself after a recovery power cycle (no valid root node), so I'm a little shy to experiment further and am running with read-only mounting for now.

smopucilowski avatar Mar 17 '22 22:03 smopucilowski

After updating to 1.8, I can no longer boot my system at all. EDIT: I have to correct myself. Fast startup was probably interfering with it; after booting into safe mode once, the system recovered. As for whether the problem is resolved, I'm not entirely sure, as I haven't done any intensive usage yet. One thing I can comment on is access times: they have improved a lot compared to the older version.

AEAEAEAE4343 avatar Mar 18 '22 20:03 AEAEAEAE4343

I think the problem is resolved. Speeds have improved and after testing by copying a ton of data (~100GB) I haven't had any issues.

AEAEAEAE4343 avatar Mar 20 '22 20:03 AEAEAEAE4343

Forza Horizon 4, Forza Horizon 5 and rFactor 2 still cause the driver to crash.

Giovani1906 avatar Mar 22 '22 17:03 Giovani1906

Hit the issue again on winbtrfs v1.8 on a btrfs partition freshly formatted in Linux (2 TB, space_cache=v2,compress=zstd), while copying a few gigabytes into a subvolume. Scrub reports no issues, and I have not encountered a corrupted partition yet.

smopucilowski avatar Mar 24 '22 07:03 smopucilowski

Hitting this bug infrequently when playing some long (~1 hour) files with foobar2000. It can't "close" the audio file, i.e. stop or go to another track, and just hangs. They're Opus files; I don't know if it happens with other codecs.

ghost avatar Mar 31 '22 00:03 ghost

I've had relative success with infrequent large file copies (many gigabytes, queue depth of 1).

It feels like this bug triggers on multiple small read/writes, or IO with deep queue depth.
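
If that hunch is right, a workload of many small concurrent writes might be a way to provoke the hang on demand. The following is a hedged sketch of such a workload, not a confirmed reproducer: `small_write_stress` is a hypothetical helper, and the number of worker threads only approximates IO concurrency rather than setting a real device queue depth.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def small_write_stress(directory, files=200, size=4096, workers=16):
    """Write many small files concurrently, read each back, return how many verified."""

    def one(i):
        path = os.path.join(directory, f"stress_{i:05d}.bin")
        payload = os.urandom(size)
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the write through to the volume
        with open(path, "rb") as f:
            ok = f.read() == payload
        os.remove(path)
        return ok

    # Several threads issue IO at once, unlike a single large sequential copy.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(one, range(files)))
```

Pointing this at a scratch directory on the affected volume, and comparing a run with `workers=1` against one with many workers, would be one way to test the theory. Given the corruption reports earlier in the thread, it should only be run against data that can be lost.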

smopucilowski avatar Apr 06 '22 04:04 smopucilowski

I still have the problem on v1.8. In my case it usually happens when steam is downloading or updating a game. I have to hard reset my computer because it becomes unresponsive.

gmmoreira avatar Apr 29 '22 12:04 gmmoreira

I have the same issue. It happens when unpacking an 11 GB .zip file from and onto the same btrfs partition.

Kamilcuk avatar Jul 04 '22 18:07 Kamilcuk