btrfs-progs

Cannot cancel device removal

Open • smurfix opened this issue 4 months ago • 9 comments

```
# btrfs dev rem /dev/sdg1 /mnt/x &
…
# btrfs dev rem cancel /mnt/x
Request to cancel running device deletion
ERROR: error removing device 'cancel': Operation canceled
#
```

The removal continues unabated.

smurfix • Aug 04 '25 12:08

You can just kill the process. I don't know why btrfs device remove cancel is there at all.

frukto • Sep 24 '25 11:09

Maybe it's easier than grepping through "ps" output? Especially when there's more than one btrfs in the system and the job uses a relative path …

NB: there are some file-system- and device-level operations where killing the process does not affect the operation, e.g. when you move a volume with lvm. Thus I would not expect killing the job to actually do anything.
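
A minimal sketch of the "ps" lookup mentioned above, assuming Linux with /proc mounted and the procps `pgrep` tool available. The helper names `describe_pid` and `list_btrfs_jobs` are made up for illustration and are not part of btrfs-progs. The working directory is printed because a job started with a relative path can only be matched against its cwd.

```shell
#!/bin/sh
# Hypothetical helpers (not part of btrfs-progs).

# Print PID, working directory, and command line of one process.
describe_pid() {
    pid=$1
    cwd=$(readlink "/proc/$pid/cwd" 2>/dev/null) || cwd='?'
    cmd=$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline") || cmd='?'
    printf '%s\tcwd=%s\t%s\n' "$pid" "$cwd" "$cmd"
}

# Describe every process whose name is exactly "btrfs"
# (-x avoids matching btrfs-* kernel threads).
list_btrfs_jobs() {
    for pid in $(pgrep -x btrfs 2>/dev/null); do
        describe_pid "$pid"
    done
}

list_btrfs_jobs
```

Reading another user's /proc entries may need root; in that case the sketch prints `?` for the fields it cannot read.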

smurfix • Sep 24 '25 12:09

Well, if the job does not background itself (the way lvm or scrubbing does), my expectation is that it actually (gracefully) stops its operation when it gets killed (gracefully).

frukto • Sep 24 '25 13:09

> You can just kill the process. I don't know why btrfs device remove cancel is there at all.

You know that a process cannot be killed by a signal while it's stuck in kernel space? That's because signal handling happens in user space.

It only works because inside those ioctls we explicitly check for pending signals, and even with those checks, it only works for fatal ones.

> The removal continues unabated.

Any dmesg? And kernel version?
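
A minimal sketch for gathering both pieces of information, assuming a Linux system with `uname` and `dmesg` available. The function name `collect_debug_info` is hypothetical, and reading the kernel log may require root, depending on the `kernel.dmesg_restrict` setting.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel release and recent btrfs log lines.
collect_debug_info() {
    echo "kernel: $(uname -r)"
    echo "--- dmesg (btrfs, last 50 lines) ---"
    # Errors (e.g. missing permission) are hidden; an empty section may
    # simply mean the log could not be read.
    dmesg 2>/dev/null | grep -i btrfs | tail -n 50 || true
}

collect_debug_info
```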

adam900710 • Sep 24 '25 22:09

@adam900710 Fair point. Are you implying that sending SIGTERM/SIGINT to a device remove is generally unsafe?

frukto • Sep 25 '25 06:09

SIGTERM/SIGINT just won't do anything. Scrub (dev-replace reuses the scrub path) and balance only check for fatal signals, so only SIGKILL counts.

So that's why we have ioctls to cancel/pause dev-replace/scrub/relocation.

adam900710 • Sep 25 '25 06:09

> SIGTERM/SIGINT just won't do anything. Scrub (dev-replace reuses the scrub path) and balance only check for fatal signals, so only SIGKILL counts.

A scrub command starts the scrubbing and returns immediately, so there is basically nothing to send a signal to. But btrfs device remove stays active:

```
btrfs dev rem /dev/sdg1 /mnt/x
^C    (SIGINT)
```

So (unless it is stuck in kernel space), this should gracefully cancel the device removal? Does it? At least in my case it seemed to work this way.

(Sorry for hijacking/diverting the issue.)

frukto • Sep 25 '25 07:09

> Scrub (dev-replace reuses the scrub path) and balance only check for fatal signals, so only SIGKILL counts.

Balance can be cancelled by Ctrl-C, i.e. SIGINT, but it's indeed only checking SIGKILL (fatal_signal_pending). I'm puzzled.

kdave • Sep 25 '25 07:09

> > Scrub (dev-replace reuses the scrub path) and balance only check for fatal signals, so only SIGKILL counts.

> Balance can be cancelled by Ctrl-C, i.e. SIGINT, but it's indeed only checking SIGKILL (fatal_signal_pending). I'm puzzled.

It looks like it's wait_on_bit() inside btrfs_relocate_block_group(), which is called with TASK_INTERRUPTIBLE.

And the real wait function is bit_wait(), which checks for any pending signal, not only the fatal ones.

adam900710 • Sep 25 '25 08:09