
ZSTD-related crashes after many iterations of a "bad task"

Open sempervictus opened this issue 5 years ago • 94 comments

System information

Type Version/Name
Distribution Name Arch
Distribution Version Current
Linux Kernel 5.4.84
Architecture x64
ZFS Version 2.0

Describe the problem you're observing

Describe how to reproduce the problem

  • Create a 1G single-dataset ZFS pool with dedup=on, checksum=sha512, compression=zstd-19
  • Run an infinite loop against it, e.g. while true; do rsync -a /usr/* <path to dataset mount> || rm -rf <path to mount>/; done (a fuller sketch follows below)
    • Wait for many iterations
    • Watch the bloodbath in dmesg after a while
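
For completeness, a fuller sketch of the setup (the backing file, pool name, and mountpoint are placeholders, not the exact ones used; a file-backed vdev is used here just for illustration):

# Placeholders: adjust the backing file, pool name and mountpoint to taste.
truncate -s 1G /var/tmp/zstd-test.img
zpool create -O dedup=on -O checksum=sha512 -O compression=zstd-19 \
    -m /mnt/zstdtest testpool /var/tmp/zstd-test.img

# Fill the pool until rsync fails (pool full), wipe it, repeat forever.
while true; do
    rsync -a /usr/* /mnt/zstdtest/ || rm -rf /mnt/zstdtest/*
done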

Include any warning/errors/backtraces from the system logs

Dec 22 03:59:56 svl-arch00 kernel: BUG: unable to handle page fault for address: 00000007fffffffc
Dec 22 03:59:56 svl-arch00 kernel: #PF: supervisor read access in kernel mode
Dec 22 03:59:56 svl-arch00 kernel: #PF: error_code(0x0000) - not-present page
Dec 22 03:59:56 svl-arch00 kernel: PGD 0 P4D 0 
Dec 22 03:59:56 svl-arch00 kernel: Oops: 0000 [#1] SMP
Dec 22 03:59:56 svl-arch00 kernel: CPU: 1 PID: 12186 Comm: z_wr_iss Tainted: G                T 5.4.84 #1
Dec 22 03:59:56 svl-arch00 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
Dec 22 03:59:56 svl-arch00 kernel: RIP: 0010:zfs_ZSTD_compressBlock_btultra2+0x2cdd/0x49d0
Dec 22 03:59:56 svl-arch00 kernel: Code: c7 00 00 00 00 00 8b 44 24 50 83 e8 08 41 89 46 24 45 85 db 0f 84 d7 f9 ff ff 48 8b 9c 24 f0 00 00 00 41 8d 43 ff 48 8d 04 c3 <44> 8b 60 04 44 39 64 24 28 0f 82 89 15 00 00 8b 9c 24 d8 00 00 00
Dec 22 03:59:56 svl-arch00 kernel: RSP: 0018:ffff9ea7c23ef7c0 EFLAGS: 00010202
Dec 22 03:59:56 svl-arch00 kernel: RAX: 00000007fffffff8 RBX: 0000000000000000 RCX: ffff9ea7d6634a40
Dec 22 03:59:56 svl-arch00 kernel: RDX: ffff9ea7d662d400 RSI: ffff9ea7d6634a41 RDI: ffff9ea7d662d4ff
Dec 22 03:59:56 svl-arch00 kernel: RBP: ffff9ea7c23ef940 R08: ffff9ea7d6484280 R09: 000000000000001e
Dec 22 03:59:56 svl-arch00 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Dec 22 03:59:56 svl-arch00 kernel: R13: ffff9ea7d6634a41 R14: ffff91e6ad380320 R15: 0000000000000003
Dec 22 03:59:56 svl-arch00 kernel: FS:  0000000000000000(0000) GS:ffff91e77ba80000(0000) knlGS:0000000000000000
Dec 22 03:59:56 svl-arch00 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 22 03:59:56 svl-arch00 kernel: CR2: 00000007fffffffc CR3: 000000013a0a1006 CR4: 0000000000360ee0
Dec 22 03:59:56 svl-arch00 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 22 03:59:56 svl-arch00 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 22 03:59:56 svl-arch00 kernel: Call Trace:
Dec 22 03:59:56 svl-arch00 kernel:  zfs_ZSTD_buildSeqStore+0x192/0x2a0
Dec 22 03:59:56 svl-arch00 kernel:  zfs_ZSTD_compressBlock_internal+0x2a/0x2f0
Dec 22 03:59:56 svl-arch00 kernel:  zfs_ZSTD_compressContinue_internal+0x459/0x6d0
Dec 22 03:59:56 svl-arch00 kernel:  zfs_ZSTD_compressEnd+0x35/0x370
Dec 22 03:59:56 svl-arch00 kernel:  zfs_ZSTD_compressStream2.part.0+0x60e/0x8e0
Dec 22 03:59:56 svl-arch00 kernel:  ? zfs_ZSTD_compress2+0x7a/0xc0
Dec 22 03:59:56 svl-arch00 kernel:  zfs_ZSTD_compress2+0x7a/0xc0
Dec 22 03:59:56 svl-arch00 kernel:  zfs_zstd_compress+0xdd/0x170
Dec 22 03:59:56 svl-arch00 kernel:  zio_compress_data+0xc1/0xf0
Dec 22 03:59:56 svl-arch00 kernel:  zio_write_compress+0x44b/0x8f0
Dec 22 03:59:56 svl-arch00 kernel:  ? __wake_up_common_lock+0x86/0xb0
Dec 22 03:59:56 svl-arch00 kernel:  zio_execute+0x8a/0x130
Dec 22 03:59:56 svl-arch00 kernel:  taskq_thread+0x2ce/0x510
Dec 22 03:59:56 svl-arch00 kernel:  ? try_to_wake_up.isra.0+0x3a0/0x3a0
Dec 22 03:59:56 svl-arch00 kernel:  ? zio_destroy+0xe0/0xe0
Dec 22 03:59:56 svl-arch00 kernel:  ? taskq_thread_spawn+0x60/0x60
Dec 22 03:59:56 svl-arch00 kernel:  kthread+0x10f/0x130
Dec 22 03:59:56 svl-arch00 kernel:  ? kthread_associate_blkcg+0x90/0x90
Dec 22 03:59:56 svl-arch00 kernel:  ret_from_fork+0x1f/0x40
Dec 22 03:59:56 svl-arch00 kernel: Modules linked in: intel_rapl_msr intel_rapl_common intel_pmc_core_pltdrv intel_pmc_core kvm_intel kvm irqbypass crct10dif_pclmul ghash_clmulni_intel joydev rapl mousedev uio_pdrv_genirq qxl evdev uio input_leds led_class mac_hid ttm drm_kms_helper fb_sys_fops syscopyarea sysfillrect psmouse i2c_piix4 sysimgblt button pcspkr qemu_fw_cfg sch_fq_codel drm fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 ata_generic pata_acpi virtio_net net_failover virtio_blk virtio_balloon failover serio_raw atkbd libps2 crc32_pclmul crc32c_intel uhci_hcd aesni_intel glue_helper crypto_simd cryptd ehci_pci ehci_hcd usbcore virtio_pci usb_common ata_piix i8042 floppy serio
Dec 22 03:59:56 svl-arch00 kernel: CR2: 00000007fffffffc
Dec 22 03:59:56 svl-arch00 kernel: ---[ end trace ba9620a04c37d7fc ]---
Dec 22 03:59:56 svl-arch00 kernel: RIP: 0010:zfs_ZSTD_compressBlock_btultra2+0x2cdd/0x49d0
Dec 22 03:59:56 svl-arch00 kernel: Code: c7 00 00 00 00 00 8b 44 24 50 83 e8 08 41 89 46 24 45 85 db 0f 84 d7 f9 ff ff 48 8b 9c 24 f0 00 00 00 41 8d 43 ff 48 8d 04 c3 <44> 8b 60 04 44 39 64 24 28 0f 82 89 15 00 00 8b 9c 24 d8 00 00 00
Dec 22 03:59:56 svl-arch00 kernel: RSP: 0018:ffff9ea7c23ef7c0 EFLAGS: 00010202
Dec 22 03:59:56 svl-arch00 kernel: RAX: 00000007fffffff8 RBX: 0000000000000000 RCX: ffff9ea7d6634a40
Dec 22 03:59:56 svl-arch00 kernel: RDX: ffff9ea7d662d400 RSI: ffff9ea7d6634a41 RDI: ffff9ea7d662d4ff
Dec 22 03:59:56 svl-arch00 kernel: RBP: ffff9ea7c23ef940 R08: ffff9ea7d6484280 R09: 000000000000001e
Dec 22 03:59:56 svl-arch00 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Dec 22 03:59:56 svl-arch00 kernel: R13: ffff9ea7d6634a41 R14: ffff91e6ad380320 R15: 0000000000000003
Dec 22 03:59:56 svl-arch00 kernel: FS:  0000000000000000(0000) GS:ffff91e77ba80000(0000) knlGS:0000000000000000
Dec 22 03:59:56 svl-arch00 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 22 03:59:56 svl-arch00 kernel: CR2: 00000007fffffffc CR3: 000000013a0a1006 CR4: 0000000000360ee0
Dec 22 03:59:56 svl-arch00 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 22 03:59:56 svl-arch00 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

sempervictus avatar Dec 22 '20 09:12 sempervictus

As ZSTD-19 is NOT very friendly to low- to medium-end systems, please share your system's specs.

Also: be aware that ZSTD updates might make your dedup less efficient, and ZSTD will be kept up to date, hence ZSTD is not advised for use with deduplication. But that should not lead to crashes like this.

PrivatePuffin avatar Dec 22 '20 10:12 PrivatePuffin

System's a test VM, 4GB of RAM, 4 KVM cores. Regardless of how memory-constrained a system is, this shouldn't happen. Kinda dampens the whole "portability" premise :). If ZFS permits this sort of resource starvation, there are any number of potential DoS-style attacks which could be carried out against the compression implementation, with potentially adverse outcomes. The fact that it happens after many iterations, though, does lead me to wonder - is there a memory leak somewhere?

sempervictus avatar Dec 22 '20 14:12 sempervictus

ZSTD-19 is NOT meant for general use and is only reasonable for VERY specific use cases.

Your system is not capable of running ZSTD-19 on any reasonably heavy workload, just like it isn't capable of running dedup on a reasonably big workload. You are running a compression level that REQUIRES an extreme amount of RAM. Running into issues is to be expected...

TL;DR: ZSTD-19 is a very, VERY hungry beast, not to be used unless you know what you are doing.

However: there might be a bug beyond this being an extreme edge-case setup. What I think is happening is that it spawns new memory pools before the old ones are removed, which amplifies the memory consumption inherent to ZSTD-19. Any thoughts, @BrainSlayer?

PrivatePuffin avatar Dec 22 '20 14:12 PrivatePuffin

That is dangerous - an attacker can create or modify a zpool in a different context of privilege (VM/container/separate host) which you the user have no choice but to import with the setting they applied. So an attacker from a 16G domU can make a storage pool which will merc the 4G dom0. This is no different from a file-format attack, just an fs format one.

sempervictus avatar Dec 22 '20 14:12 sempervictus

As I said: It's not "just" your system being incapable of running ZSTD-19, there might also be a bug making the issue worse.

If you, as a system administrator, have no choice but to import an unknown volume on a production machine without precautions, you need to call the police, not OpenZFS.

But like I said: It's not just ZSTD-19, there seems to be an additional factor at play.

PrivatePuffin avatar Dec 22 '20 14:12 PrivatePuffin

zstd-19 is very slow and takes about 20-30 MB of RAM per thread. A typical 4-core system (with hyperthreading enabled) uses 32 threads, so the usual hung-task warnings from the kernel are normal. If memory allocation fails, zstd compression is simply disabled for the requested block, so I'm uncertain why it ends in a page fault in your case. I'm sure it's a problem with your memory-constrained test scenario, and it can also be a problem with memory virtualisation in QEMU: QEMU may return success for a requested allocation that isn't really present, since it virtualizes memory allocations. So in my opinion the test case is not valid - it's an emulator, not a real system. It's also not dangerous if an attacker sends you a bad export: you cannot import or change properties without root privileges, and the trust level for such a file is on the user's side. It's like receiving a trojan horse as an email attachment - it's up to you whether you run it or not.
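
If it helps narrow this down, those allocation failures/fallbacks can be watched while the reproduction loop runs via the zstd kstat, assuming your build exposes it at this path:

# Sample the zstd kstat every 5 seconds while the rsync loop runs.
# Counter names vary by version; watch the allocation-failure/fallback counters.
watch -n 5 cat /proc/spl/kstat/zfs/zstd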

BrainSlayer avatar Dec 22 '20 14:12 BrainSlayer

@BrainSlayer I think this would retry the memory allocation (which gets disabled) so often that it never gets the chance to free the space used by previous memory pools: "while true; do (rsync -a /usr/* ) || (rm -rf /); done"

PrivatePuffin avatar Dec 22 '20 15:12 PrivatePuffin

So as far as you, the system administrator, preventing import of pools: you can't delegate the ability to import a pool based on the metadata of its datasets, only on the IDs of the subject. Secondly, you can't stop systemd from trying to import a zpool on boot in AWS, for instance - no bootloader access to preempt it and rescue-target the mess. Lastly, you cannot know what was done to the pool from another context until you import it. Basically this is a time bomb, and DoS of a storage plane is a really bad kind of DoS, doubly so if it results in the data being inaccessible.

If the function is known to be dangerous, experimental, and not for use today, why in the world is it an option to create/configure a pool that could be incompatible in certain contexts, when the whole idea of OpenZFS is portability? Saying virtual machines aren't real systems is also... inaccurate? If it has a kernel, it's an OS. Cisco ASA, Juniper SRX, etc. are all VMs, by the way - purchasing departments seem to qualify them as real systems globally.

Given the potential instability of this, and that with every day more and more people might be trying to create archival datasets with the maximum possible compression, it might be prudent to pull the breaking levels out of the tagged release so consumers don't learn this the hard way after putting their real data in there.

sempervictus avatar Dec 22 '20 17:12 sempervictus

Maybe stop ranting and start listening?

I told you that ZSTD-19 is too heavy for your hardware, but that your problems are NOT only caused by it.

I verified one of my working theories about what could be the cause with @BrainSlayer, and he confirmed the code shouldn't page fault on OOM; he also is not sure what is causing the issue.

If memory allocation fails, zstd compression is simply disabled for the requested block, so I'm uncertain why it ends in a page fault in your case

Both @BrainSlayer and I were trying to find the problem you are having, and by getting all worked up about our theorycrafting you are making the job insanely hard for us.

the breaking levels

There is no "the breaking levels", your bug is not specific to a certain level. Nor do we (at the moment) even know what is exactly causing your issue. Can we stop throwing a fuss and work together on solving the issue instead, please?

Simply put, your issue seems to be a combination of 4 (or more) factors:

  • Low RAM
  • Multiple Threads
  • High ZSTD levels
  • Currently Unknown Cause

I was consulting with @BrainSlayer on whether the first 3 alone would be enough to cause this, and he doesn't think so, because it should simply not compress the data (which is actually tested separately at the end of development, by manually triggering all allocation-failure failover options). Which brings us back to the "Currently Unknown Cause", which might even be triggerable on lower zstd levels.

PrivatePuffin avatar Dec 22 '20 18:12 PrivatePuffin

@BrainSlayer Can we put a check into ZSTD that one needs at least 2x the predicted memory consumption of a ZSTD level before it can be enabled? (If not, it would fall back to zstd-3.)

I don't think that's the root cause of this issue, but it would prevent the endless flow of support calls from people trying ZSTD-19 on low-end hardware, or complaining that there is even an option to do so...
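
As a rough illustration of the idea, done from the admin side rather than as the in-module check being proposed, using the ~30 MB-per-thread figure quoted earlier in this thread (the pool name and thread count below are placeholders, not measurements):

PER_THREAD_MB=30                  # upper end of the per-thread figure quoted above for zstd-19
THREADS=32                        # thread count cited earlier for a typical 4-core HT system
NEED_MB=$(( PER_THREAD_MB * THREADS * 2 ))
TOTAL_MB=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
if [ "$TOTAL_MB" -lt "$NEED_MB" ]; then
    echo "zstd-19 likely too heavy here (${TOTAL_MB} MB < ${NEED_MB} MB), falling back to zstd-3"
    zfs set compression=zstd-3 <pool/dataset>
else
    zfs set compression=zstd-19 <pool/dataset>
fi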

PrivatePuffin avatar Dec 22 '20 18:12 PrivatePuffin

@Ornias1993 That's hard to check. I mean, free memory is not free memory: there are caches and buffers which are also available for allocation, so the only real definition of free memory is "if allocation fails, nothing is free". Linux reclaims memory on demand if required. We could use atomic allocations with kmalloc again, but those will fail for ~80% of allocation attempts for big memory blocks, since atomic allocations are defined to return memory quickly, without reclaiming. From my point of view, the QEMU environment is abstracting memory allocations onto local virtual memory, which may include swapping. This is not the case for real kernel environments, which use hard-constrained memory allocations: vmalloc will fail if nothing is available, but QEMU won't fail and just pretends memory is available, even if it isn't. Normally the memory pool will only contain guaranteed-available, directly accessible memory.

BrainSlayer avatar Dec 22 '20 18:12 BrainSlayer

@BrainSlayer I was actually referring to total system memory, not just free memory... because, indeed, free memory would be dynamic. The question is: where in the compression chain would we be able to do so?

Let's assume for the moment that the issue isn't QEMU-related (I would be able to get some hardware going, with specifications similar to this issue's, if that ends up being our only remaining option in a few days)...

PrivatePuffin avatar Dec 22 '20 18:12 PrivatePuffin

@Ornias1993: There's no fuss, there's a stated concern about having release code which can do this out to the public right now, compounded by the fact that level 19 is not supposed to be used while being presented to users. A DoS is a security problem, and security workflows cut off the potential access vectors to the bug first and then fix it while attackers can't get to it (in this case our most likely adversary is a user accidentally moving their entire life's backup into a dataset configured this way and having it go corrupt, but the reds are probably reading this - issue trackers are free targeting guides :) ). Agreed that it needs diagnosis and redress of the root cause; I'm simply saying that while the bug is known to exist, a quick patch to prevent users from setting any level above 14, or whatever's considered "safe for all consumers", should go into the release branch as a safety measure for users installing from DKMS RPMs or whatever their process is for this. Time == faulty installs, which may == lost data/time/money. We work in the enterprise space, where something like this being left in the release code does not help build confidence among those people for FOSS.

@BrainSlayer: there is no swap in play anywhere, the host system is all ZFS with 64G on a gen8 i9HK; the point was to run constrained. I had tested ZSTD with the kvmalloc code you'd originally pushed to the 0.8.x series and don't recall what the maximum compression was in it, but I don't remember it hitting anything like this. We tend to run these "exhaust the space and try to break it" tests before we upgrade production pools, to see if there are edge cases ZTS doesn't address (like leaky space maps). Since it's kernel code, testing in containers is rough, and test metal needs a full offload of all other test workloads to run this on it. I can stand up a physical test host over the break if it gets that far, but for now at least the most easily available environment is KVM. I re-ran the test at zstd=9 for 30m with no crash, so that's a win. Running at 14 currently; will bump it to 16 next if this doesn't die miserably.

sempervictus avatar Dec 22 '20 18:12 sempervictus

There's no fuss, there's a stated concern about having release code which can do this out to the public right now.

There is a bug and you already have 2 people working on the cause, so APPARENTLY your issue is being taken seriously. Stop the fearmongering; we don't even know for sure the issue spreads beyond just your system.

compounded by the fact that level 19 is not supposed to be used while being presented to users.

Don't start selectively quoting me. That's not what I said.

I said:

ZSTD-19 is NOT meant for general use and is only reasonable for VERY specific use cases.

What I meant was: the same conditions apply to ZSTD-19 (and other high zstd levels) that are in place for deduplication. Nothing more, nothing less. Deduplication is also not meant for general use and only reasonable in VERY specific use cases.

A DoS is a security problem

Which you have not proven as of yet. It might or might not be anything serious, and it might or might not be possible to turn this into a DDoS of... a Raspberry Pi. Chill out.

and security workflows cut off the potential access vectors to the bug first

Security workflows also don't do that when the first dude shouts (and nothing more) "DDoS" on a forum. We don't even know what the vector IS. The potential access vector might not even BE ZSTD.

in this case our most likely adversary is a user accidentally moving their entire life's backup into a dataset configured this way and having it go corrupt, but the reds are probably reading this - issue trackers are free targeting guides

Reds, please be informed: if people have set inherently idiotic compression levels that are not compatible with their system AND are running on the equivalent of a Raspberry Pi AND are executing your write-generating script... you might be in luck. Oh, and it might not even be these requirements that caused this issue in the first place. Have a nice day hacking.

Agreed that it needs diagnosis and redress of the root cause; I'm simply saying that while the bug is known to exist, a quick patch to prevent users from setting any level above 14, or whatever's considered "safe for all consumers", should go into the release branch as a safety measure for users installing from DKMS RPMs or whatever their process is for this.

Okay, please disable:

  • ZSTD
  • GZIP
  • Deduplication
  • Wait, while you are at it, remove ZFS completely. Your phone might choke on it.

And the cherry on top: WE DO NOT KNOW WHAT IS CAUSING THIS (yet). It might not even be ZSTD-related, and we don't even know which levels are and are not affected.

PrivatePuffin avatar Dec 22 '20 18:12 PrivatePuffin

I'm not going to continue discussing your ongoing rampage about a possible DDoS. I'd rather trace your issue to actually figure out what the cause of this is, which might (HINT) not even be ZSTD, or even ZFS. We currently don't even know if the bug is an issue in the code or related to your specific system.

So please supply the following information:

  1. Which ZSTD levels are and are not causing these issues? (Which you are already testing - nice.)
  2. Are you seeing the same with GZIP (try the highest level first; if that's okay, I'll believe it to be all right)?
  3. Are you running on the host? What is the host running?
  4. What type of storage are you using?
  5. Could you take some memory-consumption readings from your hypervisor so we can line the crash up against memory consumption over time? (A sampling sketch follows below.)
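
For item 5, something along these lines on the KVM host would do (the domain name is a placeholder):

while true; do
    date +%T
    virsh dommemstat <domain>    # rss / actual / unused for the guest, in KiB
    sleep 10
done | tee zstd-memlog.txt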

I had tested ZSTD with the kvmalloc code you'd originally pushed to the 0.8.x series and don't recall what the maximum compression was in it, but I don't remember it hitting anything like this.

Yeah, that's probably because a lot of the content didn't get compressed at all, I think, or because of the failover allocation code that was added afterwards... or another (thought to be) unrelated change...

Edit: If 16 happens to be okay, I'm open to discussing 3 options:

  1. Lock out 17, 18 and 19 completely (not preferred)
  2. Put 17, 18 and 19 behind an unlock flag (because they could still be a great addition for systems that can handle them)
  3. Automatically detect threads and RAM on init and block the higher ZSTD levels if the system is not able to handle them well.

In all cases we should document two things: 1. A minimum recommended system requirement (I'm amazed there isn't a single RAM recommendation in the docs). 2. The fact that high ZSTD levels can cause significant RAM pressure.

PrivatePuffin avatar Dec 22 '20 18:12 PrivatePuffin

@sempervictus I've come across some changes upstream to the relevant code path; you might wanna try my 1.4.8 PR build just to be sure (#11367).

PrivatePuffin avatar Dec 22 '20 19:12 PrivatePuffin

@BrainSlayer: there is no swap in play anywhere, the host system is all ZFS with 64G on a gen8 i9HK; the point was to run constrained. [...] I can stand up a physical test host over the break if it gets that far, but for now at least the most easily available environment is KVM.

You missed one point: there is host-side virtual memory involved. It's running under QEMU, not on a native system.

BrainSlayer avatar Dec 22 '20 19:12 BrainSlayer

@BrainSlayer Agreed, that would be the next test step. So please supply the following information:

  1. Which ZSTD levels are and are not causing these issues? (Which you are already testing - nice.)
  2. Are you seeing the same with GZIP (try the highest level first; if that's okay, I'll believe it to be all right)?
  3. Are you running on the host? What is the host running?
  4. What type of storage are you using?
  5. Could you take some memory-consumption readings from your hypervisor so we can line the crash up against memory consumption over time?
  6. Please test the same issue directly on hardware.

PrivatePuffin avatar Dec 22 '20 20:12 PrivatePuffin

I can just say that I tested zstd-19 on a 512 MB memory-constrained device with no issues. But that was real hardware, not an emulator.

BrainSlayer avatar Dec 22 '20 20:12 BrainSlayer

I can just say that I tested zstd-19 on a 512 MB memory-constrained device with no issues. But that was real hardware, not an emulator.

Yeah... We both did a lot of constrained testing, and also tested the fallback allocators one by one by force...

PrivatePuffin avatar Dec 22 '20 20:12 PrivatePuffin

Did you test the latest allocator change constrained though?

PrivatePuffin avatar Dec 22 '20 20:12 PrivatePuffin

So far I've cycled through 9, 13, and 16, and am currently running 18. The next tests will be conducted with 8 cores available to the VM and 1G of memory, to try and induce the behavior. GZIP has never caused these issues in the past - the test was run with gzip-9 as part of the 2.0 testing. The VM storage is a virtio-scsi block device mapped to a qcow2 on the host's ZFS pool on a 2T NVMe. Stats show it's clearly CPU-bound - no iowait to speak of, kernel threads eating up 85%+ of CPU. I'll track the VM's memory usage from the host. I will test on hardware when I have time to do so, which probably will not be until the weekend.

sempervictus avatar Dec 22 '20 20:12 sempervictus

@sempervictus Awesome, that's pretty similar to how we did our tests a few months ago 👍

PrivatePuffin avatar Dec 22 '20 20:12 PrivatePuffin

The actual error at the top of the message indicates an access boundary violation trying to hit that address, which manifests as a page fault. The address is in CR2 in the stack trace. You guys are much more familiar with the code - any chance something is being dereferenced from the wrong place and used as a pointer?

sempervictus avatar Dec 22 '20 20:12 sempervictus

Did you test the latest allocator change constrained though?

The test results were the starting point for developing the fallback handlers, since I was running into allocation deadlocks, and I retested it all the time while developing. I'm also still running zstd today on embedded systems (WiFi routers) with even tighter memory constraints.

BrainSlayer avatar Dec 22 '20 20:12 BrainSlayer

Someone much more expert than me in these matters (Brad Spengler) noted that this is being caused by a null in ZSTD_matchState_t being treated as a pointer. I'm building a debug kernel now to try and track down the state right before this occurs. He also mentioned that the functions involved have changed in 1.4.7, so this might already be fixed upstream. What's the plan for ZSTD version bumps? I gather that's why dedup+zstd isn't a great idea, but if backports are happening from upstream, will the significant changes be logged in this repo, or is it planned as more of "we bumped to vX.Y, please read their changelog, and here's how we deviated from them"?
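
For anyone else digging: assuming the zfs module is built with debug info, the faulting offset from the trace above can be mapped back to a source line with gdb (the module path is a placeholder):

# Needs the unstripped zfs.ko from the debug build.
gdb -batch -ex 'list *(zfs_ZSTD_compressBlock_btultra2+0x2cdd)' <path to zfs.ko>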

sempervictus avatar Dec 22 '20 21:12 sempervictus

@sempervictus You can check my 1.4.8 PR to make sure it's fixed, but indeed... the upstream code has evolved quite a bit.

If there are any relevant highlights, I tried to add them to the changelog in our PR as well; I only noticed the speed improvement on smaller block sizes, though. But if this bug is indeed solved by upstream, I'll make a note of it in the update PR... In theory we did leave room for handling breaking changes in the ZSTD library, but indeed... small changes mean that new blocks might differ from the current blocks in your dedup table.

PrivatePuffin avatar Dec 22 '20 21:12 PrivatePuffin

@Ornias1993: 1.4.8 appears to fix the problem - thank you. 3h of running the original test scenario with nothing adverse in dmesg. Tests are still running, with zloop running in parallel for cross coverage on the same VM. Do the buildbots have memory and other sanitizer builds to provide more generic coverage for stuff like this? Stress testing like what I'm doing currently can show some issues, but given the complexity of ZSTD, coupled with the fact that it's running at the same tier as the core memory allocator, it might be handy to have sanitizers watch it get fuzzed for coverage of operating conditions and data inputs - are there ways, for instance, to internally "zip-bomb" a compressor/decompressor implementation given the caveats of running in ring0 under additional pressure?
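
For reference, one way to get that kind of coverage on a private test kernel (separate from whatever the buildbots do) would be a KASAN-enabled rebuild, roughly:

# Run from the test kernel's source tree; KASAN/KASAN_INLINE are standard kernel config options.
./scripts/config --enable KASAN --enable KASAN_INLINE
make olddefconfig
make -j"$(nproc)" && make modules_install && make install
# Reboot into the KASAN kernel, then rebuild the zfs module (e.g. via DKMS) against it.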

sempervictus avatar Dec 23 '20 04:12 sempervictus

So close. Adding zloop to the mix seems to have raised enough pressure to trip it again:

Dec 22 23:10:15 svl-arch00 kernel: BUG: unable to handle page fault for address: ffffffffffffff24
Dec 22 23:10:15 svl-arch00 kernel: #PF: supervisor write access in kernel mode
Dec 22 23:10:15 svl-arch00 kernel: #PF: error_code(0x0002) - not-present page
Dec 22 23:10:15 svl-arch00 kernel: PGD 4040c067 P4D 4040c067 PUD 4040e067 PMD 0 
Dec 22 23:10:15 svl-arch00 kernel: Oops: 0002 [#1] SMP
Dec 22 23:10:15 svl-arch00 kernel: CPU: 2 PID: 420 Comm: z_wr_iss Tainted: G                T 5.4.85 #1
Dec 22 23:10:15 svl-arch00 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
Dec 22 23:10:15 svl-arch00 kernel: RIP: 0010:zfs_ZSTD_updateStats+0x67/0xc0
Dec 22 23:10:15 svl-arch00 kernel: Code: 48 39 d6 75 ec 43 8d 04 36 01 45 30 41 83 fe 3f 76 60 45 0f bd f6 41 83 c6 13 48 8b 45 08 41 83 c5 01 41 83 ec 03 45 0f bd ed <42> 83 04 b0 01 48 8b 45 18 4d 63 ed 83 45 34 01 42 83 04 a8 01 83
Dec 22 23:10:15 svl-arch00 kernel: RSP: 0000:ffffaaa683de3880 EFLAGS: 00010202
Dec 22 23:10:15 svl-arch00 kernel: RAX: ffffffffffffff20 RBX: ffffaaa69597d86c RCX: 0000000000000542
Dec 22 23:10:15 svl-arch00 kernel: RDX: ffffaaa69597d86d RSI: ffffaaa69597d86d RDI: ffff9f1c2062d3b0
Dec 22 23:10:15 svl-arch00 kernel: RBP: ffff9f1c2062d3b0 R08: 0000000000000004 R09: 0000000000000001
Dec 22 23:10:15 svl-arch00 kernel: R10: ffff9f1c2062d288 R11: 0000000000000004 R12: 0000000000000001
Dec 22 23:10:15 svl-arch00 kernel: R13: 000000000000000b R14: 0000000000000001 R15: 0000000000000001
Dec 22 23:10:15 svl-arch00 kernel: FS:  0000000000000000(0000) GS:ffff9f1c3bb00000(0000) knlGS:0000000000000000
Dec 22 23:10:15 svl-arch00 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 22 23:10:15 svl-arch00 kernel: CR2: ffffffffffffff24 CR3: 00000001192a4006 CR4: 0000000000360ee0
Dec 22 23:10:15 svl-arch00 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 22 23:10:15 svl-arch00 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 22 23:10:15 svl-arch00 kernel: Call Trace:
Dec 22 23:10:15 svl-arch00 kernel:  zfs_ZSTD_compressBlock_btultra2+0x15ba/0x4bf0
Dec 22 23:10:15 svl-arch00 kernel:  zfs_ZSTD_compressBlock_internal+0x1d1/0x4d0
Dec 22 23:10:15 svl-arch00 kernel:  zfs_ZSTD_compressContinue_internal+0x667/0x9b0
Dec 22 23:10:15 svl-arch00 kernel:  zfs_ZSTD_compressEnd+0x35/0x380
Dec 22 23:10:15 svl-arch00 kernel:  ZSTD_compressStream_generic+0x42e/0x4f0
Dec 22 23:10:15 svl-arch00 kernel:  zfs_ZSTD_compress2+0xe3/0x280
Dec 22 23:10:15 svl-arch00 kernel:  zfs_zstd_compress+0xdd/0x170
Dec 22 23:10:15 svl-arch00 kernel:  zio_compress_data+0xc1/0xf0
Dec 22 23:10:15 svl-arch00 kernel:  zio_write_compress+0x44b/0x8f0
Dec 22 23:10:15 svl-arch00 kernel:  ? __wake_up_common_lock+0x86/0xb0
Dec 22 23:10:15 svl-arch00 kernel:  zio_execute+0x8a/0x130
Dec 22 23:10:15 svl-arch00 kernel:  taskq_thread+0x2ce/0x510
Dec 22 23:10:15 svl-arch00 kernel:  ? try_to_wake_up.isra.0+0x390/0x390
Dec 22 23:10:15 svl-arch00 kernel:  ? zio_destroy+0xe0/0xe0
Dec 22 23:10:15 svl-arch00 kernel:  ? taskq_thread_spawn+0x60/0x60
Dec 22 23:10:15 svl-arch00 kernel:  kthread+0x10f/0x130
Dec 22 23:10:15 svl-arch00 kernel:  ? kthread_associate_blkcg+0x90/0x90
Dec 22 23:10:15 svl-arch00 kernel:  ret_from_fork+0x1f/0x40
Dec 22 23:10:15 svl-arch00 kernel: Modules linked in: intel_rapl_msr intel_rapl_common intel_pmc_core_pltdrv intel_pmc_core kvm_intel kvm irqbypass crct10dif_pclmul ghash_clmulni_intel joydev rapl mousedev qxl ttm drm_kms_helper evdev fb_sys_fops input_leds syscopyarea uio_pdrv_genirq led_class sysfillrect uio psmouse pcspkr mac_hid i2c_piix4 sysimgblt qemu_fw_cfg button sch_fq_codel drm fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 ata_generic pata_acpi virtio_net net_failover virtio_balloon virtio_blk failover serio_raw atkbd libps2 crc32_pclmul crc32c_intel uhci_hcd aesni_intel ehci_pci ehci_hcd glue_helper crypto_simd cryptd usbcore virtio_pci usb_common ata_piix floppy i8042 serio
Dec 22 23:10:15 svl-arch00 kernel: CR2: ffffffffffffff24
Dec 22 23:10:15 svl-arch00 kernel: ---[ end trace 0f2e9221cf73c222 ]---
Dec 22 23:10:15 svl-arch00 kernel: RIP: 0010:zfs_ZSTD_updateStats+0x67/0xc0
Dec 22 23:10:15 svl-arch00 kernel: Code: 48 39 d6 75 ec 43 8d 04 36 01 45 30 41 83 fe 3f 76 60 45 0f bd f6 41 83 c6 13 48 8b 45 08 41 83 c5 01 41 83 ec 03 45 0f bd ed <42> 83 04 b0 01 48 8b 45 18 4d 63 ed 83 45 34 01 42 83 04 a8 01 83
Dec 22 23:10:15 svl-arch00 kernel: RSP: 0000:ffffaaa683de3880 EFLAGS: 00010202
Dec 22 23:10:15 svl-arch00 kernel: RAX: ffffffffffffff20 RBX: ffffaaa69597d86c RCX: 0000000000000542
Dec 22 23:10:15 svl-arch00 kernel: RDX: ffffaaa69597d86d RSI: ffffaaa69597d86d RDI: ffff9f1c2062d3b0
Dec 22 23:10:15 svl-arch00 kernel: RBP: ffff9f1c2062d3b0 R08: 0000000000000004 R09: 0000000000000001
Dec 22 23:10:15 svl-arch00 kernel: R10: ffff9f1c2062d288 R11: 0000000000000004 R12: 0000000000000001
Dec 22 23:10:15 svl-arch00 kernel: R13: 000000000000000b R14: 0000000000000001 R15: 0000000000000001
Dec 22 23:10:15 svl-arch00 kernel: FS:  0000000000000000(0000) GS:ffff9f1c3bb00000(0000) knlGS:0000000000000000
Dec 22 23:10:15 svl-arch00 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 22 23:10:15 svl-arch00 kernel: CR2: ffffffffffffff24 CR3: 00000001192a4006 CR4: 0000000000360ee0
Dec 22 23:10:15 svl-arch00 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 22 23:10:15 svl-arch00 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 000000000000040

sempervictus avatar Dec 23 '20 04:12 sempervictus

Do the buildbots have memory and other sanitizer builds to provide more generic coverage for stuff like this? Stress testing like what I'm doing currently can show some issues, but given the complexity of ZSTD, coupled with the fact that it's running at the same tier as the core memory allocator, it might be handy to have sanitizers watch it get fuzzed for coverage of operating conditions and data inputs - are there ways, for instance, to internally "zip-bomb" a compressor/decompressor implementation given the caveats of running in ring0 under additional pressure?

I don't think they do, but we might actually want to add your scenario as a test... Could you also supply an example of your "adding zloop to the mix" scenario? @behlendorf, this is your department, I think.

So close. Adding zloop to the mix seems to have raised enough pressure to trip it again:

Well, it at least seems to be a different code path this time and takes more effort... Which is a good thing!

@Cyan4973 and @c0d3z3r0 Could this be upstream-related, or more likely related to our code path?

PrivatePuffin avatar Dec 23 '20 11:12 PrivatePuffin