
ashift=18 needed for NVMe with physical block size 256k

Open michaelfuckner opened this issue 3 years ago • 4 comments

zfs 2.1.4 on Proxmox 7.2 (5.15.30-2-pve)

Describe the problem you're observing

Is it correct that I should use ashift=18 for these drives to run them as a mirror? 16 seems to be the highest value accepted.

Describe how to reproduce the problem

 root@prox1:~# zpool status
  pool: tank
 state: ONLINE
status: One or more devices are configured to use a non-native block size.
        Expect reduced performance.
action: Replace affected devices with devices that support the
        configured block size, or migrate data to a properly configured
        pool.
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme0n1p1  ONLINE       0     0     0  block size: 65536B configured, 262144B native
            nvme1n1p1  ONLINE       0     0     0  block size: 65536B configured, 262144B native

errors: No known data errors
root@prox1:~# cat /sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0/nvme/nvme0/nvme0n1/queue/physical_block_size
262144
root@prox1:~# cat /sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0/nvme/nvme0/nvme0n1/queue/logical_block_size
512
root@prox1:~# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     22103782****         Micron_7450_MTFDKCB3T8TFR                1           1.77  TB /   3.84  TB    512   B +  0 B   E2MU110
/dev/nvme1n1     22103783****         Micron_7450_MTFDKCB3T8TFR                1           1.77  TB /   3.84  TB    512   B +  0 B   E2MU110
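
For context, ashift is the base-2 logarithm of the sector size ZFS uses for a vdev, so the sizes in the output above map to ashift values as in the rough sketch below. The helper name `ashift_for` is made up for illustration and is not part of any ZFS tool.

```python
def ashift_for(block_size: int) -> int:
    """Return log2(block_size) for a power-of-two block size in bytes."""
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError(f"not a power of two: {block_size}")
    return block_size.bit_length() - 1


# Sizes taken from the zpool status and sysfs output in this report.
for label, size in [("logical", 512), ("configured", 65536), ("native", 262144)]:
    print(f"{label}: {size} B -> ashift={ashift_for(size)}")
```

This is why the 262144B native size corresponds to ashift=18, while the pool was created with ashift=16 (65536B).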
 



michaelfuckner avatar Sep 18 '22 11:09 michaelfuckner

You should not want a 256KB ashift. It would be too space-inefficient in most cases. Just recently my https://github.com/openzfs/zfs/pull/13798 was merged to master to improve this area. I think it should do the right thing for you.

amotin avatar Sep 18 '22 11:09 amotin

The highest ashift that the disk format can handle is ashift=17, but going that high would kill the uberblock history and would likely require code changes to work semi-reliably: the current code relies on the uberblock history existing, so it might not be able to import the pool following a cold boot when ashift=17 is used. Doing ashift=18 would require a disk format change.

As @amotin said, I do not think that ashift=18 is the answer here.

ryao avatar Sep 19 '22 21:09 ryao

So what is your suggestion? Use it as is, get new drives, or ask the vendor whether it is possible to reformat to a smaller block size?

michaelfuckner avatar Sep 20 '22 06:09 michaelfuckner

Recreate the pool with ashift=12 and use it that way. The drive is designed to support 4K random IO, although in terms of pure 4K random writes, it is suboptimal:

https://www.storagereview.com/review/micron-7400-pro-ssd-review

If you are not doing random IO, I suggest setting a 1M recordsize so that most data writes will be full physical page writes.
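
To illustrate the arithmetic behind that suggestion, here is a rough sketch: a 1M record spans a whole number of the 256K physical blocks the drive reports, while a single 4K write covers only a fraction of one. The function name is hypothetical, and the model assumes writes start on a physical block boundary and ignores compression and metadata.

```python
PHYSICAL_BLOCK = 262144  # bytes, as reported in queue/physical_block_size


def physical_blocks_covered(write_size: int, physical_block: int = PHYSICAL_BLOCK):
    """Return (full_blocks, leftover_bytes) for a write of write_size bytes.

    Assumes the write starts on a physical block boundary.
    """
    return divmod(write_size, physical_block)


print(physical_blocks_covered(1024 * 1024))  # one 1M record
print(physical_blocks_covered(4096))         # one ashift=12 sector
```

A leftover of zero means the write fills only whole physical blocks; a non-zero leftover means part of a physical block is written.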

As for redoing the low level formatting, that is not really possible due to how flash works internally. Thankfully, since 4K is such a common IO size, flash drive firmware is designed to handle it in a performant way, despite flash physical page sizes becoming insane.

ryao avatar Sep 20 '22 12:09 ryao

The issue is solved with firmware E2MU200 from https://www.micron.com/products/ssd/firmware-downloads

With this firmware, the Linux kernel reports a physical block size of 4096 bytes for those drives. You can find example outputs regarding the physical block size here: https://www.thomas-krenn.com/de/wiki/NVMe_physical_block_size#Beispiel_einer_NVMe_SSD (currently German only; an English version might follow).
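
One way to check what the kernel reports after a firmware update is to read the queue attributes from sysfs. The sketch below assumes the usual `/sys/block/<device>/queue` layout; the function name and the `sysfs_root` parameter are made up for illustration, and `nvme0n1` is just the device name from this report.

```python
from pathlib import Path


def read_block_sizes(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Read logical and physical block sizes (in bytes) for a block device."""
    queue = Path(sysfs_root) / device / "queue"
    return {
        name: int((queue / name).read_text().strip())
        for name in ("logical_block_size", "physical_block_size")
    }


if __name__ == "__main__":
    dev = "nvme0n1"  # device name from this report; adjust as needed
    try:
        print(dev, read_block_sizes(dev))
    except FileNotFoundError:
        print(f"{dev}: not present on this system")
```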

tk-wfischer avatar May 22 '23 13:05 tk-wfischer