Installing flatpak packages on OpenZFS is too slow. OSTree over OpenZFS is too slow.
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Ubuntu |
| Distribution Version | 20.04 |
| Linux Kernel | Linux laika 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Architecture | x86_64 |
| ZFS Version | 0.8.3-1ubuntu12.4 |
| SPL Version | 0.8.3-1ubuntu12.4 |
Describe the problem you're observing
Installing a flatpak application takes far too long, most likely due to the way that OSTree replicates files using hard links.
A lot of iowait time can be observed during the install.
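For reference, one way to watch the iowait and pool-level write activity while the install runs (iostat comes from the sysstat package; the pool name rpool is an assumption, substitute your own):
# Per-CPU iowait and per-device utilization, refreshed every second
iostat -cx 1
# ZFS-level throughput for the pool, refreshed every second (pool name assumed)
zpool iostat -v rpool 1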
Describe how to reproduce the problem
sudo -s
apt update
apt install flatpak
time flatpak install https://dl.flathub.org/repo/appstream/ch.openboard.OpenBoard.flatpakref
Takes 30 seconds on ext4 and 6 minutes on OpenZFS.
Include any warning/errors/backtraces from the system logs
No errors, just poor performance.
I can confirm this issue on Gentoo, kernel 5.4.72, OpenZFS 0.8.5, default module parameters, two-way mirror with WD80EFZX (CMR). However, the flatpak install command doesn't finish for me even after minutes. strace shows it seems to get trapped repeating poll([{fd=23, events=POLLIN}], 1, 300) = 0 (Timeout). I can also confirm the results for ext4, and for comparison, it also works fine on an ext4-formatted loop device (file) on top of ZFS. For what it's worth: the disk is making a lot of noise until the process is interrupted. I observed a similar behavior with Docker, although tasks eventually finished in that case. At the same time the rest of the system behaves normally, especially with a populated ARC.
@behlendorf, maybe it's worth taking another look in case there's more to this than just an opportunity to improve performance for an edge case? Also, I'd like to understand what else I could check on my end, thanks!
> Also, I'd like to understand what else I could check on my end, thanks!
It sounds like the first order of business would be to determine exactly what operation on OpenZFS is significantly slower than ext4. From the original post it sounds like the suspicion is that it's hard links. Can you re-run the flatpak command under strace -c for OpenZFS and ext4 so we can get a histogram of all the system call timings for comparison?
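A possible invocation (reusing the OpenBoard flatpakref from the reproduction steps; -f follows the child processes flatpak spawns, --user keeps everything in one process tree):
# Collect a per-syscall count/time summary; run once on OpenZFS and once on ext4
strace -c -f -o strace-summary.txt \
  flatpak install --user -y https://dl.flathub.org/repo/appstream/ch.openboard.OpenBoard.flatpakref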
The flatpak command (installing GIMP from flathub-beta in this case) actually ran to completion after 65 minutes (1m24s on the ext4 loop device). Stats and timing: gh11140.txt. Since the system calls don't even remotely add up to the real time (14s in system calls vs 65min total for OpenZFS, and 5s vs 85s for ext4), is the conclusion that the time is spent in user space and that the flaw is solely in the application?
The strace output does show we're not spending an inordinate amount of time in OpenZFS-related system calls. That's good, but I don't think it entirely lets ZFS off the hook. The application appears to be polling, waiting for something to happen; figuring out what that something is would be the next step. Is there any debugging you can enable in the flatpak command which would give some indication of what it's waiting on?
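One option that might help here (hedged; these are flatpak's general --verbose and --ostree-verbose switches, not something confirmed in this thread) is rerunning the install with verbose logging to see which phase the poll loop corresponds to:
# Verbose flatpak and OSTree output during the slow install
flatpak -v --ostree-verbose install --user -y \
  https://dl.flathub.org/repo/appstream/ch.openboard.OpenBoard.flatpakref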
I'm not sure if this is relevant, but we have observed a large performance disparity installing flatpaks on SSDs with OpenZFS. Installing GIMP takes:
On SanDisk SDSSDP12 120GB
real 14m39,336s
user 0m48,155s
sys 0m48,245s
On KINGSTON SA400S3 480GB
real 1m34,298s
user 0m21,515s
sys 0m29,661s
During installation top shows a very high (80%) iowait on every core.
If you want to figure out what is slow, you need to take into account that half of a system-wide flatpak installation happens in flatpak-system-helper. The user process only does the download; the rest is the import into the system dir.
To make it easier to debug, I recommend using --user to install to the user's home directory instead, as that is easier to trace.
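For example (the remote-add line is the standard Flathub setup; GIMP is reused from the timings above):
# Per-user installation: everything runs in the flatpak process itself, so it is easy to trace
flatpak remote-add --user --if-not-exists flathub https://dl.flathub.org/repo/flathub.flatpakrepo
time flatpak install --user -y flathub org.gimp.GIMP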
@alexlarsson, thanks for this, I've always been running flatpak install with --user. I figured that the culprit (not necessarily the root cause) in this case could be ostree rather than flatpak. I'm not a software engineer, so guidance on setting up a gdb testbed or better command-line tools would certainly be useful.
Generally I'm not so sure this is a flatpak issue, as I've observed similar behavior - crazy disk noise and excessive iowait, as if multiple processes were trying to write to the same disk all at the same time - with other applications too, usually to a lesser degree. flatpak might be a good case to debug, though, as it seems to be hit harder than others. I'm a little surprised to see the variance even among NAND storage as per @vcarceler's experience; I'd have expected this issue to be a side effect of this particular workload with OpenZFS on mechanical storage.
Generally a flatpak install happens approximately like this:
Stage 1 "pull"
- mkdir stage-dir
- download required files to stage-dir/objects/*
- syncfs(stage-dir)
- rename stage-dir/objects/* to repo/objects/*
- fsync(repo/objects)
Stage 2 "deploy"
- mkdir deploy-tmpdir
- foreach $file in the app
  - mkdir -p deploy-tmpdir/$(dirname $file)
  - hardlink from repo/objects/$object to deploy-tmpdir/$file
- syncfs(deploy-tmpdir)
- rename(deploy-tmpdir, deploy-dir)
At this point we can run the app from deploy-dir.
This is fairly correct in the --user case. However, if the installation is system-wide, then things are split in the middle, where we pull to a local sub-repo similar to stage 1, but then we call out to a system helper that imports and verifies the sub-repo into the real system repo, and then run the stage 2 from that. I imagine the poll timeout you see is the main flatpak waiting for the system-helper to run stage 2.
In terms of fs ops flatpak is fairly regular, although it does rely a fair bit on hardlinks, so if those are slow that would be a problem.
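For anyone who wants to exercise roughly this I/O pattern outside of flatpak, here is a minimal shell sketch of the two stages described above. The directory names and the source tree ($appfiles) are made up for illustration, and coreutils sync -f stands in for syncfs(2):
#!/bin/sh
set -e
repo=repo; stage=stage-dir; deploytmp=deploy-tmpdir; deploy=deploy-dir
appfiles=app-files                         # pre-existing directory of test files (assumed)

# Stage 1 "pull": write objects into a staging dir, syncfs, rename into the repo
mkdir -p "$stage/objects" "$repo/objects"
cp -r "$appfiles"/. "$stage/objects"/      # stand-in for the download
sync -f "$stage"                           # syncfs on the filesystem holding stage-dir
mv "$stage"/objects/* "$repo/objects"/     # rename(2) the objects into the real repo
sync -f "$repo/objects"                    # the real code fsync()s the objects directory

# Stage 2 "deploy": hard-link every object into a temp tree, syncfs, rename the tree
mkdir -p "$deploytmp"
cp -rl "$repo/objects"/. "$deploytmp"/     # cp -l hard-links each file instead of copying
sync -f "$deploytmp"
mv "$deploytmp" "$deploy"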
Here are some results obtained from perf record ostree --repo=repo commit --branch=foo portage/ as per https://github.com/ostreedev/ostree/issues/2227#issuecomment-726895364 (thanks @dbnicholson). This indeed triggers the issue. Here I interrupted the command after a couple of minutes: ostree-commit-portage.txt. The ostree version used is 2020.7; ashift was always 12 for the 4k physical sector size in my case. I've also been playing with different recordsize values (128k, 8k, 4k) but to no avail.
Here are the results from perf record and strace -o.
New OSTree repo: ostree --repo=repo --mode=bare-user-only init
Download Linux source code: mkdir tree; cd tree; wget https://github.com/torvalds/linux/archive/v5.10-rc3.tar.gz; tar xzf v5.10-rc3.tar.gz; cd ..
And finally a full run of perf record ostree --repo=repo commit --branch=foo tree/ produces perf.data -> https://cloud.elpuig.xeill.net/index.php/s/D3JCFuoaLdL43oT
And 17 minutes of strace -o strace.log ostree --repo=repo commit --branch=foo tree/ produces strace.log -> https://cloud.elpuig.xeill.net/index.php/s/PiS3wEHTm3oIF7D
Is this useful?
Just for comparison: the same ostree --repo=repo commit --branch=foo tree/ in an ext4 filesystem mounted on a loop device runs in 17 seconds and produces:
- perf.data -> https://cloud.elpuig.xeill.net/index.php/s/SqY1HlMQcIfOHYI
- strace.log -> https://cloud.elpuig.xeill.net/index.php/s/N1HqCbDyZpgSD5e
With this performance difference it seems difficult to believe that OpenZFS works fine as a root filesystem. But we have hundreds of computers (student classrooms and laptops) with OpenZFS as the root filesystem and we haven't seen any performance problem other than this one. So @dbnicholson, do you have a clue about what ostree commit does that is particular from the filesystem's point of view?
What ostree does that doesn't happen with regularity in day-to-day use is use a lot of hardlinks. It's one of the core tenets of how ostree works. So, if hardlinks are slow on ZFS, then everything ostree-related will be slow. There are ways to make ostree use copies instead of links at the cost of disk usage, but I don't think any options like that are clearly exposed in flatpak. To confirm whether that's the case, you can try a test with cp -l. But I imagine the information you supplied will allow the ZFS developers to determine where the real issue is.
I can't think of anything else ostree really does that would cause such a slowdown. There's filesystem syncs, but that's no different than what any application does that wants to safely handle persistent state like a web browser.
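For the cp -l test mentioned above, a quick micro-benchmark could look like this (the unpacked kernel tree from the earlier comment is reused as the data set; the path is an assumption):
cd /path/on/zfs                    # dataset that shows the problem (assumed path)
time cp -rl tree tree-hardlinked   # hard-link every file in the tree
time cp -r  tree tree-copied       # plain copy, for comparison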
I tested cp -l and ln without a noticeable performance penalty.
I understand that OSTree uses a lot of hard links, but does it use hard links in the very first ostree commit? As I understand it, in the first commit there is no redundant data, so there is nothing to share, and find repo -links +1 only shows directories.
But this first commit performs very badly on OpenZFS. Checkouts with ostree --repo=repo checkout foo tree-checkout/ work very fast.
I also tested git commit to see if there is something in common, but it works fine.
I hope the ZFS developers can determine the cause of such a slowdown.
It shouldn't use hardlinks typically until you check something out. A single commit should not cause any hardlinks.
I noticed that in repo/objects all files are dated Jan 1 1970. Maybe an ostree commit makes a lot of changes to file metadata?
But running touch on files to change the date performs well.
Yesterday I tried to reproduce this with ostree v2019.5, and that behaves slightly differently: flatpak install gimp (same example as above) gets stuck the first time only at around 97%, while processing "org.gnome.Platform.Locale", whereas with 2020.7 this already happens at around 18-20%. Any significant changes between these versions that might help get closer to the root cause?
I've been wondering why flatpak is broken when e.g. trying to install GIMP.
As far as I remember, there is at least one core occupied (80-99% CPU load, via top) and constant zio_* activity visible in iotop.
The point where the progress gets stuck is somewhat random: once it was stuck at ~60% and another time at ~97%.
I ran into this when trying to install a gaming environment via flatpak.
I wonder if zfs set sync=disabled [dataset] can make a difference.
@IvanVolosyuk I tried your suggestion of disabling synchronous requests, and flatpak is now able to install applications.
Steps taken:
- zfs set sync=disabled zroot/ROOT/bootenv/var/lib/flatpak
- flatpak install org.videolan.VLC org.videolan.VLC.Plugin.bdj org.videolan.VLC.Plugin.fdkaac org.videolan.VLC.Plugin.makemkv com.makemkv.MakeMKV
I wonder if it is faster than ext4 on a zvol or in a loopback file (with sync on). This kind of confirms my suspicion, after reading the comments on this issue, that OSTree might just try to sync a lot, which is probably not needed on ZFS because of its ordering guarantees.
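A rough sketch for the zvol comparison (pool name rpool, size, and mountpoint are assumptions):
# Create a zvol, format it ext4, and mount it; then repeat the ostree/flatpak test on it
zfs create -V 10G rpool/flatpak-test
mkfs.ext4 /dev/zvol/rpool/flatpak-test
mkdir -p /mnt/flatpak-test
mount /dev/zvol/rpool/flatpak-test /mnt/flatpak-test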
I just hit this locally on a machine with a rpool on an ancient SSD. I observed two things:
- zpool iostat 1 had writes pegged at 20 MB/sec.
- /proc/spl/kstat/zfs/rpool/txgs showed txgs being done absurdly fast.
I set sync=disabled on the dataset and flatpak finished an operation that was projected to take nearly an hour in seconds.
My guess is that flatpak is using O_SYNC. If it writes in 4K chunks (for example) with O_SYNC, then we would see amplification of those writes into the record size, which is 128K by default. This would happen 32 times per 128K record. ext4 would definitely handle this better, as it would just pass through the writes to the disk and not have to deal with any write amplification.
I have not confirmed that flatpak is using O_SYNC (as I hit this while doing something else), but it fits the observations. This is technically an upstream flatpak issue, but there is some further analysis that we can do here. Specifically, we need to confirm that flatpak is using O_SYNC and determine the sizes of the writes.
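A possible way to check both points with strace (tracing only the open/write/sync family; the GIMP install is just an example target and assumes a configured Flathub remote):
# Record open flags, write sizes and sync calls from flatpak and its children
strace -f -e trace=openat,write,fsync,fdatasync,syncfs,sync_file_range \
  -o flatpak-sync.log flatpak install --user -y flathub org.gimp.GIMP
# O_SYNC/O_DSYNC opens, if any, and the write() byte counts are then visible in the log
grep -E 'O_SYNC|O_DSYNC' flatpak-sync.log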
It doesn't look like O_SYNC is used, but rather that it's very aggressive with fsync? https://github.com/ostreedev/ostree/blob/5523aee0829d0a4266047b21bab218618f77f46f/src/libostree/ostree-repo-commit.c#L51-L69
wow, that use of fsync screams "bad idea" to me. solving a symptom instead of the problem. :(
Neither flatpak nor ostree explicitly use O_SYNC to my knowledge. What ostree does is carefully sync objects during pulls and checkouts to ensure that both the repository and the installation are consistent. If you can get the ostree CLI (should be available on most distributions), then I think you can do a reasonable simulation with the commit and checkout builtins.
# Use a throwaway directory for the test
testdir=$(mktemp -d -p /somewhere/on/zfs)
repo="$testdir/repo"
files="$testdir/files"
checkout="$testdir/checkout"
# Set up a bare-user-only repository as flatpak does
ostree --repo="$repo" init --mode=bare-user-only
# Make a reasonable directory of files to commit and checkout
mkdir "$files"
tar -xf something.tar -C "$files"
# Commit them to the test branch. This will copy the objects into the repo and do various syncs.
# Experiment with --fsync=no.
# This is roughly equivalent to the disk IO from the pull part of flatpak install.
ostree --repo="$repo" commit -b test -s test "$files"
# Checkout the commit. This will hardlink the objects from the repo and do various syncs.
# Experiment with --fsync=no.
# This is roughly equivalent to the disk IO from the deploy part of flatpak install.
ostree --repo="$repo" checkout test "$checkout"
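To quantify how much of the time is spent in the syncs, one could follow the --fsync=no hints in the comments above and time two fresh repositories (the repo names here are made up so the second commit cannot deduplicate against the first):
ostree --repo="$testdir/repo-fsync"   init --mode=bare-user-only
ostree --repo="$testdir/repo-nofsync" init --mode=bare-user-only
time ostree --repo="$testdir/repo-fsync"   commit -b test -s test "$files"
time ostree --repo="$testdir/repo-nofsync" commit --fsync=no -b test -s test "$files"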
> wow, that use of fsync screams "bad idea" to me. solving a symptom instead of the problem. :(
When ostree is used to handle your OS, then carefully syncing every file is something I very much want it to do. I don't want my repo to become corrupted and hence make my system unbootable.
You could make an argument that when you're installing apps with flatpak, maybe you don't care about that level of consistency. In that case you can globally disable fsync at the repo level:
# System repo
sudo ostree --repo=/var/lib/flatpak/repo config set core.fsync false
# User repo
ostree --repo=$HOME/.local/share/flatpak/repo config set core.fsync false
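To verify the setting afterwards (hedged; ostree config also provides a get subcommand in recent versions):
ostree --repo=$HOME/.local/share/flatpak/repo config get core.fsync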
You could also make the argument to the flatpak project that it should create its ostree repos with fsync disabled by default.
You might also ask, if fsync is pointless on zfs, then why isn't fsync a no-op? As far as I know, ostree is syncing files in the recommended and most efficient way if you care about data consistency, so I'm curious what's a bad idea about what it's doing.
Yes, what are these data consistency issues that result from not running fsync every time a file is copied? What documentation is recommending fsync after every file as the most efficient way to obtain data consistency?
Hm... as far as I know: write barriers (ext4) and atomicity (from the 'ACID' philosophy - atomicity, consistency, isolation, durability; for databases, or in filesystems like reiser4).
From the ext4 mount options documentation:
barrier=<0|1(*)>, barrier(*), nobarrier: This enables/disables the use of write barriers in the jbd code. barrier=0 disables, barrier=1 enables. This also requires an IO stack which can support barriers, and if jbd gets an error on a barrier write, it will disable again with a warning. Write barriers enforce proper on-disk ordering of journal commits, making volatile disk write caches safe to use, at some performance penalty. If your disks are battery-backed in one way or another, disabling barriers may safely improve performance. The mount options "barrier" and "nobarrier" can also be used to enable or disable barriers, for consistency with other ext4 mount options.
https://en.wikipedia.org/wiki/ACID https://en.wikipedia.org/wiki/Atomicity_(database_systems)
https://en.wikipedia.org/wiki/Reiser4
> Neither flatpak nor ostree explicitly use O_SYNC to my knowledge. What ostree does is carefully sync objects during pulls and checkouts to ensure that both the repository and the installation are consistent.
I am not sure why I saw symptoms of severe write amplification if that is what it was doing. I don’t have time to look more deeply at the moment.
> wow, that use of fsync screams "bad idea" to me. solving a symptom instead of the problem. :(
> When ostree is used to handle your OS, then carefully syncing every file is something I very much want it to do. I don't want my repo to become corrupted and hence make my system unbootable.
It is preferable for package managers to use syncfs after writing a large number of files, rather than calling fsync on every file. If there is a crash during this, the package manager would need to deal with it as if it were repeating everything anyway, so using fsync so zealously does not really provide any benefit.
What ostree does in the default case when committing is:
- Download a bunch of files to a temporary directory
- syncfs on the temporary directory
- Rename all the objects into the real objects directory. This is content-addressable by sha256sum with a 2-level split after the 2nd character of the checksum, i.e. objects/1f/1482d1df7720a719c9b2a5f62f58db785fbbdef7feb78d0f3a3b1cf495e37e.file is a potential object path. Therefore, there are potentially 256 object subdirectories.
- fsync each of the objects subdirectories and the objects directory itself to ensure the renames and potential new subdirectories are on disk.
In the comment pointed to above, it talks about an optional per_object_fsync mode where each individual object is fsync'd. That isn't the default, though.
When checking out, ostree does no syncing on its own by default and flatpak does a single syncfs on the checkout directory.
> It is preferable for package managers to use syncfs after writing a large number of files, rather than calling fsync on every file. If there is a crash during this, the package manager would need to deal with it as if it were repeating everything anyway, so using fsync so zealously does not really provide any benefit.
This is more or less exactly what it does, like I said half a year ago: https://github.com/openzfs/zfs/issues/11140#issuecomment-726257799. The main difference is that after the syncfs we also fsync() some directories to push directory metadata to disk.
I'm less than impressed with the constant blaming of ostree and the speculation about what it may do wrong in this issue, with little actual analysis of the problem (or, apparently, even reading the replies).