issues with btrfs
Hi.
Not sure whether dCache should really take care of this or not, but given the recent discussions about the problems in Göttingen, it occurred to me that the default pool gap of (in most cases) 4 GB might be too small for CoW filesystems (e.g. btrfs or ZFS).
Given the nature of CoW filesystems, they tend to have problems when they get pretty full. Even deleting files then requires some free space first, which can in principle result in a deadlock situation (though btrfs, and presumably ZFS as well, have some safeguards against this, e.g. the global reserve).
Yet there are other "default operations" which can cause similar problems; with btrfs I would mostly think of "balance". I seemed to remember that the maximum extent size for btrfs was 1 GB, but looking at
ctree.h:#define BTRFS_MAX_EXTENT_SIZE SZ_128M
it seems to be 128 MiB.
Yet a 4 GB gap might still be a little tight on btrfs.
Cheers, Chris.
Hi Chris,
Thanks for the useful information!
Would an alternative be to explicitly configure the pool size (e.g., the pool.size configuration property) to be less than 100% of the available capacity?
The reason I mention this alternative is that I always imagined the main purpose of the pool's gap is to support uploads where dCache doesn't know how much data to expect. The set gap admin command help says:
New transfers will not be assigned to a pool once it has less free space than the gap. This is to ensure that there is a reasonable chance for ongoing transfers to complete. To prevent that writes will fail due to lack of space, the gap should be in the order of the expected largest file size multiplied by the largest number of concurrent writes expected to a pool, although a smaller value will often do.
It is not an error for a pool to have less free space than the gap.
So, the default value was based on assuming pools tend to accept a single upload at a time, and files are about 4 GiB in size.
That said, dCache currently only selects pools for an upload that have sufficient space to accept the file without encroaching into the gap, so the effect is as you describe: setting the gap keeps a certain amount of space unused, if at all possible.
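For illustration, that would be a single line per pool in the layout configuration, along the lines of the following (the value is only a placeholder meaning "a bit less than the partition's capacity"; the exact units pool.size accepts should be checked in the documentation):
pool.size=90000000000000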
Hey Paul.
Admins typically don't want to set pool.size and prefer some auto-discovery instead... ;-)
In the meantime (for several months now), we have been using btrfs on some 18 pool nodes, each with 16 × 14 TB... and actually, things seem much worse with btrfs than I expected.
Of course we could now just say... screw btrfs for dCache... but it does have some pretty nice features (mostly the checksumming, for our purposes)... and many distros choose it now as their default (I think SUSE, Fedora... so there's a good chance that RHEL will pick it up too, sooner or later, despite their denials).
So far I've noted two issues:
- The first is what I describe in the extra issue #6354 ... I guess it might be relevant for other filesystems, too.
- The second is trickier to explain.
We recently had two pools where dCache crashed and could not start up again.
The reason was ENOSPC, even though df and even btrfs's own tools showed some ~800 GiB of unused space.
A bit of explanation on how btrfs operates (AFAIU): from the actual device space it allocates block groups for either data or metadata (unless one uses mixed block groups, which is however not recommended for larger filesystems) in larger chunks (IIRC up to 1 GiB). These are then reserved for the respective purpose (either data or metadata). They can even be empty, or partially or fully filled with the respective type of data.
What we had was that all unallocated space had been allocated as either data or metadata block groups... and while the data block groups still had some ~800 GiB free, the metadata block groups were full except for some strictly reserved part which cannot be used.
That then had two consequences:
- dCache thought there was still plenty of space left... and while there was for data, there wasn't for metadata, so any writes got ENOSPC because no metadata could be written anymore
- dCache itself crashed/couldn't start again, because dCache's own metadata/lockfile writes failed, too.
The latter was because, for the new pools, I had originally put dCache's metadata in the same filesystem as the pool data. I hadn't done that in the past, but with ext4 we never came close to completely filling a pool, so I figured: why the extra work?
I had some conversation with upstream: https://lore.kernel.org/linux-btrfs/CAHzMYBR9dFVTw5kJ9_DfkcuvdrO4x+koicfiWgVNndh8qU2aEw@mail.gmail.com/T/#t
It seems the excessive use of metadata storage comes from large files and the checksums btrfs stores for them, in combination with some unfortunate order of writing/deleting files and fragmentation of the data block groups.
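(For reference: the split between data and metadata block groups, and how full each of them is, can be inspected with btrfs' own tools, e.g. "btrfs filesystem df <mountpoint>" or the more detailed "btrfs filesystem usage <mountpoint>".)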
I guess there are only a few things that can be done on the dCache side.
Some people indicated that it would be a problem if an application pre-allocates the storage of large files (before these are written). Is dCache doing that?
How does dCache detect how much space is still free on a fs?
How does it react if it gets ENOSPC? Does it continue trying to write to such a pool? But even simply not doing so wouldn't really be a good solution... as the ENOSPC could go away again.
In principle dCache could also try to detect whether btrfs is used, and if so... use btrfs' more detailed tools for disk usage. But I guess even that would be hard to do really generically.
Hi Chris,
That's certainly interesting work, trying to get btrfs working. My impression is that, a number of years ago, btrfs looked very promising, possibly eclipsing zfs; however, the momentum seems to have stalled. Today, I'm not sure if there's any compelling reason to use btrfs over zfs.
As a quick work-around, my initial guess is that perhaps setting a high enough gap would fix the issue.
On to the specific questions:
Yes, dCache will pre-allocate space if the network protocol allows the pool (either directly or via the door) to know how large the file should be. This is done before the mover starts writing data to disk.
The pool auto-discovers the available free space by looking up the FileStore object for the pool's data directory. This roughly corresponds to a partition on Linux systems. dCache then calls getUsableSpace() on this object. This method should return the number of bytes available for writing data (although there are no guarantees). The FileStore object has another method, getUnallocatedSpace(), that returns a number that should include any overhead.
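As a minimal sketch (illustrative only, not the actual pool code; the path is a placeholder), the lookup boils down to a few lines of Java:
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PoolSpace {
    public static void main(String[] args) throws IOException {
        Path data = Paths.get("/pool-1/data");           // placeholder pool data directory
        FileStore store = Files.getFileStore(data);      // the partition/filesystem backing that path
        long usable = store.getUsableSpace();            // bytes available to us for writing data
        long unallocated = store.getUnallocatedSpace();  // free bytes, possibly including reserved overhead
        System.out.printf("usable=%d unallocated=%d%n", usable, unallocated);
    }
}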
I don't know the internal details of FileStore, but I imagine it uses statvfs(3) to obtain data. This returns a structure like:
struct statvfs {
unsigned long f_bsize; /* Filesystem block size */
unsigned long f_frsize; /* Fragment size */
fsblkcnt_t f_blocks; /* Size of fs in f_frsize units */
fsblkcnt_t f_bfree; /* Number of free blocks */
fsblkcnt_t f_bavail; /* Number of free blocks for
unprivileged users */
.... etc ....
I further imagine getUnallocatedSpace() returns the value of f_bfree and getUsableSpace() returns the value of f_bavail (scaled as appropriate). If so, then the problem you described sounds like a bug in btrfs: it is returning an f_bavail value that includes both the free space for metadata and the free space for user data. I don't think that's the intended semantics of f_bavail.
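(For completeness: the block counts in struct statvfs are expressed in units of f_frsize, so the usable byte count would be roughly f_bavail * f_frsize, which I assume is the "scaling" that happens somewhere inside the Java runtime.)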
I can't say for sure how dCache reacts to ENOSPC. What it should do is abort the transfer and do the post-processing step. It's protocol-specific how incomplete uploads are handled: some doors will delete the file. The dCache-internal accounting will be updated, based on the bytes that were successfully written, but the total space and free space will not be adjusted as a result of such a failure.
Personally, I wouldn't go down the route of adding btrfs-specific support in dCache unless btrfs provides some compelling feature that benefits dCache. Yes, we could add something for space accounting; however, that would only make sense if we see a large number of people using btrfs (or anticipate a large number in the future). Right now, though, this looks like a bug in btrfs.
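(If we ever did go down that route, java.nio.file.FileStore at least has a type() method that reports the filesystem type, so detecting btrfs itself would be straightforward; the harder part would be interpreting btrfs' own accounting generically.)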
A short-term solution might be to increase the gap.
Longer-term, see if you can get btrfs to fix the space accounting information they're providing.
Hey.
Actually, I'd have said the opposite, at least for the last year or so. The Fedora switch to btrfs seems to have brought quite some momentum into btrfs development, especially in terms of committed developers. SUSE seems to be continuing its long-time support, and Facebook also seems to provide some long-term funding.
ZFS always had (and likely always will have) the license issue, which in turn always kinda excludes it from proper upstream support from the Linux kernel community.
As a quick work-around, my initial guess is that perhaps setting a high enough gap would fix the issue.
Yes, in principle... though I wouldn't want to set a gap to some 900 GB... that's too much loss. I'd expect that some regular minimal balance may also effectively help, without putting too much I/O load on the system.
Yes, dCache will pre-allocate space if the network protocol allows the pool (either directly or via the door) to know how large the file should be. This is done before the mover starts writing data to disk.
Sounds quite reasonable... TBH, I don't understand why the btrfs people said that this could become an issue for btrfs.
If so, then the problem you described sounds like a bug in btrfs
Well, free space in any CoW filesystem is a difficult concept... but I kinda agree that it should perhaps return 0 if it already sees that no further metadata space is available.
ZFS always had (and likely always will have) the license issue
Yes, I quite agree. Although I'm also surprised how little that seems to matter. AFAIK, it's relatively easy to get ZFS running on all major distros. IIRC, this works because it's the user combining the two pieces (Linux + ZFS) without any redistribution.
I wouldn't want to set a gap to some 900GB ... that's too much loss
Well, you're losing 800 GiB because btrfs is reserving that amount for metadata, which isn't being used. You need to fix that problem first!
I kinda agree that it should perhaps return 0 if [no] space is available.
This is actually the primary problem.
What I didn't say earlier is that, over time, the pool will monitor the partition's free space and adjust the dCache-internal accounting accordingly. The f_bavail value doesn't have to be 100% correct.
The most fundamental problem is as you say: btrfs (through f_bavail) claims there's a non-trivial amount of free space, while "at the same time" returning ENOSPC on write(2). If they (the btrfs developers) updated the code so statvfs(3) returns an f_bavail of zero after ENOSPC, then I think we could survive.
It would be better if the "error" in f_bavail decreased as the disk started to fill up. But, even without this, I think we would have a functioning system.
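To make that concrete, the behaviour I have in mind is roughly the following (again just an illustration with placeholder names, not the actual pool code): periodically re-read the filesystem's usable space and use it as an upper bound on the pool's own idea of free space, so a filesystem that reports zero after ENOSPC would automatically stop attracting new writes.
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class FreeSpaceMonitor {
    private final AtomicLong accountedFree;   // the pool's internal notion of free space
    private final Path dataDir;

    FreeSpaceMonitor(Path dataDir, long initialFree) {
        this.dataDir = dataDir;
        this.accountedFree = new AtomicLong(initialFree);
    }

    void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                FileStore store = Files.getFileStore(dataDir);
                long usable = store.getUsableSpace();
                // Trust the filesystem when it reports less space than we think
                // we have: clamp the internal accounting downwards.  An f_bavail
                // of zero after ENOSPC would therefore stop new writes here.
                accountedFree.updateAndGet(current -> Math.min(current, usable));
            } catch (IOException e) {
                // Transient error; the next poll will try again.
            }
        }, 0, 60, TimeUnit.SECONDS);
    }

    long freeSpace() {
        return accountedFree.get();
    }

    public static void main(String[] args) {
        new FreeSpaceMonitor(Paths.get("/pool-1/data"), 1L << 40).start();  // placeholder path and size
    }
}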
Well, you're losing 800 GiB because btrfs is reserving that amount for metadata, which isn't being used.
No, it's actually reserved for data, but the metadata ran full, and so no new files can be written.
I kinda also have an idea now why the pre-allocation (of space for a file) could be bad... but again, I would consider that a problem of btrfs.
One other question came up for me from an operational PoV: Does dCache get into trouble if I create (empty) files like:
pooldir/.pool_${poolname}
pooldir/data/.pool_${poolname}
pooldir/meta/.pool_${poolname}
pooldir/control/.pool_${poolname}
?
The reason is that with btrfs one probably wants to keep the data on a different fs from everything else. In particular, it's not enough to just put meta and control on another fs, because if the data fs were full (e.g. as described above), dCache could not even start anymore, as it would try to create pooldir/lock.
So I think the best would be to mount one fs for dCache's metadata on pooldir/ (or even below it, if there are more pools)... and the respective pool data fs on pooldir/data/.
This has one consequence though... one cannot use e.g.:
pool.wait-for-files=${pool.path}/data:${pool.path}/meta
to make sure that both filesystems are mounted, as ${pool.path}/data would exist even if the data fs wasn't mounted.
And for that reason I was asking whether one can create:
pooldir/.pool_${poolname}
pooldir/data/.pool_${poolname}
pooldir/meta/.pool_${poolname}
pooldir/control/.pool_${poolname}
without any harm... so that one could check for exactly these.
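E.g. pointing the check at exactly those sentinel files (assuming pool.wait-for-files also accepts plain files, and using the same placeholder name as above):
pool.wait-for-files=${pool.path}/.pool_${poolname}:${pool.path}/data/.pool_${poolname}:${pool.path}/meta/.pool_${poolname}:${pool.path}/control/.pool_${poolname}
Then a not-yet-mounted data fs would be noticed, because its sentinel file only exists inside the mounted filesystem.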