Introduce subdirectories inside of data/ to reduce the number of files in a single directory
Right now all of the data files in a repository go into a single directory (data/). Given an average chunk size of 1MB, a 10TB repository would mean 10 million files end up in a single directory.
Some tools can struggle to handle very large numbers of files in a directory if they try to hold information about all of the files in memory at once. (Similar to #314, although in that case it took a lot more than 10 million files to cause issues.) For instance, running ls on millions of files is a bad idea. And at some point it's apparently possible to hit actual filesystem limits; I ran across this example of someone experiencing issues on ext4 with a directory that had around 5 million files: https://adammonsen.com/post/1555/
What about using a hierarchical approach, with a few levels of subdirectories inside of data/ to split things up based on the leading characters in each filename? For instance, a file named 0123456789abcdef would be stored at the path data/0/1/2/0123456789abcdef.
This would reduce the expected number of files in any single leaf directory by a factor of 16^3 = 4096, so for instance a 10TB repository would have around 2400 files per leaf directory.
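For concreteness, here's a minimal sketch (illustrative only, not actual bupstash code) of how a chunk's on-disk path could be derived from its hex name under this scheme:

```rust
use std::path::{Path, PathBuf};

// Map a hex-named chunk file onto three levels of single-character
// subdirectories, e.g. "0123456789abcdef" -> "data/0/1/2/0123456789abcdef".
fn chunk_path(data_dir: &Path, name: &str) -> PathBuf {
    debug_assert!(name.is_ascii() && name.len() > 3);
    let mut p = data_dir.to_path_buf();
    for c in name.chars().take(3) {
        p.push(c.to_string());
    }
    p.push(name);
    p
}

fn main() {
    let p = chunk_path(Path::new("data"), "0123456789abcdef");
    assert_eq!(p, PathBuf::from("data/0/1/2/0123456789abcdef"));
}
```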
Thoughts?
Something like this is definitely coming - it's a good idea - multi-TB repositories are becoming more and more common.
+1
git shards objects into hash prefix subdirs based on the first 2 hex characters (8 bits) of the hash (a hash such as efdeadbeef would be stored in ef/deadbeef).
For older filesystems, this helps avoid limits on the maximum number of files per directory, and performance bottlenecks dealing with large directories (large dirs can be really painful in filesystems that don't use b-trees).
Even for newer filesystems that have efficient tree lookup and insertion, using hash prefix subdirs can still be a great help because it can reduce filesystem lock contention. If N threads all want to add a new file to a directory, they can all end up serializing on a filesystem write lock for that directory. Spreading the files out over 256 subdirs (00 through ff) is an easy way to give those N threads 256 possible directory locks instead of just 1, which greatly reduces the chances that two or more of them will contend for the same lock in the filesystem code.
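As a sketch of that git-style layout (again just illustrative; the directory layout and helper names here are made up, not bupstash's), pre-creating the 256 shard directories and routing chunks by their 2-character prefix could look like this:

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Pre-create the 256 shard directories 00/ .. ff/ so concurrent writers
// spread their directory updates (and any per-directory locks) across
// 256 directories instead of contending on a single one.
fn create_shard_dirs(data_dir: &Path) -> std::io::Result<()> {
    for b in 0u32..256 {
        fs::create_dir_all(data_dir.join(format!("{:02x}", b)))?;
    }
    Ok(())
}

// Mirror git's objects/ layout: "efdeadbeef" is stored as "ef/deadbeef".
fn shard_path(data_dir: &Path, hex_name: &str) -> PathBuf {
    data_dir.join(&hex_name[..2]).join(&hex_name[2..])
}
```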
The main concern is how to deal with directory fsyncing - it's actually sort of an interesting challenge. Adding 256 dirs means we need a minimum of 256 open file handles to follow the strictest fsync semantics, which can push us above the default ulimit.
The likely solution is to make this configurable, with a lower default.
The default soft limit is low so that people notice fd leaks, and because most processes don't need to keep a lot of files open, but a process that does need a lot of open files can increase the limit by calling setrlimit(RLIMIT_NOFILE, ...), up to the hard limit (which only root can change).
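A minimal sketch of that approach (assuming the libc crate; not how bupstash actually handles it), raising the soft limit toward a target before opening one handle per shard directory:

```rust
// Raise the soft RLIMIT_NOFILE limit toward `target`, but never past the
// hard limit (only root can raise the hard limit itself).
fn raise_nofile_limit(target: libc::rlim_t) -> std::io::Result<()> {
    unsafe {
        let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
        if libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) != 0 {
            return Err(std::io::Error::last_os_error());
        }
        lim.rlim_cur = target.min(lim.rlim_max);
        if libc::setrlimit(libc::RLIMIT_NOFILE, &lim) != 0 {
            return Err(std::io::Error::last_os_error());
        }
        Ok(())
    }
}
```

For example, raise_nofile_limit(4096) would cover 256 directory handles with plenty of headroom left over for data files.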
The hard limits are high enough to not seem like a problem to me, at least on the macOS and Linux systems I have at the moment:
macOS 11.7.2: ulimit -Hn == unlimited
Arch Linux: ulimit -Hn == 524288
#130
Just wanted to second this request, as I have terabytes of data that I would like to back up using bupstash. I am hesitant to proceed, though, due to the extreme number of files that would be generated in one folder.
Have been a bit busy, but this change will be coming. I have an implementation in progress.
I just ran into an issue with about 5.2 million files in the data directory on ext4 and was scratching my head for a minute, since there was still plenty of space and inodes left. I managed to get it working again with tune2fs -O large_dir [block-device].