ability to handle large, flat directories without blowing up memory
Even mature backup programs like restic have memory issues with large numbers of files in one dir (think an S3 bucket, or backing up a mail program's data directory). For example:
https://github.com/restic/restic/issues/2446
These problems are often due to trying to process an entire directory at once, rather than part by part.
For instance, os.ReadDir() loads all directory entries and sorts them by filename before returning anything, which can be a problem: https://pkg.go.dev/os#ReadDir
I see four uses of os.ReadDir at the current origin/main HEAD, 506a90863ce5ef1c70a8e42cb52a8b791eec3c65:
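To make the failure mode concrete, here is a minimal sketch of the all-at-once pattern (the path is a placeholder): every entry is materialized and sorted before the first one can be processed.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// os.ReadDir reads the whole directory into memory and sorts it
	// by filename before returning, so a directory with millions of
	// entries means millions of DirEntry values allocated at once.
	entries, err := os.ReadDir("/some/huge/dir") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		fmt.Println(e.Name())
	}
}
```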
~/go/src/github.com/PlakarKorp/plakar (main) $ ack os.ReadDir
cmd/plakar/utils/utils.go
529: dirEntries, err := os.ReadDir(normalizedPath)
snapshot/restore_test.go
52: files, err := os.ReadDir(exporterInstance.Root())
storage/backends/fs/buckets.go
54: bucketsDir, err := os.ReadDir(buckets.path)
67: entries, err := os.ReadDir(path)
~/go/src/github.com/PlakarKorp/plakar (main) $
Testing for and handling a large number of files in one directory is common enough that it deserves its own set of test cases.
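As a starting point, a sketch of what such a test could look like; the test name, the file count, and the elided backup call are mine, not plakar's:

```go
package snapshot_test

import (
	"fmt"
	"os"
	"path/filepath"
	"testing"
)

// TestLargeFlatDir is a hypothetical sketch: fill a single directory
// with many small files, then back it up and watch memory and time.
// The count is modest here; a real stress test would hide a much
// larger N behind a flag or build tag.
func TestLargeFlatDir(t *testing.T) {
	dir := t.TempDir()
	const n = 100_000
	for i := 0; i < n; i++ {
		name := filepath.Join(dir, fmt.Sprintf("f%08d", i))
		if err := os.WriteFile(name, []byte("x"), 0o644); err != nil {
			t.Fatal(err)
		}
	}
	// ... run a backup of dir here and assert it completes within a
	// reasonable memory budget.
}
```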
Ok, so here's the good news:
The memory issues that restic (and kopia) face with large directories have been solved in plakar: it had them completely until June 2024, partially for a while after that, and no longer at all, thanks to algorithmic changes, a packfile-backed btree, and caching through databases that keep in-memory indexes over on-disk objects. You should be able to back up several million files, spread across a filesystem or all in a single directory, with no resource issues and very similar performance.
The three cases you pointed out are still valid, though they are not as deeply ingrained in the backup phase:
The utils.go use is there to ensure that the root pathname is normalized to its proper case on case-insensitive filesystems before a backup begins (i.e., if I enter ~/Wip by doing cd ~/wip, it works on my macOS, but things get weird because pathnames are relative to ~/wip while some system calls return ~/Wip). I will see how I can implement it without ReadDir.
The second case is just a restore test that validates restore works with a single file, so it's not going to be an issue. I'll think of a way to handle this differently in tests, though, since bigger tests would hit the same problem and we need those too.
The third case is part of the repository code and might actually be slightly problematic in the case where all packfiles are listed by the client (not part of the actual backup process), so I'll investigate that too.
Thanks
I will see how I can implement it without ReadDir.
If you open the directory as a file (getting a file descriptor first), there is an os.File method of the same name, ReadDir, but with very different properties: it does not sort, and if a batch size n > 0 is requested it hands back at most n entries per call, so one can make multiple calls and handle a small batch at a time. By "directory order", the docs mean not sorted, just the order in which entries are found (probably creation order; in any case, the fastest order, with no sorting applied).
https://pkg.go.dev/os#File.ReadDir
func (f *File) ReadDir(n int) ([]DirEntry, error)
ReadDir reads the contents of the directory associated with the file f and returns a
slice of DirEntry values in directory order. Subsequent calls on the same file
will yield later DirEntry records in the directory.
If n > 0, ReadDir returns at most n DirEntry records...
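In practice the read loop looks something like this minimal sketch (not plakar code; the path is a placeholder), where io.EOF signals that the directory is exhausted:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	// Open the directory as a file so we can use the batched
	// (*os.File).ReadDir instead of the all-at-once os.ReadDir.
	dir, err := os.Open("/some/huge/dir") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer dir.Close()

	for {
		// At most 512 DirEntry values are in memory at any time,
		// no matter how many files the directory holds.
		batch, err := dir.ReadDir(512)
		for _, e := range batch {
			fmt.Println(e.Name()) // process one entry at a time
		}
		if err == io.EOF {
			break // directory exhausted
		}
		if err != nil {
			log.Fatal(err)
		}
	}
}
```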
examples:
https://github.com/glycerine/b3/blob/master/walk.go#L49
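Applied to the utils.go case-normalization question above, a hypothetical sketch; normalizeCase is my name, not plakar's, and it assumes an absolute path:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"strings"
)

// normalizeCase is a hypothetical sketch (not the plakar helper): it
// resolves each component of an absolute path to its on-disk casing by
// reading the parent directory in small batches via (*os.File).ReadDir,
// stopping at the first case-insensitive match instead of loading and
// sorting every entry the way os.ReadDir would.
func normalizeCase(path string) (string, error) {
	out := string(filepath.Separator)
	for _, comp := range strings.Split(filepath.Clean(path), string(filepath.Separator)) {
		if comp == "" {
			continue // leading separator of the absolute path
		}
		dir, err := os.Open(out)
		if err != nil {
			return "", err
		}
		found := comp // fall back to the caller's spelling
	scan:
		for {
			batch, rerr := dir.ReadDir(512) // bounded memory per call
			for _, e := range batch {
				if strings.EqualFold(e.Name(), comp) {
					found = e.Name()
					break scan
				}
			}
			if rerr != nil { // io.EOF once exhausted, or a real error
				break
			}
		}
		dir.Close()
		out = filepath.Join(out, found)
	}
	return out, nil
}

func main() {
	p, err := normalizeCase("/Users/me/wip")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(p) // e.g. /Users/me/Wip on a case-insensitive filesystem
}
```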
Also, how to scan directories in parallel:
https://github.com/glycerine/parallelwalk
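And a minimal sketch of the idea (not parallelwalk's actual API): each directory is read in bounded batches, and subdirectories are descended into concurrently, with a semaphore capping how many directories are being read at once.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"path/filepath"
	"sync"
)

// walkParallel is a hypothetical sketch of parallel directory
// scanning: each directory is read in batches, entries are passed to
// visit (which must be safe to call concurrently), and subdirectories
// are descended into by new goroutines, with sem bounding how many
// directories are open at any moment.
func walkParallel(root string, sem chan struct{}, wg *sync.WaitGroup, visit func(path string, e os.DirEntry)) {
	defer wg.Done()
	sem <- struct{}{}        // acquire a concurrency slot
	defer func() { <-sem }() // release it when this directory is done

	dir, err := os.Open(root)
	if err != nil {
		log.Print(err)
		return
	}
	defer dir.Close()

	for {
		batch, err := dir.ReadDir(512) // bounded memory per directory
		for _, e := range batch {
			p := filepath.Join(root, e.Name())
			visit(p, e)
			if e.IsDir() {
				wg.Add(1)
				go walkParallel(p, sem, wg, visit)
			}
		}
		if err == io.EOF {
			break // this directory is exhausted
		}
		if err != nil {
			log.Print(err)
			return
		}
	}
}

func main() {
	sem := make(chan struct{}, 8) // at most 8 directories read at once
	var wg sync.WaitGroup
	wg.Add(1)
	go walkParallel("/some/tree", sem, &wg, func(p string, e os.DirEntry) {
		fmt.Println(p)
	})
	wg.Wait()
}
```

A production version would use a work queue rather than one goroutine per directory, but the bounded-batch plus bounded-concurrency shape is the same.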
The last consumer of os.ReadDir outside of controlled tests is now gone with #1126.