Creating a dataset with many small files is painfully slow
I have a dataset with 200k small files (~300 B each). Creating this dataset is painfully slow.
dat v13.10.0
Created new dat in /.../.dat
dat://...
Sharing dat: 25053 files (6.9 MB)
1 connection | Download 0 B/s Upload 25 KB/s
Creating metadata for 70010 files (483 B/s)
[=============-----------------------------] 31%
ADD: … (296 B)
I am syncing it to another machine. 25 KB/s is just crazy slow. And 483 B/s for metadata creation speed?
Ouch, yeah, this is a known bottleneck. It may go faster if you can put the files into subfolders with fewer files in each folder.
We have solved some of these problems in hyperdb, which will be integrated into Dat. But until then it'll be slow, unfortunately.
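For anyone whose dataset layout *can* be changed, the subfolder workaround above might be scripted along these lines. This is just a sketch (the helper name and the 256-bucket choice are my own, not part of dat); it hashes each filename so files spread evenly across subfolders:

```python
import hashlib
import os
import shutil

def shard_into_subfolders(src_dir, buckets=256):
    """Spread a flat directory of many small files into `buckets`
    subfolders (hypothetical helper, not part of dat).

    The bucket is derived from a hash of the filename, so repeated
    runs put each file in the same subfolder.
    """
    for name in os.listdir(src_dir):
        path = os.path.join(src_dir, name)
        if not os.path.isfile(path):
            continue  # skip subfolders created by earlier runs
        # Stable bucket: first byte of the filename's MD5 digest.
        bucket = hashlib.md5(name.encode()).digest()[0] % buckets
        dest = os.path.join(src_dir, "%02x" % bucket)
        os.makedirs(dest, exist_ok=True)
        shutil.move(path, os.path.join(dest, name))
```

With 256 buckets, 200k files becomes roughly 780 files per folder, which is the kind of layout the suggestion above is aiming for.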
Sadly, the datasets are not structured by me, so I cannot move the files around.
Will this also mean that cloning this dat repository will take just as long?
Related: #915
I think this is also related to Node being single-threaded and using only one CPU. The work looks CPU-bound (dat's CPU utilization sits at 100% while creating metadata). I can run multiple dat instances in parallel on different datasets with many files to improve overall throughput and utilize multiple cores.
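The one-instance-per-dataset workaround above could be driven by a small launcher like this. It's a sketch only: the function name is made up, and it assumes each directory already has a dat archive in it (`dat share` is the command being parallelized; swap in `dat create` for the metadata step):

```python
import subprocess

def share_in_parallel(dataset_dirs, cmd=("dat", "share")):
    """Launch one subprocess per dataset directory so each dat
    instance can occupy its own CPU core (hypothetical helper).

    Returns the exit code of each process, in the same order
    as `dataset_dirs`.
    """
    # Start all processes first so they actually run concurrently...
    procs = [subprocess.Popen(cmd, cwd=d) for d in dataset_dirs]
    # ...then wait for each one to finish.
    return [p.wait() for p in procs]
```

Since each dat process is single-threaded and CPU-bound, launching roughly as many instances as you have cores should scale the metadata work close to linearly, at the cost of splitting the data into separate dats.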
Adding the same dataset to git (no Git LFS):
real 0m33.932s
user 0m12.757s
sys 0m14.968s
Only 30 seconds.
Reproduction (a dataset similar to the one above):
$ git clone https://github.com/myleott/mnist_png.git
$ cd mnist_png
$ rm -rf .git
$ tar -xzf mnist_png.tar.gz
$ cd mnist_png
$ dat create
$ dat share