
Creating a dataset with many small files is painfully slow

Open • mitar opened this issue 6 years ago • 6 comments

I have a dataset with 200k small files (~300 B each). Creating this dataset is painfully slow.

dat v13.10.0
Created new dat in /.../.dat
dat://...
Sharing dat: 25053 files (6.9 MB)

1 connection | Download 0 B/s Upload 25 KB/s

Creating metadata for 70010 files (483 B/s)
[=============-----------------------------] 31%
ADD: … (296 B)

I am syncing it to another machine. 25 KB/s is just crazy slow. Also 483 B/s for metadata creation speed?
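
To reproduce at this scale without the original data, a rough bash sketch that generates a comparable flat directory (the 200k count and ~300 B size mirror the numbers above; `smallfiles` is a placeholder name):

$ # Hypothetical reproduction data: 200k files of ~300 B each,
$ # all in one flat directory, like the dataset described above.
$ mkdir smallfiles && cd smallfiles
$ for i in $(seq 1 200000); do
>   head -c 300 /dev/urandom > "file-$i"
> done

Generating 200k files this way takes a few minutes, but it yields a dataset with the same shape as the one in the report.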

mitar avatar Feb 21 '18 16:02 mitar

Also 483 B/s for metadata creation speed?

Ouch, yeah, this is a known bottleneck. It may go faster if you can put the files into subfolders with fewer files in each folder.
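
For cases where restructuring is possible, a rough bash sketch of that sharding, bucketing files by the first two characters of their names (the `data` directory name and the prefix scheme are just placeholders):

$ # Hypothetical: spread a flat directory across subfolders so no
$ # single folder holds all 200k entries.
$ cd data
$ for f in *; do
>   [ -f "$f" ] || continue   # skip anything that is not a regular file
>   d="${f:0:2}"              # bucket on the first two characters of the name
>   mkdir -p "$d" && mv "$f" "$d/"
> done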

We have solved some of these problems in hyperdb, which will be integrated into Dat. Until then it'll be slow, unfortunately.

joehand avatar Feb 21 '18 16:02 joehand

Sadly, the datasets are not structured by me, so I cannot move files around.

Will this also mean that cloning this dat repository will take just as long?

mitar avatar Feb 21 '18 17:02 mitar

Related: #915

mitar avatar Feb 22 '18 00:02 mitar

I think this is also related to Node being single-threaded and using only one CPU. The work looks CPU-bound (dat's CPU utilization sits at 100% while creating metadata). As a workaround, I can run multiple dat instances in parallel on different datasets with many files to improve overall throughput and utilize multiple cores; see the sketch below.
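
A rough bash sketch of that workaround; `dataset-a`/`dataset-b`/`dataset-c` are placeholder directories, and each `dat share` is its own Node process, so each can saturate a separate core:

$ # Hypothetical: one dat process per dataset, started in parallel.
$ for d in dataset-a dataset-b dataset-c; do
>   (cd "$d" && dat share) &
> done
$ wait   # keep all the shares running in the foreground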

mitar avatar Feb 23 '18 05:02 mitar

Adding the same dataset to git (no Git LFS):

real	0m33.932s
user	0m12.757s
sys	0m14.968s

Only about half a minute.
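
For reference, the timing above was along these lines (a sketch, with `dataset` standing in for the real directory; the exact commands timed are an assumption):

$ # Presumed git timing setup: init a fresh repo over the flat
$ # directory and time staging all 200k files.
$ cd dataset
$ git init -q
$ time git add .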

mitar avatar Feb 27 '18 17:02 mitar

Reproduction (a dataset similar to the one above):

$ git clone https://github.com/myleott/mnist_png.git
$ cd mnist_png
$ rm -rf .git                  # drop the git metadata; only the data matters
$ tar -xzf mnist_png.tar.gz    # the repo ships the images as a tarball
$ cd mnist_png                 # the tarball extracts into its own directory
$ dat create
$ dat share                    # metadata creation is the slow step
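
As a sanity check on scale, counting the extracted images should line up with the 70010-file metadata pass in the log above (MNIST is roughly 70k images):

$ # Expect roughly 70000 PNGs (60k training + 10k testing).
$ find . -type f -name '*.png' | wc -l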

mitar avatar Feb 27 '18 23:02 mitar