low read throughput - lots of small files
In a scenario where a bucket contains a lot of small files, I find that goofys reads are slow compared to, say, a bucket with a few large files.
I tried on a t3.large EC2 instance (2 vCPUs, 4GB RAM) and on a t3.2xlarge (8 vCPUs, 32GB RAM); in both cases it took ~90 seconds to read ~1000 small files (500KB total). On the other hand, reading 15 large files (between 1GB and 10GB each, 30GB total) took 739 seconds. In another test I saw that it took 27 minutes to read 25k small files. For testing I simply copied (cp) from the mounted dir to a local dir. In the large-file scenario the network was fully saturated (5Gb), while with the small files it was only a few KB/s, which kind of makes sense.
I would expect the number of files not to make such a difference, with total size playing the bigger role in the timing. But it seems that read time mostly depends on the number of files, maybe because each file needs a context switch and gets its own thread or something of this sort, so time is spent on context switching and worker management instead of actually downloading.
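In case it's useful, this is roughly the kind of parallel read test I could run against the mount to check whether per-file latency (rather than bandwidth) is the bottleneck. The paths and worker count below are just placeholders, and it assumes a flat directory of small files:

```python
import os
import shutil
import time
from concurrent.futures import ThreadPoolExecutor

SRC = "/mnt/goofys-bucket/small-files"  # goofys mount point (placeholder)
DST = "/tmp/small-files"                # local destination (placeholder)

def copy_one(name):
    # Copy a single file from the mount to local disk
    shutil.copy(os.path.join(SRC, name), os.path.join(DST, name))

os.makedirs(DST, exist_ok=True)
files = os.listdir(SRC)

start = time.time()
# Issue many reads concurrently instead of one-at-a-time like cp
with ThreadPoolExecutor(max_workers=64) as pool:
    list(pool.map(copy_one, files))
print(f"copied {len(files)} files in {time.time() - start:.1f}s")
```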
Is there anything I can do to improve read time in this scenario of many small files?
Thanks for this great lib!!!