How to quickly extract the dataset

Open Raion-Shin opened this issue 1 year ago • 2 comments

I downloaded the .tar.gz file from https://huggingface.co/datasets/TIGER-Lab/M-BEIR, but it's really large, and the pv command estimates that extracting it will take 2.5 days! Could you provide smaller archives, with each dataset packaged into its own zip file? Thanks very much!

Raion-Shin, Sep 02 '24 12:09

After downloading the .tar.gz files, use the following command to combine the files into a single file: sh -c 'cat mbeir_images.tar.gz.part-00 mbeir_images.tar.gz.part-01 mbeir_images.tar.gz.part-02 mbeir_images.tar.gz.part-03 > mbeir_images.tar.gz'

Next extract images from the combined file: tar -xzf mbeir_images.tar.gz

It will not take 2.5 days. I was able to complete the whole process in just 10 hours.
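
A variation on the two commands above, offered only as a sketch: the parts can be piped straight into tar, which skips writing the combined mbeir_images.tar.gz to disk before reading it back. The function name extract_parts and the destination directory are hypothetical, and the sketch assumes all part files share one prefix and sit in the same directory. Decompression itself is still done by a single gzip stream, so the saving here is disk I/O and space rather than CPU time.

```shell
#!/bin/sh
# Hypothetical helper (not part of UniIR): stream split archive parts
# directly into tar instead of first writing one combined .tar.gz file.
# Usage: extract_parts <archive-prefix> <destination-dir>
extract_parts() {
    prefix=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # The shell expands the glob in sorted order, so zero-padded parts
    # (part-00, part-01, ...) are concatenated in the right sequence.
    # "-f -" tells tar to read the archive from standard input.
    cat "$prefix".part-* | tar -xzf - -C "$dest"
}
```

For the files in this thread, the call would look like: extract_parts mbeir_images.tar.gz ./mbeir_images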

nrdyava, Sep 30 '24 22:09

Thanks. But I'm extracting it on a 2-core CPU, so it takes a long time. It would be better if you split it into many smaller zip files.

Raion-Shin, Nov 21 '24 01:11