Faster way to download the dataset?

Open MRJasonP opened this issue 3 years ago • 6 comments

Thank you for the amazing work and for sharing these valuable datasets. I have registered as a credentialed user and followed the instructions to download the dataset with the "wget -r -N -c -np --use ...." command. The download works, but it proceeds image by image, file by file, and folder by folder, which is too slow given that the total dataset size is 4.6 TB. Is there a link or method for downloading everything as a single zip file? I believe that would be much faster and more convenient.

Thank you

MRJasonP avatar Apr 16 '21 22:04 MRJasonP

@MRJasonP At the moment, the fastest approach is to download from the Google Cloud bucket. You can find instructions in the Files section of the project on PhysioNet: https://physionet.org/content/mimic-cxr/#files-panel

tompollard avatar Apr 17 '21 04:04 tompollard

@tompollard Thank you for the reply. I am not very familiar with Google Cloud buckets, but it seems there is still no way to download everything as a single compressed file?

MRJasonP avatar Apr 17 '21 04:04 MRJasonP

No, we haven't created a single ~5 TB file with all of the data. Our hunch was that it wouldn't be a convenient form for downloading the data. If you add -m to the gsutil command (which runs the transfers in parallel), you can get fairly high bandwidth.

alistairewj avatar May 21 '21 01:05 alistairewj

Thank you for the amazing work and for sharing these valuable datasets. I have registered as a credentialed user and followed the instructions to download the dataset. When I use the wget command, I get files in HTML format rather than JPG. I am also not very familiar with GCP, and I get a BadRequestException: 400 error when I use the gsutil command.

Thank you!

Yaozuwu avatar Jun 22 '21 03:06 Yaozuwu

> Thank you for the amazing work and for sharing these valuable datasets. I have registered as a credentialed user and followed the instructions to download the dataset. When I use the wget command, I get files in HTML format rather than JPG.

Eventually it will download the JPGs. Unfortunately, wget downloads all the HTML files first, which makes it very inefficient. As far as I know, there isn't a way to work around this limitation (wget's option for filtering files by extension deletes files after download, so it doesn't help much).

> I am also not very familiar with GCP, and I get a BadRequestException: 400 error when I use the gsutil command.

Maybe open a new issue for this.

alistairewj avatar Jun 29 '21 16:06 alistairewj

> I am also not very familiar with GCP, and I get a BadRequestException: 400 error when I use the gsutil command.

Unfortunately, we have recently had to switch downloads from Google Cloud to "Requester Pays" because we were no longer able to cover the costs for users worldwide. I'm hoping that this is a temporary measure.

Most likely you are seeing the BadRequestException because you have not configured your gsutil project id, which is needed for billing purposes. If you add a project id, you should be able to download the data from Google Cloud.
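As a rough sketch of what such a command might look like (this is not taken from the PhysioNet instructions), the Python helper below assembles a gsutil invocation that combines the -m flag mentioned earlier in this thread with gsutil's top-level -u option, which names the project to bill. The bucket URL, destination folder, and project id are placeholders, not the real MIMIC-CXR values; take the actual bucket path from the Files section of the project page. The helper only builds the argument list and does not run anything.

```python
import shlex


def build_gsutil_command(bucket_url, dest_dir, billing_project=None, parallel=True):
    """Return the argument list for a recursive gsutil copy (nothing is executed)."""
    cmd = ["gsutil"]
    if parallel:
        # -m asks gsutil to perform the copy with parallel transfers.
        cmd.append("-m")
    if billing_project:
        # -u names the project billed for a Requester Pays bucket.
        cmd += ["-u", billing_project]
    # cp -r copies everything under bucket_url into dest_dir.
    cmd += ["cp", "-r", bucket_url, dest_dir]
    return cmd


if __name__ == "__main__":
    # All three values below are hypothetical placeholders.
    cmd = build_gsutil_command(
        "gs://example-bucket/files",
        "./mimic-cxr",
        billing_project="my-billing-project",
    )
    # Print the command so it can be reviewed before running it in a shell.
    print(shlex.join(cmd))
```

To actually start the download, the returned list could be passed to subprocess.run(cmd, check=True) on a machine where gsutil is installed and authenticated with your credentialed account.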

tompollard avatar Jun 29 '21 16:06 tompollard