mimic-code Faster way to download the dataset?

Open MRJasonP opened this issue 3 years ago • 6 comments

Thank you for the amazing work, and for sharing these valuable datasets. I have registered as a credentialed user and followed the instructions to download the dataset with the "wget -r -N -c -np --use ...." command. The download works, but it proceeds through the directory tree one item at a time (image by image, file by file, folder by folder), which is too slow given that the total dataset size is 4.6TB. Is there a link or method for downloading everything as a single zip file? I believe that would be much faster and more convenient.

Thank you

Apr 16 '21 22:04 MRJasonP

@MRJasonP At the moment, the fastest approach is to download from the Google Cloud bucket. You can find instructions in the Files section of the project on PhysioNet: https://physionet.org/content/mimic-cxr/#files-panel

Apr 17 '21 04:04 tompollard

@tompollard Thank you for the reply. I am not very familiar with Google Cloud buckets, but it seems there is still no way to download everything together as a single compressed file?

Apr 17 '21 04:04 MRJasonP

No, we haven't created a single ~5 TB file with all of the data. Our hunch was that it wouldn't be a convenient form for downloading the data. If you add -m to the gsutil command, you can get fairly high bandwidth.
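For readers unfamiliar with gsutil, here is a minimal sketch of what adding -m might look like, written as a small Python helper that only builds and prints the command rather than running it. The bucket name `gs://EXAMPLE-BUCKET` and the destination `./mimic-cxr` are placeholders, not the real bucket; the actual bucket address is given in the Files section on PhysioNet. The helper name is made up for illustration.

```python
import shlex
from typing import List


def build_parallel_copy_cmd(bucket_url: str, dest_dir: str) -> List[str]:
    """Build a gsutil command that copies a bucket recursively.

    -m is a top-level gsutil option that runs the copy with multiple
    threads/processes; it must come before the "cp" subcommand.
    -r makes "cp" recurse into the bucket's folders.
    """
    return ["gsutil", "-m", "cp", "-r", bucket_url, dest_dir]


# Placeholder bucket and destination (assumptions, not the real values).
cmd = build_parallel_copy_cmd("gs://EXAMPLE-BUCKET", "./mimic-cxr")

# Print the command so it can be inspected or pasted into a terminal.
print(shlex.join(cmd))
```

To actually start the download, the list could be passed to `subprocess.run(cmd, check=True)`, assuming gsutil is installed and authenticated.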

May 21 '21 01:05 alistairewj

> No, we haven't created a single ~5 TB file with all of the data. Our hunch was that it wouldn't be a convenient form for downloading the data. If you add -m to the gsutil command, you can get fairly high bandwidth.

Thank you for the amazing work, and for sharing these valuable datasets. I have registered as a credentialed user and followed the instructions to download the dataset. When I use the wget command to download, I get files in HTML format rather than JPG format. I am not very familiar with GCP, and I get a BadRequestException: 400 error when I use the gsutil command.

Thank you!

Jun 22 '21 03:06 Yaozuwu

> Thank you for the amazing work, and for sharing these valuable datasets. I have registered as a credentialed user and followed the instructions to download the dataset. When I use the wget command to download, I get files in HTML format rather than JPG format.

Eventually it will download the JPGs. Unfortunately wget downloads all the HTML files first, and as such is very inefficient. There isn't a way to work around this limitation of wget as far as I know (the filter files by extension option of wget deletes files after download, so it doesn't help much).

> I am not very familiar with GCP, and I get a BadRequestException: 400 error when I use the gsutil command.

Maybe open a new issue for this.

Jun 29 '21 16:06 alistairewj

> I am not very familiar with GCP, and I get a BadRequestException: 400 error when I use the gsutil command.

Unfortunately, we have recently had to switch downloads from Google Cloud to "Requester Pays" because we were no longer able to cover the costs for users worldwide. I'm hoping that this is a temporary measure.

Most likely you are seeing the BadRequestException because you have not configured your gsutil project id, which is needed for billing purposes. If you add a project id, you should be able to download the data from Google Cloud.
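One way to supply a project id is gsutil's top-level -u option, which names the project to bill for the request. The sketch below shows where that option would go, again as a Python helper that only builds the argument list. The project id `my-billing-project`, the bucket `gs://EXAMPLE-BUCKET`, and the helper name are all hypothetical; substitute your own Google Cloud project and the bucket listed on PhysioNet.

```python
from typing import List


def add_billing_project(gsutil_cmd: List[str], project_id: str) -> List[str]:
    """Return a copy of a gsutil command with "-u PROJECT_ID" added.

    -u is a top-level gsutil option, so it is inserted directly after
    the "gsutil" executable and before the subcommand (e.g. "cp").
    """
    if not gsutil_cmd or gsutil_cmd[0] != "gsutil":
        raise ValueError("expected a command starting with 'gsutil'")
    if not project_id:
        raise ValueError("project_id must be a non-empty string")
    return [gsutil_cmd[0], "-u", project_id] + gsutil_cmd[1:]


# Placeholder values (assumptions): replace with your own project and bucket.
base_cmd = ["gsutil", "-m", "cp", "-r", "gs://EXAMPLE-BUCKET", "./mimic-cxr"]
billed_cmd = add_billing_project(base_cmd, "my-billing-project")
print(" ".join(billed_cmd))
```

The named project needs billing enabled, since under Requester Pays the download charges go to the requester rather than to the bucket owner.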

Jun 29 '21 16:06 tompollard
