mimic-code
Faster way to download the dataset?
Thank you for the amazing work, and for sharing the valuable datasets. I have registered as a credential user and followed the instruction to download the dataset by "wget -r -N -c -np --use ...." command. I am able to download the dataset, but it seems it is downloading as a directory (image by image, file by file, folder by folder), it is too slow given the total dataset size is 4.6TB. I am wondering if there is a link/method that I can download everything in a single zip file, I believe it will be much faster and more convenient.
Thank you
@MRJasonP At the moment, the fastest approach is to download from the Google Cloud bucket. You can find instructions in the Files section of the project on PhysioNet: https://physionet.org/content/mimic-cxr/#files-panel
@tompollard Thank you for the reply. I am not very familiar with the Google Cloud bucket platform, but it seems there is still no way to download everything together as a single compressed file?
No, we haven't created a single ~5 TB file with all of the data. Our hunch was that it wasn't a convenient form for downloading the data. If you add -m to the gsutil command you can get fairly high bandwidth.
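As a sketch, a parallel download with gsutil might look like the following. The bucket name below is a placeholder, not the real path; substitute the bucket listed in the Files section on PhysioNet.

```shell
# Hypothetical example: -m enables parallel (multi-threaded/multi-process)
# transfers, which is what gives the higher bandwidth mentioned above.
# BUCKET_NAME is a placeholder; use the bucket path from PhysioNet.
gsutil -m cp -r gs://BUCKET_NAME/files ./mimic-cxr
```

The -m flag applies to cp, rsync, and most other gsutil commands, so the same option works if you prefer gsutil rsync for resumable, incremental transfers.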
Thank you for the amazing work, and for sharing the valuable datasets. I have registered as a credentialed user and followed the instructions to download the dataset. I would like to ask: when I use the wget command to download, I get files in HTML format rather than JPG. I am not very familiar with GCP, and I get a BadRequestException: 400 error when I use the gsutil command.
Thank you!
Thank you for the amazing work, and for sharing the valuable datasets. I have registered as a credentialed user and followed the instructions to download the dataset. I would like to ask: when I use the wget command to download, I get files in HTML format rather than JPG.
Eventually it will download the JPGs. Unfortunately, wget downloads all of the HTML index pages first, which makes it very inefficient. As far as I know there is no way to work around this limitation of wget (its option to filter files by extension only deletes non-matching files after they have been downloaded, so it doesn't help much).
I am not very familiar with GCP, and I get a BadRequestException: 400 error when I use the gsutil command.
Maybe open a new issue for this.
I am not very familiar with GCP, and I get a BadRequestException: 400 error when I use the gsutil command.
Unfortunately, we have recently had to switch the Google Cloud downloads to "Requester Pays" because we were no longer able to cover the costs for users worldwide. I'm hoping that this is a temporary measure.
Most likely you are seeing the BadRequestException because you have not configured your gsutil project id, which is needed for billing purposes. If you supply a project id, you should be able to download the data from Google Cloud.
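To illustrate, the project id can be supplied in one of two ways; the project id and bucket name below are placeholders, not the actual values.

```shell
# One-time setup: set a default project for gcloud/gsutil.
# YOUR_PROJECT_ID is a placeholder for your own GCP project.
gcloud config set project YOUR_PROJECT_ID

# Or pass the billing project per command with -u. On a Requester Pays
# bucket, omitting -u (or an unset project) is what typically triggers
# the BadRequestException: 400 error described above.
gsutil -u YOUR_PROJECT_ID -m cp -r gs://BUCKET_NAME/files ./mimic-cxr
```

Note that with Requester Pays, egress charges for the transfer are billed to the project you specify, so check your project's billing settings before starting a multi-terabyte download.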