MIMIC-CXR-JPG sha256 sum mismatch
Prerequisites
- [X] Put an X between the brackets on this line if you have done all of the following:
- Checked the online documentation: https://mimic.mit.edu/
- Checked that your issue isn't already addressed: https://github.com/MIT-LCP/mimic-code/issues?utf8=%E2%9C%93&q=
Description
I downloaded the MIMIC-CXR-JPG dataset from Google Cloud Storage.
When I verify the SHA-256 sums, I get the following mismatches:
$ sha256sum -c SHA256SUMS.txt --quiet
files/p10/p10375986/s59475126/44685902-a2ada121-02735bc5-bf1bf167-adfd2ae5.jpg: FAILED
files/p11/p11131026/s59741822/08c22db9-5bef7d06-d904ec15-7bbfe57f-416dbdc1.jpg: FAILED
files/p11/p11607063/s58298420/235c7af4-ef2ba0dc-7dc251ea-a2571f33-d37c8185.jpg: FAILED
files/p11/p11785297/s58022353/3b64bf5a-021ff5ae-137c22d1-5529364f-1415c640.jpg: FAILED
files/p11/p11920643/s55676416/4d70ff33-43ad77af-22ff047c-19f6ceb1-aae49eea.jpg: FAILED
files/p13/p13283178/s55081421/026de108-3310a177-7c01791c-7eb32cff-b076122f.jpg: FAILED
files/p13/p13628037/s54872639/f845ad66-716c76dd-da718912-8b0ff596-b30d25cb.jpg: FAILED
files/p13/p13694166/s55805720/df57d48e-566984d2-fbe39e6e-0c68fc55-380f1217.jpg: FAILED
files/p14/p14656449/s56499991/67a4e5cd-50d441d3-42294f94-363ac071-17cfc342.jpg: FAILED
files/p14/p14690121/s50057475/34ad06d4-475863f1-f3712cec-783c3b99-308cf886.jpg: FAILED
files/p17/p17405329/s55291678/283084bb-0f4994a7-d7622b32-d7f18f75-d8dde41b.jpg: FAILED
files/p17/p17490145/s55463370/803fcbd8-2e38a5c7-cca96a50-ce5660cb-83ecc3a1.jpg: FAILED
files/p18/p18459824/s52186356/2eb68b2f-0742cb3d-b8c9db5b-9c9d74f9-69e31cc1.jpg: FAILED
files/p18/p18690742/s56844948/f4f63777-6a8a6b60-d6cb0718-9256537a-2ca41831.jpg: FAILED
sha256sum: WARNING: 14 computed checksums did NOT match
I redownloaded the above files and still got the same result.
For example, if I take the first one:
$ sha256sum files/p10/p10375986/s59475126/44685902-a2ada121-02735bc5-bf1bf167-adfd2ae5.jpg
8a95fb444bdfec8087c49f5fb0742e6674568dd7aca839a30310a6fdb4ff427c files/p10/p10375986/s59475126/44685902-a2ada121-02735bc5-bf1bf167-adfd2ae5.jpg
$ cat SHA256SUMS.txt | grep files/p10/p10375986/s59475126/44685902-a2ada121-02735bc5-bf1bf167-adfd2ae5.jpg
ed9a93b1fd0c9ff7c0601a79c8f6ae91c49b524a1b9a34315e065a830829df1b files/p10/p10375986/s59475126/44685902-a2ada121-02735bc5-bf1bf167-adfd2ae5.jpg
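For anyone who wants to reproduce the single-file comparison above without the shell tools, here is a minimal Python sketch. The helper names (`sha256_of`, `expected_sha256`, `matches_manifest`) are hypothetical, and it assumes SHA256SUMS.txt uses the usual `sha256sum` layout of one digest followed by one path per line:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_sha256(manifest_path, relative_path):
    """Look up the digest recorded for relative_path in a sha256sum-style manifest.

    Returns None if the path is not listed. (Assumed layout: "<digest> <path>".)
    """
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue
            recorded_digest, recorded_path = parts
            # sha256sum marks binary mode with a leading "*" before the file name
            if recorded_path.strip().lstrip("*") == relative_path:
                return recorded_digest.lower()
    return None


def matches_manifest(manifest_path, relative_path, root="."):
    """Return True if the file under root hashes to the digest in the manifest."""
    expected = expected_sha256(manifest_path, relative_path)
    if expected is None:
        raise KeyError(f"{relative_path} is not listed in {manifest_path}")
    return sha256_of(Path(root) / relative_path) == expected
```

Calling `matches_manifest("SHA256SUMS.txt", "files/p10/p10375986/s59475126/44685902-a2ada121-02735bc5-bf1bf167-adfd2ae5.jpg")` from the dataset root would then repeat the check for the first file in the list.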
Any ideas on how to further verify what may be causing this?
I downloaded the above files once again from GCS and ran a diff against my original copies. There was no diff, so the files in the bucket are consistently the same ones I already have. So I guess we just need to update the SHA256SUMS.txt file.
My script used for the above:
import os
import subprocess
# The sha256 checksums of these images don't match the ones listed in SHA256SUMS.txt,
# so we download them again and diff them against the local copies to confirm they are the same images,
# which likely means the sha256 sums need updating
image_paths = [
    "files/p10/p10375986/s59475126/44685902-a2ada121-02735bc5-bf1bf167-adfd2ae5.jpg",
    "files/p11/p11131026/s59741822/08c22db9-5bef7d06-d904ec15-7bbfe57f-416dbdc1.jpg",
    "files/p11/p11607063/s58298420/235c7af4-ef2ba0dc-7dc251ea-a2571f33-d37c8185.jpg",
    "files/p11/p11785297/s58022353/3b64bf5a-021ff5ae-137c22d1-5529364f-1415c640.jpg",
    "files/p11/p11920643/s55676416/4d70ff33-43ad77af-22ff047c-19f6ceb1-aae49eea.jpg",
    "files/p13/p13283178/s55081421/026de108-3310a177-7c01791c-7eb32cff-b076122f.jpg",
    "files/p13/p13628037/s54872639/f845ad66-716c76dd-da718912-8b0ff596-b30d25cb.jpg",
    "files/p13/p13694166/s55805720/df57d48e-566984d2-fbe39e6e-0c68fc55-380f1217.jpg",
    "files/p14/p14656449/s56499991/67a4e5cd-50d441d3-42294f94-363ac071-17cfc342.jpg",
    "files/p14/p14690121/s50057475/34ad06d4-475863f1-f3712cec-783c3b99-308cf886.jpg",
    "files/p17/p17405329/s55291678/283084bb-0f4994a7-d7622b32-d7f18f75-d8dde41b.jpg",
    "files/p17/p17490145/s55463370/803fcbd8-2e38a5c7-cca96a50-ce5660cb-83ecc3a1.jpg",
    "files/p18/p18459824/s52186356/2eb68b2f-0742cb3d-b8c9db5b-9c9d74f9-69e31cc1.jpg",
    "files/p18/p18690742/s56844948/f4f63777-6a8a6b60-d6cb0718-9256537a-2ca41831.jpg",
]
# make sure the temporary download directory exists
os.makedirs("tmp-check-diff", exist_ok=True)
for image in image_paths:
    # download to a temporary directory
    subprocess.check_output([
        "gcloud", "storage", "--billing-project", "<project-name>", "cp",
        f"gs://mimic-cxr-jpg-2.1.0.physionet.org/{image}", f"tmp-check-diff/{os.path.basename(image)}"
    ])
    # check the downloaded version against the one stored locally already;
    # diff exits non-zero if the files differ, which makes check_output raise
    subprocess.check_output([
        "diff",
        f"tmp-check-diff/{os.path.basename(image)}", image
    ])
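As a side note, the comparison step does not have to shell out to `diff`. A rough sketch using only the standard library is below; `differing_copies` is a hypothetical helper, and it assumes the fresh copies were already downloaded into a flat directory such as `tmp-check-diff/`, named by basename as in the script above:

```python
import filecmp
import os


def differing_copies(relative_paths, fresh_dir, local_root="."):
    """Return the paths whose freshly downloaded copy differs from the local one."""
    differing = []
    for relative_path in relative_paths:
        fresh = os.path.join(fresh_dir, os.path.basename(relative_path))
        local = os.path.join(local_root, relative_path)
        # shallow=False forces a byte-for-byte comparison instead of
        # trusting the os.stat() signature (type, size, mtime)
        if not filecmp.cmp(fresh, local, shallow=False):
            differing.append(relative_path)
    return differing
```

Unlike the `diff` call, which raises on the first mismatch, this collects every differing path, so an empty list from `differing_copies(image_paths, "tmp-check-diff")` means all fresh copies are identical to the local ones.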
It's odd because the SHA256SUMs are calculated automatically by PhysioNet when publishing the files, so I'm not sure why they would be wrong. Could be because we had some custom workarounds for MIMIC-CXR. I will raise with some of the PhysioNet team, thanks!
OK, I think something went wrong with our GCP upload because they were simply different on that bucket. Can you redownload them from the GCP bucket and check again? It should be fixed now.