acl-anthology
[Bug report] Hash mismatch error for file
While generating the anthology mirror with the make mirror command, I get the following error for several PDF files: "ERROR Hash mismatch for file /acl-anthology/bin/../build/anthology-files/pdf/emoji/2022.emoji-1.0.pdf, downloaded from https://aclanthology.org/2022.emoji-1.0.pdf. was f30c68cc should be 36bb53fe"
@niranjanaunnithan Can you elaborate? I don't fully understand the bug. The XML file has the correct hash for the PDF file https://aclanthology.org/2022.emoji-1.0.pdf
Hi, we were trying to download the PDF files from acl-anthology using the make mirror command. On examining the log file, we came across the following error multiple times:
ERROR Hash mismatch for file /acl-anthology/bin/../build/anthology-files/pdf/emoji/2022.emoji-1.0.pdf, downloaded from https://aclanthology.org/2022.emoji-1.0.pdf. was f30c68cc should be 36bb53fe
We also observed that only 74234 pdf files were downloaded. Please find a screenshot of a portion of the logs.
[Image: screenshot of a portion of the logs]
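To see how many files are affected and which hashes are involved, the mismatch lines can be pulled out of the log with a small script. The following is a minimal sketch, not part of the acl-anthology tooling: the regular expression is modeled only on the single error line quoted above, and the function name `parse_mismatches` is hypothetical. Lines that do not match the pattern are skipped.

```python
import re

# Pattern modeled on the one error line quoted in this thread; other log
# formats are unknown, so anything that does not match is ignored.
MISMATCH_RE = re.compile(
    r"Hash mismatch for file (?P<path>\S+), "
    r"downloaded from (?P<url>\S+)\. "
    r"was (?P<actual>[0-9a-f]{8}) should be (?P<expected>[0-9a-f]{8})"
)


def parse_mismatches(lines):
    """Return one dict (path, url, actual, expected) per mismatch line."""
    return [m.groupdict() for line in lines if (m := MISMATCH_RE.search(line))]


if __name__ == "__main__":
    log_line = (
        "ERROR Hash mismatch for file "
        "/acl-anthology/bin/../build/anthology-files/pdf/emoji/2022.emoji-1.0.pdf, "
        "downloaded from https://aclanthology.org/2022.emoji-1.0.pdf. "
        "was f30c68cc should be 36bb53fe"
    )
    for entry in parse_mismatches([log_line]):
        print(entry["url"], entry["actual"], entry["expected"])
```

In practice the list of lines would come from reading the log file; each entry pairs the hash computed from the download ("was") with the hash the local data expects ("should be").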
Regarding this:
We also observed that only 74234 pdf files were downloaded
Not all papers have PDFs that can be downloaded, so this is totally fine.
The hash mismatches have one of three possible causes: your checkout of the git repository is not up to date (and the hashes have changed in the meantime); there was a network problem (is this reproducible when you run it again? The script should only retry the files that failed before); or there is a problem on the server side that needs to be addressed.
I checked the emoji one, and anthology_utils.compute_hash_from_file('/tmp/2022.emoji-1.0.pdf') returns 'f30c68cc' (as in your comment), but the XML file also has this hash: https://github.com/acl-org/acl-anthology/blob/7e309c89b81af82cc47194a48b63d31487c69766/data/xml/2022.emoji.xml#L17
So, my guess is that your local repository is outdated -- the hash in our data was changed 11 days ago because the PDFs were updated.
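The same check can be done by hand for any file in the mismatch list. The sketch below rests on two assumptions that this thread does not confirm: that the eight-hex-digit hashes are CRC32 checksums of the PDF bytes, and that the XML stores them in a `hash` attribute on `<url>` elements. The helper names and the sample XML fragment are hypothetical and are not the `anthology_utils` implementation.

```python
import xml.etree.ElementTree as ET
import zlib


def crc32_hex(data: bytes) -> str:
    """CRC32 of `data` as 8 lowercase hex digits (assumed hash format)."""
    return f"{zlib.crc32(data) & 0xFFFFFFFF:08x}"


def crc32_of_file(path: str) -> str:
    """CRC32 of the file at `path`, read in binary mode."""
    with open(path, "rb") as f:
        return crc32_hex(f.read())


def expected_hashes(xml_text: str) -> dict:
    """Map the text of each <url hash="..."> element to its hash attribute."""
    root = ET.fromstring(xml_text)
    return {
        el.text.strip(): el.get("hash")
        for el in root.iter("url")
        if el.get("hash") and el.text
    }


# Hypothetical fragment; the real layout of data/xml/2022.emoji.xml may differ.
SAMPLE_XML = """<collection id="2022.emoji">
  <volume id="1">
    <frontmatter>
      <url id="2022.emoji-1.0" hash="f30c68cc">2022.emoji-1.0</url>
    </frontmatter>
  </volume>
</collection>"""

if __name__ == "__main__":
    hashes = expected_hashes(SAMPLE_XML)
    print(hashes["2022.emoji-1.0"])
    # Compare against a downloaded copy, for example:
    # print(crc32_of_file("/tmp/2022.emoji-1.0.pdf") == hashes["2022.emoji-1.0"])
```

If the hash of the downloaded file matches the XML on GitHub but not the "should be" value in the log, the local data is the stale side; if it matches neither, the download itself is suspect.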
Hi. I followed the suggested steps and ran the script again after a git pull (this was done on July 21). I am still seeing the hash mismatch error in the logs. Here is a snippet of the logs:
Files that could not be downloaded
https://aclanthology.org/P19-2050v1.pdf
Files with checksum mismatch
https://aclanthology.org/1991.tc-1.1.pdf
https://aclanthology.org/2006.amta-panels.0.pdf
https://aclanthology.org/2006.amta-panels.1.pdf
https://aclanthology.org/2006.amta-panels.2.pdf
https://aclanthology.org/2006.amta-panels.3.pdf
https://aclanthology.org/2006.amta-panels.4.pdf
https://aclanthology.org/2006.amta-panels.5.pdf
https://aclanthology.org/2017.iwslt-1.0.pdf
https://aclanthology.org/2021.acl-long.79.pdf
https://aclanthology.org/2021.acl-srw.16.pdf
https://aclanthology.org/2021.acl-long.79v2.pdf
https://aclanthology.org/2021.americasnlp-1.pdf
https://aclanthology.org/2021.autosimtrans-1.pdf
https://aclanthology.org/2021.calcs-1.pdf
https://aclanthology.org/2021.clpsych-1.pdf
https://aclanthology.org/2021.cmcl-1.pdf
https://aclanthology.org/2021.dash-1.pdf
https://aclanthology.org/2021.deelio-1.pdf
https://aclanthology.org/2021.emnlp-main.300.pdf
https://aclanthology.org/2021.emnlp-main.409.pdf
https://aclanthology.org/2021.emnlp-main.824.pdf
https://aclanthology.org/2021.motra-1.0.pdf
https://aclanthology.org/2021.mrl-1.5v1.pdf
https://aclanthology.org/2021.mtsummit-up.pdf
https://aclanthology.org/2021.naacl-demos.pdf
https://aclanthology.org/2021.naacl-srw.pdf
https://aclanthology.org/2021.naacl-tutorials.pdf
https://aclanthology.org/2021.naacl-industry.pdf
https://aclanthology.org/2021.naacl-main.189.pdf
https://aclanthology.org/2021.nlp4if-1.pdf
https://aclanthology.org/2021.nlpmc-1.pdf
https://aclanthology.org/2021.privatenlp-1.pdf
https://aclanthology.org/2021.sdp-1.pdf
https://aclanthology.org/2021.smm4h-1.pdf
https://aclanthology.org/2021.socialnlp-1.pdf
https://aclanthology.org/2021.splurobonlp-1.pdf
https://aclanthology.org/2021.teachingnlp-1.pdf
https://aclanthology.org/2021.textgraphs-1.pdf
https://aclanthology.org/2021.trustnlp-1.pdf
https://aclanthology.org/2021.vigil-1.pdf
https://aclanthology.org/2021.wmt-1.73.pdf
https://aclanthology.org/2022.acl-long.52.pdf
https://aclanthology.org/2022.iwslt-1.9.pdf
https://aclanthology.org/2022.repl4nlp-1.pdf
Hi. Is there any update on this issue? I am still experiencing this even with a fresh clone of the latest master branch.