acl-anthology

[Bug report] Hash mismatch error for file

Open niranjanaunnithan opened this issue 2 years ago • 5 comments

I am encountering the following error for several PDF files while generating the anthology via the make mirror command: "ERROR Hash mismatch for file /acl-anthology/bin/../build/anthology-files/pdf/emoji/2022.emoji-1.0.pdf, downloaded from https://aclanthology.org/2022.emoji-1.0.pdf. was f30c68cc should be 36bb53fe"

niranjanaunnithan, Jul 18 '22 05:07

@niranjanaunnithan Can you elaborate? I don't fully understand the bug. The XML file has the correct hash for the PDF file https://aclanthology.org/2022.emoji-1.0.pdf

xinru1414, Jul 21 '22 01:07

Hi, we were trying to download the PDF files from acl-anthology using the make mirror command. On examining the log file, we came across the following error multiple times.

ERROR Hash mismatch for file /acl-anthology/bin/../build/anthology-files/pdf/emoji/2022.emoji-1.0.pdf, downloaded from https://aclanthology.org/2022.emoji-1.0.pdf. was f30c68cc should be 36bb53fe

We also observed that only 74234 PDF files were downloaded. Please find a screenshot of a portion of the logs.

[Screenshot of a portion of the logs]

niranjanaunnithan, Jul 21 '22 04:07

Regarding this:

"We also observed that only 74234 pdf files were downloaded"

Not all papers have PDFs that can be downloaded, so this is totally fine.

Regarding the hash mismatches: this is due to either an outdated checkout of the git repository (the hashes have changed in the meantime), some network problem (is this reproducible when you run it again? The script should only retry the files that failed before), or a problem on the server side that needs to be addressed.

I checked the emoji one and anthology_utils.compute_hash_from_file('/tmp/2022.emoji-1.0.pdf') returns 'f30c68cc' (as in your comment) but the XML file also has this hash: https://github.com/acl-org/acl-anthology/blob/7e309c89b81af82cc47194a48b63d31487c69766/data/xml/2022.emoji.xml#L17

So, my guess is that your local repository is outdated -- the hash in our data was changed 11 days ago because the PDFs were updated.
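For anyone who wants to repeat this comparison locally, here is a minimal sketch. It rests on two assumptions that are not confirmed in this thread: that the eight-character hash is a CRC32 checksum written as lowercase hex, and that the XML stores it in a `hash` attribute on a `<url>` element whose text is the file identifier. `crc32_hex` and `expected_hash` are hypothetical helpers, not part of `anthology_utils`, and the XML fragment is a trimmed, made-up illustration.

```python
import zlib
import xml.etree.ElementTree as ET
from typing import Optional


def crc32_hex(data: bytes) -> str:
    """Return the CRC32 checksum of data as eight lowercase hex digits.

    Assumption: this mirrors how the anthology hash is derived.
    """
    return f"{zlib.crc32(data) & 0xFFFFFFFF:08x}"


def expected_hash(xml_text: str, file_id: str) -> Optional[str]:
    """Return the hash attribute of the <url> element whose text is file_id.

    Assumption: the hash lives on <url hash="...">file_id</url>.
    """
    root = ET.fromstring(xml_text)
    for url in root.iter("url"):
        if (url.text or "").strip() == file_id:
            return url.get("hash")
    return None


# Hypothetical, heavily trimmed XML fragment for illustration only.
SAMPLE_XML = """
<collection id="2022.emoji">
  <volume id="1">
    <frontmatter>
      <url hash="f30c68cc">2022.emoji-1.0</url>
    </frontmatter>
  </volume>
</collection>
"""


def matches(pdf_bytes: bytes, xml_text: str, file_id: str) -> bool:
    """True if the checksum of the downloaded bytes equals the XML hash."""
    return crc32_hex(pdf_bytes) == expected_hash(xml_text, file_id)
```

To use it, read the downloaded PDF in binary mode and pass the bytes together with the text of the XML file from your checkout. If `matches` returns False against a freshly pulled XML file, the local checkout is probably not the cause.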

akoehn, Jul 21 '22 07:07

Hi. I followed the suggested steps and ran the script again after a git pull (on July 21). I am still seeing the hash mismatch error in the logs. Please find a snippet of the logs below.

Files that could not be downloaded

https://aclanthology.org/P19-2050v1.pdf

Files with checksum mismatch

https://aclanthology.org/1991.tc-1.1.pdf
https://aclanthology.org/2006.amta-panels.0.pdf
https://aclanthology.org/2006.amta-panels.1.pdf
https://aclanthology.org/2006.amta-panels.2.pdf
https://aclanthology.org/2006.amta-panels.3.pdf
https://aclanthology.org/2006.amta-panels.4.pdf
https://aclanthology.org/2006.amta-panels.5.pdf
https://aclanthology.org/2017.iwslt-1.0.pdf
https://aclanthology.org/2021.acl-long.79.pdf
https://aclanthology.org/2021.acl-srw.16.pdf
https://aclanthology.org/2021.acl-long.79v2.pdf
https://aclanthology.org/2021.americasnlp-1.pdf
https://aclanthology.org/2021.autosimtrans-1.pdf
https://aclanthology.org/2021.calcs-1.pdf
https://aclanthology.org/2021.clpsych-1.pdf
https://aclanthology.org/2021.cmcl-1.pdf
https://aclanthology.org/2021.dash-1.pdf
https://aclanthology.org/2021.deelio-1.pdf
https://aclanthology.org/2021.emnlp-main.300.pdf
https://aclanthology.org/2021.emnlp-main.409.pdf
https://aclanthology.org/2021.emnlp-main.824.pdf
https://aclanthology.org/2021.motra-1.0.pdf
https://aclanthology.org/2021.mrl-1.5v1.pdf
https://aclanthology.org/2021.mtsummit-up.pdf
https://aclanthology.org/2021.naacl-demos.pdf
https://aclanthology.org/2021.naacl-srw.pdf
https://aclanthology.org/2021.naacl-tutorials.pdf
https://aclanthology.org/2021.naacl-industry.pdf
https://aclanthology.org/2021.naacl-main.189.pdf
https://aclanthology.org/2021.nlp4if-1.pdf
https://aclanthology.org/2021.nlpmc-1.pdf
https://aclanthology.org/2021.privatenlp-1.pdf
https://aclanthology.org/2021.sdp-1.pdf
https://aclanthology.org/2021.smm4h-1.pdf
https://aclanthology.org/2021.socialnlp-1.pdf
https://aclanthology.org/2021.splurobonlp-1.pdf
https://aclanthology.org/2021.teachingnlp-1.pdf
https://aclanthology.org/2021.textgraphs-1.pdf
https://aclanthology.org/2021.trustnlp-1.pdf
https://aclanthology.org/2021.vigil-1.pdf
https://aclanthology.org/2021.wmt-1.73.pdf
https://aclanthology.org/2022.acl-long.52.pdf
https://aclanthology.org/2022.iwslt-1.9.pdf
https://aclanthology.org/2022.repl4nlp-1.pdf
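In case it helps with triage, the reported and expected hashes for each affected file can be pulled out of the log with a short script. This is only a sketch: the regular expression is modelled on the single error line quoted earlier in this issue, so it assumes every mismatch line follows that exact wording. `Mismatch` and `find_mismatches` are hypothetical names, not part of the anthology scripts.

```python
import re
from typing import Iterable, List, NamedTuple


class Mismatch(NamedTuple):
    path: str      # local file path reported in the log
    url: str       # URL the file was downloaded from
    actual: str    # hash computed from the downloaded file ("was")
    expected: str  # hash the script expected ("should be")


# Pattern modelled on the one error line quoted in this issue (assumption).
PATTERN = re.compile(
    r"Hash mismatch for file (?P<path>\S+), "
    r"downloaded from (?P<url>\S+)\. "
    r"was (?P<actual>[0-9a-f]{8}) should be (?P<expected>[0-9a-f]{8})"
)


def find_mismatches(lines: Iterable[str]) -> List[Mismatch]:
    """Collect one Mismatch per log line that matches PATTERN."""
    found = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            found.append(Mismatch(**match.groupdict()))
    return found


if __name__ == "__main__":
    sample = [
        "ERROR Hash mismatch for file "
        "/acl-anthology/bin/../build/anthology-files/pdf/emoji/2022.emoji-1.0.pdf, "
        "downloaded from https://aclanthology.org/2022.emoji-1.0.pdf. "
        "was f30c68cc should be 36bb53fe",
    ]
    for item in find_mismatches(sample):
        print(item.url, item.actual, item.expected)
```

Passing an open log file to `find_mismatches` gives a list that can be compared against the hashes in the current XML files, which would show whether the expected values come from an old checkout.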

niranjanaunnithan, Jul 26 '22 18:07

Hi. Is there any update on this issue? I am still experiencing this even with a fresh clone of the latest master branch.

niranjanaunnithan, Aug 08 '22 05:08