wiki dump link

Open paper-revise-crypt-12 opened this issue 4 years ago • 4 comments

Hello,

Your work is very interesting. Could you please provide a link to the enwiki-20180701-pages-articles.xml.bz2 dump that you used? I couldn't find it through a Google search.

Thanks.

paper-revise-crypt-12 avatar Nov 19 '19 16:11 paper-revise-crypt-12

Hello, thanks.

It seems https://dumps.wikimedia.org/ does not keep copies of old dumps. I have added a link to my local copy of the mentioned dumps to the README.md. Alternatively, if the format has not changed, the current Wikipedia dump should also work.

aghie avatar Nov 19 '19 17:11 aghie

Thanks for your reply!

I'm now trying to run build_db.py on the nested directory of files produced by running WikiExtractor.py on your wiki dump. However, I keep getting this error:

"UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 27: invalid start byte"

Could you please let me know if you know how to fix this?

paper-revise-crypt-12 avatar Nov 20 '19 00:11 paper-revise-crypt-12

Could you specify which dump you are using (Spanish or English) and the commands you are executing, so I can try to reproduce this error? I am not getting it in my current virtualenv.

Also, have you checked that you are using the correct versions of Python, the dependencies, and DrQA, as specified in the README.md and install.sh files? (Note that this error comes from DrQA.)
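
To narrow down which file triggers the UnicodeDecodeError, a small diagnostic along these lines might help. It is only a sketch: `find_non_utf8_files` is a hypothetical helper written for this thread, not part of DrQA or this repository, and it assumes the WikiExtractor output is a directory tree of plain files that are expected to be UTF-8 text.

```python
import os


def find_non_utf8_files(root):
    """Return (path, error) pairs for files under `root` that fail UTF-8 decoding.

    Hypothetical helper for diagnosis only; it does not modify any file.
    """
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Read raw bytes so that decoding is under our control.
            with open(path, "rb") as fh:
                data = fh.read()
            try:
                data.decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start is the offset of the first undecodable byte.
                bad.append((path, err))
    return bad


def describe(bad):
    """Format the results as one human-readable line per offending file."""
    return [
        f"{path}: byte {err.object[err.start]:#04x} at position {err.start} ({err.reason})"
        for path, err in bad
    ]
```

Calling `describe(find_non_utf8_files("path/to/extracted"))`, where the path is a placeholder for the WikiExtractor output directory, lists every file that is not valid UTF-8 together with the offending byte and its position. Any file it reports is either not text at all or is stored in a different encoding, which would at least show whether the problem is in the data or in the environment.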

aghie avatar Nov 20 '19 14:11 aghie

I'm using the English wiki dump that you uploaded. Yes, I've checked all the dependencies and they are the same.

Thanks.

paper-revise-crypt-12 avatar Nov 20 '19 22:11 paper-revise-crypt-12