Problem with get_wiki (I think because of possible changes to wiki_extractor)

Open muralits98 opened this issue 5 years ago • 5 comments

I am trying to rerun the Vietnamese notebook (https://github.com/fastai/course-nlp/blob/master/nn-vietnamese.ipynb) and am getting a file-not-found error at

get_wiki(path,lang)

This seems to happen with any language. A manual check showed that the text directory did not contain AA\wiki_00.

I don't know what the problem here is.

FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\\.fastai\data\viwiki\text\AA\wiki_00'
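For anyone repeating the manual check, here is a minimal sketch that lists whatever the extraction step actually wrote. It assumes the text/AA/wiki_00 layout seen in the error message above; the helper names list_extracted_files and report are hypothetical and not part of nlputils.py.

```python
from pathlib import Path


def list_extracted_files(text_dir):
    """Return the extracted wiki files found under text_dir, sorted by path."""
    text_dir = Path(text_dir)
    if not text_dir.is_dir():
        return []
    # Assumption: output lands in subfolders such as AA/wiki_00,
    # as in the path from the error message.
    return sorted(p for p in text_dir.glob("*/wiki_*") if p.is_file())


def report(text_dir):
    """Print a one-line summary of what the extraction step produced."""
    files = list_extracted_files(text_dir)
    if files:
        print(f"found {len(files)} extracted file(s), first: {files[0]}")
    else:
        print(f"no extracted files under {text_dir}; the extraction step probably failed")
    return files
```

Calling report() on the viwiki text directory before get_wiki tries to move the file should show whether the extractor produced any output at all. An empty result points at the extractor call rather than at the later file move.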

muralits98 avatar Jun 03 '20 23:06 muralits98

@muralits98 hey, I'm encountering the same error. Did you manage to solve it?

prats0599 avatar Oct 06 '20 10:10 prats0599

Okay, I got it to work. I installed the package via pip and commented out the line if not (path/'wikiextractor').exists(): os.system('git clone https://github.com/attardi/wikiextractor.git') in nlputils.py. I then changed the line os.system("python wikiextractor/WikiExtractor.py... to os.system("python -m wikiextractor.WikiExtractor .. and voila! Posting it here in case anyone encounters the same problem.
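Since this workaround depends on the pip-installed package being visible to the Python that runs the notebook, a quick check along these lines may help. It is only a sketch: module_available and extractor_prefix are hypothetical helper names, and the module path wikiextractor.WikiExtractor is taken from the comment above.

```python
import importlib.util
import sys


def module_available(name):
    """True if a top-level module or package called `name` can be found."""
    return importlib.util.find_spec(name) is not None


def extractor_prefix():
    """Command prefix for the pip-installed extractor, or None if it is missing."""
    if module_available("wikiextractor"):
        # sys.executable keeps the call in the same environment as the notebook,
        # which a bare "python" on PATH does not guarantee.
        return [sys.executable, "-m", "wikiextractor.WikiExtractor"]
    return None
```

If extractor_prefix() returns None, the package is probably installed into a different environment than the one the notebook uses. Note that a cloned wikiextractor folder in the working directory may also be picked up by this check, so it is not a definitive test.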

prats0599 avatar Oct 07 '20 10:10 prats0599

@prats0599 I followed your suggestions, yet I still get the same error: No such file or directory: '/root/.fastai/data/frwiki/text/AA/wiki_00'. Any other hints?

royam0820 avatar Apr 13 '21 10:04 royam0820

I also faced the same problem with the ru language: No such file or directory: '/root/.fastai/data/ruwiki/text/AA/wiki_00' -> '/root/.fastai/data/ruwiki/ruwiki'. Can anyone help?

uribah avatar May 08 '21 15:05 uribah

Hi mates, this one should work. We need to update the options passed to WikiExtractor in the get_wiki(path,lang) function in nlputils.py.

From:
os.system("python wikiextractor/WikiExtractor.py --processes 4 --no_templates " + f"--min_text_length 1800 --filter_disambig_pages --log_file log -b 100G -q {xml_fn}")

To:
os.system("python -m wikiextractor.wikiextractor.WikiExtractor --no-templates -b 100G -q " + f"{xml_fn}")

This is due to the argument changes in https://github.com/attardi/wikiextractor/blob/master/wikiextractor/WikiExtractor.py. For example, --no_templates changed to --no-templates. In addition, other options (such as --min_text_length, --filter_disambig_pages, and --log_file) no longer exist.

I have made a PR at https://github.com/fastai/course-nlp/pull/55
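As a sketch only (this is not the code from the PR), the updated call could also be built as an argument list and run with subprocess instead of os.system. os.system does not raise when the command fails, which may be why a rejected option only shows up later as the FileNotFoundError on wiki_00. The names build_extract_cmd and run_extract are hypothetical; the module path and flags are the ones quoted above.

```python
import subprocess
import sys


def build_extract_cmd(xml_fn, module="wikiextractor.wikiextractor.WikiExtractor"):
    """Build the updated WikiExtractor call as an argument list."""
    return [
        sys.executable, "-m", module,
        "--no-templates",  # renamed from --no_templates
        "-b", "100G",
        "-q",
        str(xml_fn),
    ]


def run_extract(xml_fn, cwd=None):
    """Run the extractor; raises CalledProcessError on a nonzero exit code."""
    subprocess.run(build_extract_cmd(xml_fn), cwd=cwd, check=True)
```

With check=True, an unrecognized argument or a missing module stops at the extractor step with a visible error instead of failing silently. When the package is installed via pip rather than cloned, passing module="wikiextractor.WikiExtractor" (as in the earlier comment) should be the matching choice.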

danhphan avatar Jun 05 '21 08:06 danhphan