wikiextractor
EOFError: Ran out of input
Sorry to disturb you, but I ran into a weird bug while extracting a wiki bz2 dump. My Python version is 3.8 and my Anaconda version is 2020.11. I installed wikiextractor (3.0.4) with pip, and when I ran the command
python -m wikiextractor.WikiExtractor -o extracted enwiki-20201220-pages-articles-multistream.xml.bz2
it printed the following error message after about 50 minutes of running:
Traceback (most recent call last):
File "C:\Users\win\Anaconda3\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\win\Anaconda3\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\win\Anaconda3\lib\site-packages\wikiextractor\WikiExtractor.py", line 621, in
I'm looking forward to your answer.
At first I thought it might be caused by multiprocessing, so I changed the number of processes to 1, but the error was the same as with multiple processes.
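(For reference: assuming the 3.0.4 CLI, which exposes a --processes flag, the single-process run looks like this:)

python -m wikiextractor.WikiExtractor --processes 1 -o extracted enwiki-20201220-pages-articles-multistream.xml.bz2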
I met the same problem. Have you solved it?
I think there is a problem on Windows with passing file descriptors across processes. It would require some rewriting in order to open the descriptors within the worker processes.
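A minimal sketch of that kind of fix (illustrative only, not wikiextractor's actual code; extract and dump.xml.bz2 are made-up names): pass the path to the worker and open the file there, since under Windows' spawn start method the worker's arguments are pickled, and open file objects cannot be pickled.

    import bz2
    from multiprocessing import Process

    def extract(path):
        # Open the file inside the worker: under Windows' spawn start method
        # the worker's arguments are pickled, and an open file object cannot
        # be pickled, but a plain path string can.
        with bz2.open(path, "rt", encoding="utf-8") as dump:
            print(sum(1 for _ in dump))  # stand-in for the real extraction

    if __name__ == "__main__":
        # Process(target=extract, args=(open(path),)) would work under fork
        # but break under spawn; passing the path works everywhere.
        worker = Process(target=extract, args=("dump.xml.bz2",))
        worker.start()
        worker.join()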
I have the same problem on Windows, too. How can it be solved?
Same error. Of course, it takes about 30 minutes just to reach the failure point in the code.
This issue isn't easily solvable, as wikiextractor relies on the multiprocessing module with the fork start method to create new processes, while Windows only supports spawn.
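For anyone unfamiliar with the difference, here is a small self-contained sketch (illustrative only, not wikiextractor code): under fork the child inherits the parent's memory, including module-level state set after import, while under spawn, the only start method on Windows, the child re-imports the module from scratch.

    import multiprocessing as mp
    import sys

    handle = None  # set in the parent only, after import

    def worker(_):
        # Under fork the child inherits the parent's memory, so handle would
        # be the open file; under spawn the module is re-imported from
        # scratch and handle is still None.
        return handle is not None

    if __name__ == "__main__":
        handle = open(sys.argv[0])
        ctx = mp.get_context("spawn")  # the only start method on Windows
        with ctx.Pool(1) as pool:
            print(pool.apply(worker, (None,)))  # prints False under spawn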
Your best option is to use a WSL environment if you want to use the officially distributed package. If you have to stick to Windows, then you can try my quick patch for Windows support: https://github.com/attardi/wikiextractor/pull/315
However, this patch basically moves all the logic from multiprocessing to multithreading, which has abysmal performance in comparison to multiprocessing because of the GIL: it is almost linearly slower depending on your CPU count. That being said, at least it works. Extraction speed is about 150 articles/s.
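To see why the thread-based patch cannot match multiprocessing, here is a small self-contained benchmark (a sketch; the burn function just stands in for CPU-bound parsing work): on CPython, threads running pure-Python code serialize on the GIL, while processes actually run in parallel.

    import time
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def burn(n):
        # Pure-Python, CPU-bound loop standing in for wiki markup parsing.
        total = 0
        for i in range(n):
            total += i * i
        return total

    def timed(executor_cls, workers=4):
        start = time.perf_counter()
        with executor_cls(max_workers=workers) as ex:
            list(ex.map(burn, [3_000_000] * workers))
        return time.perf_counter() - start

    if __name__ == "__main__":
        print("threads:   %.2fs" % timed(ThreadPoolExecutor))   # roughly serial (GIL)
        print("processes: %.2fs" % timed(ProcessPoolExecutor))  # scales with cores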