
EOFError: Ran out of input

Open shidaide2019 opened this issue 4 years ago • 6 comments

Sorry to disturb you, but I hit a strange bug while extracting a Wikipedia bz2 dump. My Python version is 3.8 and my Anaconda version is 2020.11. I installed wikiextractor (3.0.4) with pip, and when I ran the command

python -m wikiextractor.WikiExtractor -o extracted enwiki-20201220-pages-articles-multistream.xml.bz2

After running for about 50 minutes, it produced this error message:

    Traceback (most recent call last):
      File "C:\Users\win\Anaconda3\lib\runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "C:\Users\win\Anaconda3\lib\runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "C:\Users\win\Anaconda3\lib\site-packages\wikiextractor\WikiExtractor.py", line 621, in <module>
        main()
      File "C:\Users\win\Anaconda3\lib\site-packages\wikiextractor\WikiExtractor.py", line 616, in main
        process_dump(input_file, args.templates, output_path, file_size,
      File "C:\Users\win\Anaconda3\lib\site-packages\wikiextractor\WikiExtractor.py", line 357, in process_dump
        reduce.start()
      File "C:\Users\win\Anaconda3\lib\multiprocessing\process.py", line 121, in start
        self._popen = self._Popen(self)
      File "C:\Users\win\Anaconda3\lib\multiprocessing\context.py", line 224, in _Popen
        return _default_context.get_context().Process._Popen(process_obj)
      File "C:\Users\win\Anaconda3\lib\multiprocessing\context.py", line 327, in _Popen
        return Popen(process_obj)
      File "C:\Users\win\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
        reduction.dump(process_obj, to_child)
      File "C:\Users\win\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
        ForkingPickler(file, protocol).dump(obj)
    TypeError: cannot pickle '_io.TextIOWrapper' object
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\win\Anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
        exitcode = _main(fd, parent_sentinel)
      File "C:\Users\win\Anaconda3\lib\multiprocessing\spawn.py", line 126, in _main
        self = reduction.pickle.load(from_parent)
    EOFError: Ran out of input

I'm looking forward to your answer.

shidaide2019 avatar Jan 01 '21 08:01 shidaide2019

At first I thought it might be caused by multiprocessing, so I set the number of processes to 1, but the error is the same as with multiple processes.

shidaide2019 avatar Jan 01 '21 08:01 shidaide2019

I hit the same problem. Have you solved it?

runpingzhong avatar Jan 19 '21 03:01 runpingzhong

I think there is a problem on Windows with passing file descriptors across threads. Fixing it would require some rewriting so that the descriptors are opened within the threads themselves.

attardi avatar Feb 11 '21 11:02 attardi
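As an illustration of the diagnosis above (not code from wikiextractor): the traceback shows the failure at the point where the `Process` object is pickled for the child, and an open text file is one of the things `pickle` refuses to serialize. The sketch below reproduces that check and shows the general workaround of passing a path and opening the file inside the child. The names `can_pickle` and `write_lines` are hypothetical helpers invented for this example.

```python
import multiprocessing as mp
import os
import pickle
import tempfile


def can_pickle(obj):
    """Return True if obj can be pickled, as spawn requires for Process arguments."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False


def write_lines(path, lines):
    """Worker target: the output file is opened in the child, not in the parent."""
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "out.txt")

    with open(path, "w", encoding="utf-8") as handle:
        # An open text file (_io.TextIOWrapper) cannot be pickled.
        print("file object picklable:", can_pickle(handle))
    # A plain path string can be pickled without trouble.
    print("path string picklable:", can_pickle(path))

    # Use spawn explicitly, the only start method Windows offers.
    ctx = mp.get_context("spawn")
    worker = ctx.Process(target=write_lines, args=(path, ["first", "second"]))
    worker.start()
    worker.join()
    print("child exit code:", worker.exitcode)
```

Whether this pattern maps cleanly onto wikiextractor's own reduce process depends on how its output handling is structured, which this sketch does not attempt to model.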

I have the same problem on Windows, too. How can it be solved?

ArlanCooper avatar Jul 30 '21 00:07 ArlanCooper

Same error. And of course it takes 30 minutes or so to even reach the failure point in the code.

number435398 avatar Apr 06 '23 02:04 number435398

This issue isn't easily solvable, because wikiextractor relies on the multiprocessing module and on the fork mechanism to create new processes, whereas Windows only offers spawn.
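To check which start methods a given interpreter offers, something like the following small sketch can be used. `describe_start_methods` is a hypothetical helper name; the `multiprocessing` calls are from the standard library.

```python
import multiprocessing as mp
import sys


def describe_start_methods():
    """Report the platform and the process start methods Python offers on it."""
    available = mp.get_all_start_methods()
    return {
        "platform": sys.platform,
        "available": available,
        "fork_supported": "fork" in available,
    }


if __name__ == "__main__":
    info = describe_start_methods()
    print(info)
    if not info["fork_supported"]:
        # Under spawn, everything handed to a child process must be picklable.
        print("fork is unavailable here; child processes are started with spawn")
```

On Windows the list is expected to contain only `spawn`, while Linux and WSL also list `fork`.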

Your best option is to use a WSL environment if you want to use the officially distributed package. If you have to stick to Windows, you can try my quick patch for Windows support: https://github.com/attardi/wikiextractor/pull/315

However, this patch basically moves all the logic from multiprocessing to multithreading, which performs far worse than multiprocessing because of the GIL: the slowdown is almost linear in your CPU count. That said, at least it works. Extraction speed is about 150 articles/s.

rgryta avatar Jun 04 '23 12:06 rgryta
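The threads-versus-processes trade-off described in the comment above can be observed with a rough sketch like the one below. It is not taken from the patch: `count_primes` is a made-up CPU-bound stand-in for per-article work, `run_in_pool` is a hypothetical helper, and the timings depend on the machine and on running a standard CPython build with the GIL.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound stand-in for per-article processing: count primes below limit."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_in_pool(executor_cls, limits, workers=4):
    """Map count_primes over limits with the given executor; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(count_primes, limits))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    limits = [50_000] * 8
    for executor_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        results, seconds = run_in_pool(executor_cls, limits)
        print(f"{executor_cls.__name__}: {seconds:.2f}s")
```

With the GIL, the thread pool is expected to take roughly as long as running the tasks one after another, while the process pool can spread them across cores.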