wikiextractor
cannot serialize/pickle '_io.TextIOWrapper' object
The input file is https://dumps.wikimedia.org/kowiki/latest/kowiki-latest-pages-articles.xml.bz2. The environment is as follows:
- Windows 10 21H1 (build 19043.1165)
- tested with two Python versions, both in Windows Terminal
- Python 3.7.4 on PowerShell ("Env 1")
- Python 3.9.5 on Anaconda PowerShell ("Env 2")
- with both Python versions, the command line was
python -m wikiextractor.WikiExtractor ..\assets\kowiki-latest-pages-articles.xml.bz2 -o ..\assets\kowiki-dump\
Output at Env 1:
PS C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4> python -m wikiextractor.WikiExtractor ..\assets\kowiki-latest-pages-articles.xml.bz2 -o ..\assets\kowiki-dump\
INFO: Preprocessing '..\assets\kowiki-latest-pages-articles.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 400000 pages
INFO: Preprocessed 500000 pages
INFO: Preprocessed 600000 pages
INFO: Preprocessed 700000 pages
INFO: Preprocessed 800000 pages
INFO: Preprocessed 900000 pages
INFO: Preprocessed 1000000 pages
INFO: Preprocessed 1100000 pages
INFO: Preprocessed 1200000 pages
INFO: Preprocessed 1300000 pages
INFO: Preprocessed 1400000 pages
INFO: Preprocessed 1500000 pages
INFO: Preprocessed 1600000 pages
INFO: Loaded 56777 templates in 291.7s
INFO: Starting page extraction from ..\assets\kowiki-latest-pages-articles.xml.bz2.
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4\wikiextractor\WikiExtractor.py", line 621, in <module>
main()
File "C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4\wikiextractor\WikiExtractor.py", line 617, in main
args.compress, args.processes)
File "C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4\wikiextractor\WikiExtractor.py", line 357, in process_dump
reduce.start()
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot serialize '_io.TextIOWrapper' object
PS C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4> Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\spawn.py", line 99, in spawn_main
new_handle = reduction.steal_handle(parent_pid, pipe_handle)
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\reduction.py", line 87, in steal_handle
_winapi.DUPLICATE_SAME_ACCESS | _winapi.DUPLICATE_CLOSE_SOURCE)
PermissionError: [WinError 5] 액세스가 거부되었습니다
Output at Env 2:
(DGAIS2021) PS C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4> python -m wikiextractor.WikiExtractor ..\assets\kowiki-latest-pages-articles.xml.bz2 -o ..\assets\kowiki-dump\
INFO: Preprocessing '..\assets\kowiki-latest-pages-articles.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 500000 pages
INFO: Preprocessed 600000 pages
INFO: Preprocessed 700000 pages
INFO: Preprocessed 800000 pages
INFO: Preprocessed 900000 pages
INFO: Preprocessed 1000000 pages
INFO: Preprocessed 1100000 pages
INFO: Preprocessed 1200000 pages
INFO: Preprocessed 1300000 pages
INFO: Preprocessed 1400000 pages
INFO: Preprocessed 1500000 pages
INFO: Preprocessed 1600000 pages
INFO: Loaded 56777 templates in 219.2s
INFO: Starting page extraction from ..\assets\kowiki-latest-pages-articles.xml.bz2.
Traceback (most recent call last):
File "C:\Users\User\.conda\envs\DGAIS2021\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\User\.conda\envs\DGAIS2021\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4\wikiextractor\WikiExtractor.py", line 621, in <module>
main()
File "C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4\wikiextractor\WikiExtractor.py", line 616, in main
process_dump(input_file, args.templates, output_path, file_size,
File "C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4\wikiextractor\WikiExtractor.py", line 357, in process_dump
reduce.start()
File "C:\Users\User\.conda\envs\DGAIS2021\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Users\User\.conda\envs\DGAIS2021\lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\User\.conda\envs\DGAIS2021\lib\multiprocessing\context.py", line 327, in _Popen
return Popen(process_obj)
File "C:\Users\User\.conda\envs\DGAIS2021\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\User\.conda\envs\DGAIS2021\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle '_io.TextIOWrapper' object
(DGAIS2021) PS C:\Users\User\Downloads\pycharm\7-speech-to-text\wikiextractor-3.0.4> Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\User\.conda\envs\DGAIS2021\lib\multiprocessing\spawn.py", line 107, in spawn_main
new_handle = reduction.duplicate(pipe_handle,
File "C:\Users\User\.conda\envs\DGAIS2021\lib\multiprocessing\reduction.py", line 79, in duplicate
return _winapi.DuplicateHandle(
PermissionError: [WinError 5] 액세스가 거부되었습니다
In both outputs, the final PermissionError message reads "Access is denied."
None of the following helped:
- using a PowerShell window with admin permissions
- not setting an output folder
- replacing the BZ2 with the single XML file extracted from it
- running python setup.py install and trying again
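None of these attempts changes what the parent process tries to send to the worker it starts, and the tracebacks show that step failing on an open text file. As a general pattern (not wikiextractor's actual code; `count_lines`, `_worker`, and `count_lines_in_child` are hypothetical names used only for illustration), a child process can be given a file path and open the file itself, so that only picklable values cross the process boundary:

```python
import multiprocessing
import os


def count_lines(path):
    """Open the file inside the current process and count its lines."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def _worker(path, results):
    """Child-side entry point: receives a path string, never an open file."""
    results.put(count_lines(path))


def count_lines_in_child(path):
    """Run count_lines in a spawn-mode child, passing only picklable arguments."""
    ctx = multiprocessing.get_context("spawn")  # same start method Windows uses
    results = ctx.Queue()
    # os.fspath turns pathlib.Path objects into plain strings before pickling.
    worker = ctx.Process(target=_worker, args=(os.fspath(path), results))
    worker.start()
    total = results.get()
    worker.join()
    return total
```

When trying this from a script, call `count_lines_in_child(...)` under an `if __name__ == "__main__":` guard, because a spawned child re-imports the main module.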
The two outputs are almost identical. The most significant difference, I think, is TypeError: cannot serialize '_io.TextIOWrapper' object from Env 1 vs. TypeError: cannot pickle '_io.TextIOWrapper' object from Env 2.
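The two messages appear to describe the same failure, worded differently by the two Python versions. A minimal sketch that reproduces it outside wikiextractor, using the same `ForkingPickler` that appears in the tracebacks (the helper name `try_pickle_text_handle` is made up for this example):

```python
import io
import os
import tempfile
from multiprocessing.reduction import ForkingPickler


def try_pickle_text_handle():
    """Return the error message raised when a text-mode file handle is pickled."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as handle:
            # open() in text mode returns an _io.TextIOWrapper.
            assert isinstance(handle, io.TextIOWrapper)
            try:
                ForkingPickler.dumps(handle)
            except TypeError as exc:
                return str(exc)
        return None  # only reached if pickling unexpectedly succeeded
    finally:
        os.remove(path)


print(try_pickle_text_handle())
```

The exact wording of the printed message depends on the Python version, as the Env 1 and Env 2 outputs show, but in each case it names `_io.TextIOWrapper`.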
Works on Linux:
- Ubuntu 20.04.2 LTS 64-bit
- Python 3.8.5
I have encountered the same issue.
- Windows 10 Home 20H2 (build 19042.1110)
- Python 3.9.7 via scoop
- command line was
wikiextractor .\jawiki-20210901-pages-articles6.xml-p4307948p4444230.bz2
(base) PS > wikiextractor .\jawiki-20210901-pages-articles6.xml-p4307948p4444230.bz2
INFO: Preprocessing '.\jawiki-20210901-pages-articles6.xml-p4307948p4444230.bz2' to collect template definitions: this m
INFO: Loaded 3237 templates in 17.9s
Traceback (most recent call last):
File "C:\Users\skytomo\scoop\apps\python\current\Scripts\wikiextractor-script.py", line 33, in <module>
sys.exit(load_entry_point('wikiextractor==3.0.5', 'console_scripts', 'wikiextractor')())
File "C:\Users\skytomo\scoop\apps\python\current\lib\site-packages\wikiextractor-3.0.5-py3.9.egg\wikiextractor\WikiExtractor.py", line 636, in main
File "C:\Users\skytomo\scoop\apps\python\current\lib\site-packages\wikiextractor-3.0.5-py3.9.egg\wikiextractor\WikiExtractor.py", line 364, in process_dump
File "C:\Users\skytomo\scoop\apps\python\current\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Users\skytomo\scoop\apps\python\current\lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\skytomo\scoop\apps\python\current\lib\multiprocessing\context.py", line 327, in _Popen
return Popen(process_obj)
File "C:\Users\skytomo\scoop\apps\python\current\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\skytomo\scoop\apps\python\current\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle '_io.TextIOWrapper' object
(base) PS D:\skytomo\Documents\何らかのディレクトリ> Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\skytomo\scoop\apps\python\current\lib\multiprocessing\spawn.py", line 107, in spawn_main
new_handle = reduction.duplicate(pipe_handle,
File "C:\Users\skytomo\scoop\apps\python\current\lib\multiprocessing\reduction.py", line 79, in duplicate
return _winapi.DuplicateHandle(
PermissionError: [WinError 5] アクセスが拒否されました。
Same error on a Mac with Python 3.8, so this is not a Windows-only issue.
Same error here too with macOS Big Sur 20G165 and Python 3.8.11.
Works fine with macOS Big Sur 20G165 and Python 3.7.11.
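The split between Python 3.7 and 3.8 on macOS is consistent with a documented change in `multiprocessing`: since Python 3.8 the default start method on macOS is `spawn`, which is also the only method available on Windows, while the Linux setup reported above defaulted to `fork`. Under `spawn` the `Process` object is pickled before being handed to the child, which is the step that fails in the tracebacks. This likely explains the pattern of reports, although nobody in the thread has confirmed it. A small sketch to check which start method a given environment uses (`describe_start_method` is a made-up helper name):

```python
import multiprocessing
import platform
import sys


def describe_start_method():
    """Collect the facts that decide whether Process objects get pickled."""
    return {
        "platform": platform.system(),
        "python": "%d.%d" % sys.version_info[:2],
        "default_start_method": multiprocessing.get_start_method(),
        "available_start_methods": multiprocessing.get_all_start_methods(),
    }


if __name__ == "__main__":
    for key, value in describe_start_method().items():
        print(f"{key}: {value}")
```

If `fork` is listed as available (macOS and Linux, not Windows), calling `multiprocessing.set_start_method("fork")` before any process is created is one thing to try. That is an untested suggestion here, not a confirmed fix.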