wp2txt
A command-line toolkit to extract text content and category data from Wikipedia dump files
Version: 0.9.1 Hi, I found that some extracted titles are wrong, which seems to happen occasionally. To reproduce the bug, run `wp2txt` twice as shown below. (The dump file I used is...
I am getting the error below and am not sure what the issue is. c:\Users\gopal\Downloads>wp2txt -i enwiki-20190820.bz2 -o wikitxt [DEPRECATION] This gem has been renamed to optimist and will no longer be supported. Please switch to...
Can we extract only one page, or a few specified pages, instead of processing millions of pages?
I am using a Google Cloud machine, so I would prefer not to use up too much disk space with Docker. I am running CentOS 8.
Hi, I'm getting a segmentation fault when extracting enwiki. CPU: ```processor : 31 vendor_id : AuthenticAMD cpu family : 25 model : 33 model name : AMD Ryzen 9 5950X 16-Core...