Is this project abandoned?
It seems that while there are many open issues, many active users, and even pull requests fixing issues, there has been no activity on this project for some time.
Is this project abandoned? Could the original authors say what they see as the best way forward?
I also think wikiextractor is an important tool for flexible Wikipedia processing. It definitely should be maintained.
Hopefully it can be maintained, or at least a good fork can be. I have been able to get past a couple of issues by using a different fork, but since this is the original, it would be great for it to work as intended.
❌ What Went Wrong? The error:
    ValueError: cannot find context for 'fork'
…is because wikiextractor is trying to use the 'fork' multiprocessing start method, which doesn't exist on Windows (only on Linux/macOS). Windows uses 'spawn' instead.
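To see which start methods a given machine actually supports, a minimal sketch using only the standard library is shown here. The helper name `pick_start_method` is made up for illustration and is not part of wikiextractor; the sketch only shows why requesting 'fork' fails on Windows, it does not patch the extractor.

```python
import multiprocessing as mp


def pick_start_method(preferred="fork"):
    """Return `preferred` if this platform supports it, else the platform default.

    Hypothetical helper for illustration; not part of wikiextractor.
    """
    available = mp.get_all_start_methods()
    if preferred in available:
        return preferred
    # The first entry of get_all_start_methods() is the platform default
    # (on Windows the list is just ['spawn']).
    return available[0]


if __name__ == "__main__":
    print("supported start methods:", mp.get_all_start_methods())
    print("chosen start method:", pick_start_method())

    # Asking for a method the platform lacks raises the same kind of error
    # reported above: ValueError: cannot find context for '...'
    try:
        mp.get_context("no-such-method")
    except ValueError as exc:
        print("error:", exc)
```

On Windows the first line should list only 'spawn', which is why a hard-coded request for the 'fork' context cannot succeed there.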
✅ Fix It With This Simple Change: We just need to tell wikiextractor to run in single process mode, or manually choose a Windows-friendly mode.
✅ Option 1: Run it in Single-Process Mode (no extra worker processes)
    wikiextractor --json --no-templates --processes 1 --output extracted enwiki-latest-pages-articles.xml.bz2
This is meant to avoid multiprocessing by running everything through one process (a bit slower). It will not help if the installed version still requests the 'fork' context when --processes is 1.
✅ Option 2: Use Python to Call the Extractor (Recommended for Control)
If you're comfortable running a quick Python script instead of the CLI, here's how:
Create a new .py file and paste this:
    import sys
    from wikiextractor.WikiExtractor import main

    # The guard keeps main() from running again if a child process
    # re-imports this script (which happens under 'spawn' on Windows).
    if __name__ == "__main__":
        sys.argv = [
            'WikiExtractor.py',
            '--json',
            '--no-templates',
            '--output', 'extracted',
            'enwiki-latest-pages-articles.xml.bz2',
        ]
        main()
Save it as run_extractor.py
Run it like this:
    py -3.10 run_extractor.py
Inside your extracted/ folder:
Many output files containing the extracted articles (JSON records when --json is used)
Each contains articles in a clean, readable format
Perfect for indexing, chunking, and embedding into an LLM search engine.
I'm trying to use AI to help me get around this issue so I can train a local model on the wiki dump. I can't change to "spawn" without errors. I tried a direct patch and broke the AI, haha.
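For anyone who does get output into the extracted/ folder mentioned above, a minimal sketch for reading it back follows. It assumes the --json output is one JSON object per line, with keys such as "title" and "text", in files nested under subdirectories of the output folder; those details are assumptions and may differ between versions and forks. The function name `iter_articles` is made up for illustration.

```python
import json
from pathlib import Path


def iter_articles(root):
    """Yield one dict per article from every file under `root`.

    Assumes each non-empty line is a standalone JSON object
    (an assumption about the --json output format).
    """
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    out_dir = Path("extracted")
    if out_dir.is_dir():
        # Print the title of the first article only, as a quick sanity check.
        for article in iter_articles(out_dir):
            print(article.get("title"))
            break
```

Reading lazily with a generator keeps memory use flat, which matters for a full English Wikipedia dump.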
@bdog18 good suggestion. Is there a current fork that we may be able to maintain instead?