
Is this project abandoned?

Open johann-petrak opened this issue 1 year ago • 7 comments

It seems that while there are many open issues, many active users, and even pull requests fixing issues, there has been no activity on this project for some time.

Is this project abandoned? Could the original authors make a statement about what, from their point of view, the best way forward could be?

johann-petrak avatar Jun 26 '24 12:06 johann-petrak

I also think wikiextractor is an important tool for flexible Wikipedia processing. It definitely should be maintained.

simon-clematide avatar Feb 26 '25 07:02 simon-clematide

Hopefully it will be maintained, or at least a good fork of it will be. I have been able to get past a couple of issues by using a different fork, but since this is the original repository it would be great for it to work as intended.

bdog18 avatar May 13 '25 19:05 bdog18

❌ What Went Wrong?

The error:

```
ValueError: cannot find context for 'fork'
```

…occurs because wikiextractor is trying to use the 'fork' multiprocessing start method, which doesn't exist on Windows (only on Linux/macOS). Windows uses 'spawn' instead.
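To see the root cause in isolation, here is a minimal standard-library sketch (plain Python, not wikiextractor code) that reproduces the same error on Windows:

```python
import multiprocessing

if __name__ == '__main__':
    # Windows only supports 'spawn'; Linux/macOS also offer 'fork' (and 'forkserver').
    print(multiprocessing.get_all_start_methods())

    multiprocessing.get_context('spawn')  # available on every platform
    multiprocessing.get_context('fork')   # on Windows: ValueError: cannot find context for 'fork'
```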

✅ Fix It With This Simple Change

We just need to tell wikiextractor to run in single-process mode, or invoke it in a Windows-friendly way.

✅ Option 1: Run It in Single-Threaded Mode (no multiprocessing)

```bash
wikiextractor --json --no-templates --processes 1 --output extracted enwiki-latest-pages-articles.xml.bz2
```

This avoids multiprocessing entirely and runs everything in one process (a bit slower, but it sidesteps the 'fork' error).
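If the wikiextractor console script isn't on your PATH (common on Windows), the module form should behave the same. This is a sketch assuming a standard pip install of the package:

```bash
python -m wikiextractor.WikiExtractor --json --no-templates --processes 1 --output extracted enwiki-latest-pages-articles.xml.bz2
```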

✅ Option 2: Use Python to Call the Extractor (Recommended for Control)

If you're cool with running a quick Python script instead of the CLI, here's how:

Open a new .py file and paste this:

```python
import sys

from wikiextractor.WikiExtractor import main

# Guard the entry point: on Windows, the 'spawn' start method re-imports this
# module in each worker, so an unguarded main() call would re-run the script.
if __name__ == '__main__':
    sys.argv = [
        'WikiExtractor.py',
        '--json',
        '--no-templates',
        '--processes', '1',  # single process, the same workaround as Option 1
        '--output', 'extracted',
        'enwiki-latest-pages-articles.xml.bz2',
    ]
    main()
```

Save it as run_extractor.py.

Run it like this:

```bash
py -3.10 run_extractor.py
```

Inside your extracted/ folder you'll find:

- Lots of .json or .txt files
- Each contains articles in a clean, readable format
- Perfect for indexing, chunking, and embedding into an LLM search engine

I'm trying to use AI to help me get around this issue so I can train a local model on the wiki dump. I can't switch to 'spawn' without errors; I tried a direct patch and broke the AI, haha.
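Once extraction finishes, here's a minimal sketch for consuming the --json output, assuming the usual layout of one JSON object per line in shard files like extracted/AA/wiki_00 (the paths and field names here match what wikiextractor has produced for me, but treat them as assumptions):

```python
import json
from pathlib import Path

# Walk every extracted shard (extracted/AA/wiki_00, extracted/AB/wiki_01, ...)
# and parse one article per line.
for path in sorted(Path('extracted').glob('**/wiki_*')):
    with open(path, encoding='utf-8') as f:
        for line in f:
            article = json.loads(line)  # keys: id, url, title, text
            print(article['title'], len(article['text']))
```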

FadedSocks avatar Jun 06 '25 19:06 FadedSocks

@bdog18 good suggestion. Is there a current fork that we may be able to maintain instead?

weezymatt avatar Jun 12 '25 20:06 weezymatt