make_wikipedia.py: long running time

Open chschroeder opened this issue 4 months ago • 3 comments

Hi, thank you for sharing this outstanding repository!

I have been trying to use scripts/make_wikipedia.py to process a German Wikipedia dump:

python scripts/make_wikipedia.py --output wikipedia --lang de --date 20240201 --processes 16

Unfortunately, it has been running for several days and, judging from the output, it seems to have made very little progress, if I interpret the output correctly:

[...]
WARNING:root:Template errors in article 'Buckenhof' (395836): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Imsterberg' (395533): title(0) recursion(7929961, 0, 0)
WARNING:root:Template errors in article 'Spardorf' (395843): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Marloffstein' (395848): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Karres' (395572): title(0) recursion(7929961, 0, 0)
[...]
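
To gauge progress from output like this, the warnings can be tallied by article ID and by the first recursion value. This is only a minimal sketch, assuming the log lines have exactly the format quoted above; `summarize_warnings` is a hypothetical helper, not part of dolma or wikiextractor, and it makes no assumption about what the recursion numbers mean.

```python
import re
from collections import Counter

# Matches the warning lines quoted above (format assumed from this output only).
WARNING_RE = re.compile(
    r"WARNING:root:Template errors in article '(?P<title>.*?)' "
    r"\((?P<id>\d+)\): title\((?P<title_errs>\d+)\) "
    r"recursion\((?P<r0>\d+), (?P<r1>\d+), (?P<r2>\d+)\)"
)


def summarize_warnings(lines):
    """Count distinct articles, the highest article ID seen, and how often
    each value appears in the first recursion field."""
    article_ids = set()
    first_recursion = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match is None:
            continue  # skip anything that is not a template warning
        article_ids.add(int(match.group("id")))
        first_recursion[int(match.group("r0"))] += 1
    return {
        "articles": len(article_ids),
        "max_id": max(article_ids) if article_ids else None,
        "first_recursion": first_recursion,
    }


# The five warning lines from the excerpt above.
SAMPLE_LINES = [
    "WARNING:root:Template errors in article 'Buckenhof' (395836): title(0) recursion(96, 0, 0)",
    "WARNING:root:Template errors in article 'Imsterberg' (395533): title(0) recursion(7929961, 0, 0)",
    "WARNING:root:Template errors in article 'Spardorf' (395843): title(0) recursion(96, 0, 0)",
    "WARNING:root:Template errors in article 'Marloffstein' (395848): title(0) recursion(96, 0, 0)",
    "WARNING:root:Template errors in article 'Karres' (395572): title(0) recursion(7929961, 0, 0)",
]

if __name__ == "__main__":
    print(summarize_warnings(SAMPLE_LINES))
```

Running this over the full log at two points in time and comparing the article counts would give a rough rate, under the assumption that warnings are spread evenly over the dump.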

At this speed, it would take weeks to complete. Using htop, I can see that all processes are busy, so I don't think this is a multiprocessing problem (#58); besides, I am running it on a Linux machine.

This is likely a problem with the underlying wikiextractor library, but since there seems to be little to no activity there, I am interested in your experience with this script. Is it normal for it to take so long?

chschroeder, Feb 13 '24 10:02