wikiextractor
Allow wikiextractor to leave out certain page IDs
I recently ran into a problem: after several hours of processing, wikiextractor threw an error. I changed a few lines to filter out the page IDs that had already been processed, so that I could continue where I left off.

There is currently at least one other issue that would benefit from such a solution (and is very similar to mine): #136

To apply the solution, I would do the following:

Add another input parameter, --page_ids, which defaults to [0,infinity] and can be adjusted using the format [start_id,end_id].
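To make the proposal concrete, here is a minimal sketch of how such a parameter could be parsed and applied. It is not the code currently in wikiextractor: the helper names `parse_page_id_range`, `in_range`, and `build_parser` are hypothetical, and I am assuming that both bounds are inclusive and that the literal word "infinity" means "no upper bound".

```python
import argparse
import math


def parse_page_id_range(text):
    """Parse '[start_id,end_id]' into a (start, end) tuple.

    Assumption: 'infinity' (or 'inf') stands for an open upper bound.
    """
    stripped = text.strip()
    if not (stripped.startswith("[") and stripped.endswith("]")):
        raise argparse.ArgumentTypeError("expected the format [start_id,end_id]")
    parts = stripped[1:-1].split(",")
    if len(parts) != 2:
        raise argparse.ArgumentTypeError("expected exactly two values")

    def to_bound(value):
        value = value.strip().lower()
        if value in ("infinity", "inf"):
            return math.inf
        return int(value)

    try:
        start, end = to_bound(parts[0]), to_bound(parts[1])
    except ValueError:
        raise argparse.ArgumentTypeError("page ids must be integers or 'infinity'")
    if start > end:
        raise argparse.ArgumentTypeError("start_id must not exceed end_id")
    return start, end


def in_range(page_id, page_ids):
    """Return True if page_id lies within the inclusive (start, end) range."""
    start, end = page_ids
    return start <= int(page_id) <= end


def build_parser():
    """Hypothetical parser holding only the proposed option."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--page_ids",
        type=parse_page_id_range,
        default=(0, math.inf),  # i.e. [0,infinity]: process every page
        metavar="RANGE",
        help="only process pages whose id is in [start_id,end_id]",
    )
    return parser
```

The extraction loop would then skip any page for which `in_range(page_id, args.page_ids)` is false. To resume after a crash, one would pass the last successfully processed ID plus one as the start, for example `--page_ids "[123457,infinity]"` (quoted so the shell does not interpret the brackets).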