mwoffliner
Improve fault tolerance for slight errors in --articleList
Having the following list of articles:
$ cat articles
/dev/zero
Device_file
Scraping it:
mw --mwUrl="https://en.wikipedia.org" --articleList=articles
Notice that the welcome page list is missing the article "Research Unix"... but the article is in the ZIM.
The reason is the space between "Research" and "Unix".
What happens is that Wikipedia's MediaWiki is configured to automatically replace the space with an underscore (but wikis are not always configured that way). Therefore the article is scraped properly, but the scraped articleId (with the underscore) no longer matches what is in the list. As a consequence, when the main page list is created from the given articleIds, nothing can be found in Redis, and the article is skipped.
The solution is probably to correct the original articleId list and replace the "faulty" articleId "Research Unix" with "Research_Unix". As far as I can see, the most elegant approach is to have getArticlesByIds() return a promise with a hash of all articles which have not been found, or which have been found under another articleId. That way, further processing could be done and this edge case could be handled properly.
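To make the idea concrete, here is a minimal TypeScript sketch of such a lookup. It is not the actual mwoffliner code: `classifyArticleIds`, `normalizeArticleId`, and `LookupResult` are hypothetical names, a plain `Set` stands in for the Redis store, and the function is synchronous for simplicity (a real `getArticlesByIds()` would presumably wrap this in a promise). It also assumes the only difference between the listed and the scraped articleId is spaces versus underscores.

```typescript
// Hypothetical sketch: none of these names come from the mwoffliner code base.

/** Outcome of matching the requested articleIds against the scraped ones. */
interface LookupResult {
  found: string[];                 // IDs found exactly as listed
  renamed: Record<string, string>; // listed ID -> ID it was actually stored under
  missing: string[];               // IDs not found under any tried variant
}

/** Assumption: the wiki replaces spaces in titles with underscores. */
function normalizeArticleId(articleId: string): string {
  return articleId.trim().replace(/ /g, "_");
}

/** `storedIds` is a stand-in for the articleIds saved in Redis. */
function classifyArticleIds(
  requestedIds: string[],
  storedIds: Set<string>,
): LookupResult {
  const result: LookupResult = { found: [], renamed: {}, missing: [] };
  for (const id of requestedIds) {
    const normalized = normalizeArticleId(id);
    if (storedIds.has(id)) {
      result.found.push(id);
    } else if (storedIds.has(normalized)) {
      // Found, but under a different articleId than the one listed.
      result.renamed[id] = normalized;
    } else {
      result.missing.push(id);
    }
  }
  return result;
}
```

With a list containing "Research Unix" and a store containing "Research_Unix", the `renamed` hash would map the former to the latter, so the caller could fix the list before building the main page, or alternatively report the entry as an error instead of silently correcting it.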
@MananJethwani I discovered this while working on #1434 (but the problem has always existed and was somehow known to me). Now that we have fixed most of the problems related to --articleList (in particular around redirect handling), maybe we could have a look at this one?
@kelson42 sure we can start working on this issue as well in some time
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
Meanwhile, I have become pretty much against being that tolerant.