mwoffliner Improve fault tolerance if slight errors in --articleList

Open kelson42 opened this issue 3 years ago • 4 comments

Having the following list of articles:

$ cat articles 
/dev/zero
Device_file

Scraping it:

mw --mwUrl="https://en.wikipedia.org" --articleList=articles

We notice that the welcome page list is missing the article "Research Unix"... but the article is in the ZIM.

The reason is the space between "Research" and "Unix".

What happens is that Wikipedia's MediaWiki is configured to automatically replace the space with an underscore (but not every MediaWiki instance is configured that way). Therefore it scrapes the article properly, but the scraped articleId (with underscore) no longer matches what is in the list. As a consequence, when the main page list is created from the given articleId, nothing can be found in Redis, and the article is skipped.
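
The mismatch can be reproduced in isolation. The following is a minimal sketch, not mwoffliner code: `storedIds` is a hypothetical stand-in for the keys held in Redis, and `toUnderscoreForm` is an illustrative helper that mimics the space-to-underscore rewrite described above.

```typescript
// Hypothetical stand-in for the article IDs stored in Redis after scraping.
const storedIds: Set<string> = new Set(["Research_Unix", "Device_file"]);

// Illustrative helper (an assumption, not a mwoffliner function):
// mimic the wiki rewriting spaces to underscores.
function toUnderscoreForm(articleId: string): string {
  return articleId.trim().replace(/ /g, "_");
}

// The ID exactly as it is written in the --articleList file.
const requestedId = "Research Unix";

// Exact lookup misses, lookup on the underscore form hits.
const exactHit = storedIds.has(requestedId);
const normalizedHit = storedIds.has(toUnderscoreForm(requestedId));

console.log(exactHit, normalizedHit); // prints: false true
```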

The solution is probably to somehow correct the original articleId list and replace the "faulty" articleId "Research Unix" with "Research_Unix". As far as I can see, the most elegant approach is to have getArticlesByIds() return a promise with a hash of all articles which have not been found, or which have been found under another articleId. That way, further processing could be done and this edge case could be handled properly.
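
One shape such a return value could take is sketched below. This is an illustration under assumptions, not the actual `getArticlesByIds()` signature: `ArticleStore`, `LookupReport`, `classifyArticleIds`, and `lookupArticleIds` are hypothetical names, a plain `Map` stands in for Redis, and only the space-to-underscore case is handled.

```typescript
// Hypothetical in-memory stand-in for the Redis article store.
type ArticleStore = Map<string, { title: string }>;

// Hypothetical report: IDs that matched as-is, IDs that matched under
// another ID, and IDs that were not found at all.
interface LookupReport {
  found: string[];
  foundUnderOtherId: Record<string, string>; // requested ID -> stored ID
  notFound: string[];
}

// Synchronous core: classify each requested ID against the store.
function classifyArticleIds(store: ArticleStore, requestedIds: string[]): LookupReport {
  const report: LookupReport = { found: [], foundUnderOtherId: {}, notFound: [] };
  for (const id of requestedIds) {
    const underscored = id.trim().replace(/ /g, "_");
    if (store.has(id)) {
      report.found.push(id);
    } else if (store.has(underscored)) {
      report.foundUnderOtherId[id] = underscored;
    } else {
      report.notFound.push(id);
    }
  }
  return report;
}

// Promise-returning wrapper, matching the "returning a promise" idea above.
async function lookupArticleIds(store: ArticleStore, requestedIds: string[]): Promise<LookupReport> {
  return classifyArticleIds(store, requestedIds);
}

// Usage with made-up data.
const exampleStore: ArticleStore = new Map([
  ["Research_Unix", { title: "Research Unix" }],
  ["Device_file", { title: "Device file" }],
]);
lookupArticleIds(exampleStore, ["Research Unix", "Device_file", "No such page"])
  .then((report) => console.log(JSON.stringify(report)));
```

With a report like this, the caller could rewrite the list entries found under another articleId before building the main page, or fail loudly on them instead of silently skipping.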

Apr 10 '21 15:04 kelson42

@MananJethwani I discovered this while working on #1434 (but the problem has always existed and was somehow known to me). Now that we have fixed most of the problems related to --articleList (in particular around redirect handling), maybe we could have a look at this one?

Apr 10 '21 15:04 kelson42

@kelson42 Sure, we can start working on this issue as well in a while.

Apr 14 '21 15:04 MananJethwani

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

Jun 16 '21 22:06 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

Sep 21 '22 03:09 stale[bot]

Meanwhile, I have become pretty much against being that tolerant.

Mar 05 '23 11:03 kelson42
