Multiple problems in scraping of multimedia content

Open kelson42 opened this issue 6 years ago • 10 comments

I have created a small test page covering multiple types of content and the ways to include them, all of them standard. It is available here: https://en.m.wikipedia.org/wiki/User:Kelson/MWoffliner_CI_reference.

I have scraped it with 1.9.4 and the result was a bit disappointing. We have many problems here, most of them being that the content is simply not made available. I think such a page should be tested properly to ensure that we no longer have big problems with displaying multimedia content.

kelson42 avatar Jul 10 '19 14:07 kelson42

Many things are broken because of the keepEmptyParagraphs issue, fixed in https://github.com/openzim/mwoffliner/pull/886

ISNIT0 avatar Jul 12 '19 08:07 ISNIT0

Just merged a few pull requests and things are looking much better :)

ISNIT0 avatar Jul 12 '19 08:07 ISNIT0

@ISNIT0 We need automated tests for this multimedia scraping... I have lost count of the tickets I have opened in the past for multimedia content not mirrored properly... and I had to open one a week ago. I don't want to open new ones in the future. This has to be secured.

BTW, I'm quite sure there is a way to inject wikicode into the parsoid/MSC API and get the HTML back. So the automated tests should use that instead of starting directly from HTML (which offers no guarantee that this is the kind of HTML that MediaWiki still delivers).

kelson42 avatar Jul 15 '19 15:07 kelson42

Testing this is not planned for 1.9 or 2.0.

ISNIT0 avatar Jul 18 '19 13:07 ISNIT0

https://en.wikipedia.org/api/rest_v1/#/Transforms/post_transform_wikitext_to_html
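For illustration, here is a minimal sketch of how an automated test could feed wikitext to that transform endpoint and check the returned HTML. It is not mwoffliner code: the function names (`build_transform_request`, `wikitext_to_html`, `has_media_tag`), the User-Agent string, and the `body_only` setting are assumptions made for this example, and the endpoint path is inferred from the documentation link above.

```python
import json
import urllib.request

# Assumed endpoint path, inferred from the REST API documentation page linked above.
TRANSFORM_URL = "https://en.wikipedia.org/api/rest_v1/transform/wikitext/to/html"


def build_transform_request(wikitext: str, body_only: bool = True) -> urllib.request.Request:
    """Build (but do not send) the POST request asking the wiki to render wikitext."""
    payload = json.dumps({"wikitext": wikitext, "body_only": body_only}).encode("utf-8")
    return urllib.request.Request(
        TRANSFORM_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            # Hypothetical User-Agent, chosen only for this sketch.
            "User-Agent": "mwoffliner-multimedia-test-sketch/0.1",
        },
        method="POST",
    )


def wikitext_to_html(wikitext: str, opener=urllib.request.urlopen) -> str:
    """Send the request and return the HTML.

    `opener` is injectable so that tests can stub out the network.
    """
    with opener(build_transform_request(wikitext)) as response:
        return response.read().decode("utf-8")


def has_media_tag(html: str, tag: str) -> bool:
    """Crude substring check that the HTML contains text starting an element named `tag`."""
    return f"<{tag}" in html.lower()
```

A test could then call `wikitext_to_html("[[File:Example.ogv]]")`, pass the result through the scraper's HTML processing, and assert that the media element survives. The substring check is deliberately crude; a real test would more likely parse the DOM.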

kelson42 avatar Jul 19 '19 10:07 kelson42

I will look at this ticket in detail to see whether it works now.

kelson42 avatar Aug 02 '19 08:08 kelson42

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Oct 01 '19 08:10 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Jun 08 '20 10:06 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar May 28 '23 15:05 stale[bot]

@benoit74 I think we should reassess this issue. The fix is foreseen for 2.0.0, but I believe it would be good to reassess earlier to see whether:

  • we still have the problems
  • the impact is still big enough that we have to fix it in 1.17.0
  • we could close/merge https://github.com/openzim/mwoffliner/pull/1821

kelson42 avatar Jul 20 '25 11:07 kelson42