Multiple problems in scraping of multimedia content
I have created a small test page covering multiple types of multimedia content and the ways to include them; everything on it is standard. It is available here: https://en.m.wikipedia.org/wiki/User:Kelson/MWoffliner_CI_reference.
I have scraped it with 1.9.4 and the result was a bit disappointing. There are many problems, most of them being that the content is simply not made available. I think such a page should be tested properly to ensure that we no longer have big problems with displaying multimedia content.
Many things are broken because of the keepEmptyParagraphs issue, fixed in https://github.com/openzim/mwoffliner/pull/886
Just merged a few pull requests and things are looking much better :)
@ISNIT0 We need automated tests for this multimedia scraping... I have lost count of the tickets I have opened in the past for multimedia content not mirrored properly... and I had to open one a week ago. I don't want to open new ones in the future. This has to be secured.
BTW, I'm quite sure there is a way to inject wikicode into the parsoid/MSC API and get the HTML back. So the automated tests should use that instead of starting directly from HTML (which offers no guarantee that this is the kind of HTML that MediaWiki still delivers).
Testing this is not in 1.9 or 2.0
https://en.wikipedia.org/api/rest_v1/#/Transforms/post_transform_wikitext_to_html
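A minimal sketch of how a test could turn wikicode into HTML through that transform endpoint, rather than starting from stored HTML. The helper names (`buildTransformRequest`, `wikitextToHtml`, `FetchLike`) are hypothetical and not part of mwoffliner; the `body_only` field is assumed from the REST API documentation linked above and should be checked against it. The fetch function is passed in so that the request construction can be tested without network access.

```typescript
// Hypothetical helpers for illustration only; not mwoffliner code.
const REST_BASE = "https://en.wikipedia.org/api/rest_v1";

interface TransformRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Minimal shape of the fetch function the sketch relies on.
type FetchLike = (
  url: string,
  init: TransformRequest["init"],
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

// Build the POST request for the wikitext-to-HTML transform endpoint.
function buildTransformRequest(
  wikitext: string,
  restBase: string = REST_BASE,
): TransformRequest {
  return {
    url: `${restBase}/transform/wikitext/to/html`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // body_only (assumed parameter) asks for the body content only.
      body: JSON.stringify({ wikitext, body_only: true }),
    },
  };
}

// Send the wikitext and return the HTML produced by the API.
async function wikitextToHtml(
  wikitext: string,
  fetchFn: FetchLike,
): Promise<string> {
  const { url, init } = buildTransformRequest(wikitext);
  const res = await fetchFn(url, init);
  if (!res.ok) {
    throw new Error(`Transform failed: HTTP ${res.status}`);
  }
  return res.text();
}
```

A test could then call `wikitextToHtml("[[File:Example.ogv]]", fetch)` (with Node's global `fetch`, for instance), feed the returned HTML to the scraper's rendering step, and assert that the expected media elements survive.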
I will take a detailed look at this ticket to see if it works now.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
@benoit74 I think we should reassess this issue. The fix is planned for 2.0.0, but I believe it would be good to reassess earlier to see whether:
- we still have the problems
- the impact is still big enough that we have to fix it in 1.17.0
- we could close/merge https://github.com/openzim/mwoffliner/pull/1821