ftr-site-config
theguardian: keep svg and figcaption tags, prune content
This PR removes stripping of figcaption tags for images and re-enables pruning of content.
This will, among other things, prune empty malformed figure tags (see https://github.com/j0k3r/php-readability/pull/66 for related info)
Will test and post again this weekend.
@fivefilters Have you tried it?
@j0k3r @Kdecherf, sorry completely forgot about this. Thanks for the reminder.
I've tested a bit now, but I can't see any specific URLs that these changes are trying to improve. Do you have some perhaps?
The reason for SVG stripping is to get rid of inline SVG that you can see on a page like this: https://www.theguardian.com/books/live/2016/oct/13/nobel-prize-in-literature-2016-liveblog Are there pages that contain SVGs relevant to the content that we really need to preserve?
I'm happy to have figcaptions preserved (I can't remember why they were being stripped in the first place).
As for pruning, I don't mind removing it if it improves things for you. In the past the pruning in the original Readability code (which to be fair is a bit dated in the version we use in Full-Text RSS) often removed more than desired, so it's just been a habit to disable pruning for all new site config files produced and explicitly target undesired elements with the strip rules. But I didn't see any odd results having pruning enabled in my tests for this site, so I'm okay with that particular change.
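For readers unfamiliar with the site config format, here is a minimal sketch of the directives being discussed. The XPath expressions are illustrative assumptions and are not copied from the actual theguardian.com config file:

```
# Illustrative sketch only, not the real theguardian.com site config.
# Habitual approach described above: disable pruning and
# target unwanted elements explicitly with strip rules.
prune: no
strip: //svg
strip: //figcaption

# What this PR proposes instead: re-enable pruning and stop
# stripping figcaption (and svg) elements.
# prune: yes
```

With `prune: no`, only elements matched by a `strip:` rule are removed; with `prune: yes`, the Readability-style cleanup also runs on the extracted content, which is what would drop the empty figure tags mentioned in the PR description.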
Well… I forgot the context of the svg change, I don't remember any specific URLs right now.
@fivefilters just to be sure, I've tested it on wallabag and it does not seem to raise any issues. Is there a notable change on your side?
@Kdecherf Unfortunately yes, we end up with inline SVG icons that we don't want. That's why I was curious if there were SVG elements that are actually related to the content (e.g. illustrations) that we can preserve some other way.
@fivefilters now I'm curious, could you provide a screenshot and/or export of what you get please?
@Kdecherf if I get a chance to test again with it disabled I'll send you a screenshot. But if you want to know what kind of SVGs are included, just open up the URL I sent you before and do a view source in your browser. You'll find plenty of small, irrelevant <svg> elements in the source HTML.
@Kdecherf is this PR still needed?
At the very least, //svg has to be stripped, because of the mega hyphens that show up for the given URL.
I don't see an issue in wallabag with this article.
@HolgerAusB I'll try to find time before the end of the year to check this PR.