theguardian: keep svg and figcaption tags, prune content

Open Kdecherf opened this issue 2 years ago • 9 comments

This PR removes stripping of figcaption tags for images and re-enables pruning of content.

This will, among other things, prune empty malformed figure tags (see https://github.com/j0k3r/php-readability/pull/66 for related info)

Kdecherf avatar Oct 29 '21 13:10 Kdecherf

Will test and post again this weekend.

fivefilters avatar Oct 30 '21 00:10 fivefilters

@fivefilters Have you tried it?

j0k3r avatar Jan 24 '22 11:01 j0k3r

@j0k3r @Kdecherf, sorry completely forgot about this. Thanks for the reminder.

I've tested a bit now, but I can't see any specific URLs that these changes are trying to improve. Do you have some perhaps?

The reason for SVG stripping is to get rid of inline SVG that you can see on a page like this: https://www.theguardian.com/books/live/2016/oct/13/nobel-prize-in-literature-2016-liveblog. Are there pages that contain SVGs relevant to the content that we really need to preserve?

I'm happy to have figcaptions preserved (I can't remember why they were being stripped in the first place).

As for pruning, I don't mind removing it if it improves things for you. In the past the pruning in the original Readability code (which to be fair is a bit dated in the version we use in Full-Text RSS) often removed more than desired, so it's just been a habit to disable pruning for all new site config files produced and explicitly target undesired elements with the strip rules. But I didn't see any odd results having pruning enabled in my tests for this site, so I'm okay with that particular change.
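
For readers unfamiliar with the site config format, the habit described above (pruning off, unwanted elements targeted explicitly) might look roughly like the sketch below. This is a hypothetical excerpt for illustration only: the actual contents of the theguardian.com config and the exact lines changed by this PR are not shown in this thread.

```
# Hypothetical excerpt, NOT the real theguardian.com config
# Pruning disabled; undesired elements removed with explicit strip rules
prune: no
strip: //svg
strip: //figcaption
```

Under the same assumptions, the change proposed here would amount to switching to `prune: yes` and dropping the `//figcaption` and `//svg` strip rules.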

fivefilters avatar Jan 24 '22 21:01 fivefilters

Well… I forgot the context of the svg change, I don't remember any specific URLs right now.

> The reason for SVG stripping is to get rid of inline SVG that you can see on a page like this: https://www.theguardian.com/books/live/2016/oct/13/nobel-prize-in-literature-2016-liveblog. Are there pages that contain SVGs relevant to the content that we really need to preserve?

@fivefilters out of doubt I've tested it on wallabag and it does not seem to raise any issue. Is there a notable change on your side?

Kdecherf avatar Feb 01 '22 22:02 Kdecherf

@Kdecherf Unfortunately yes, we end up with inline SVG icons that we don't want. That's why I was curious if there were SVG elements that are actually related to the content (e.g. illustrations) that we can preserve some other way.

fivefilters avatar Feb 01 '22 22:02 fivefilters

@fivefilters now I'm curious, could you provide a screenshot and/or export of what you get please?

Kdecherf avatar Feb 01 '22 23:02 Kdecherf

@Kdecherf if I get a chance to test again with it disabled I'll send you a screenshot. But if you want to know what kind of SVGs are included, just open up the URL I sent you before and do a view source in your browser. You'll find plenty of small, irrelevant <svg> elements in the source HTML.

fivefilters avatar Feb 03 '22 23:02 fivefilters

@Kdecherf is this PR still needed?

At least the //svg has to be stripped, because of the huge hyphens that show up for the given URL (screenshot).

I don't see an issue in wallabag with this article.

HolgerAusB avatar Oct 05 '23 05:10 HolgerAusB

@HolgerAusB I'll try to take time before the end of the year to check this PR

Kdecherf avatar Nov 13 '23 12:11 Kdecherf