php-readability icon indicating copy to clipboard operation
php-readability copied to clipboard

Keep h1 and other headings

Open Kdecherf opened this issue 2 years ago • 10 comments

Even though using h1 tags for sections inside an article is semantically wrong, a lot of websites are doing it anyway. So the idea here is to stop stripping headings, including h1 on Readability's side.

Fixes https://github.com/wallabag/wallabag/issues/5805

Kdecherf avatar Jun 11 '22 18:06 Kdecherf

Request for comment @j0k3r @jtojnar

Kdecherf avatar Jun 11 '22 18:06 Kdecherf

I think this sort of clean-up has some merit. Maybe we could only decide to clean out the h1 if there is only a single one, like we do with h2. And only clean up h2 if there is no h1?

jtojnar avatar Jun 11 '22 22:06 jtojnar

@jtojnar done

Kdecherf avatar Jun 28 '22 19:06 Kdecherf

Pull Request Test Coverage Report for Build 2583510249

  • 4 of 4 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 84.669%

Totals Coverage Status
Change from base Build 2486672970: 0.2%
Covered Lines: 602
Relevant Lines: 711

💛 - Coveralls

coveralls avatar Jun 28 '22 19:06 coveralls

I'm currently having a second though about this cleanup.

Take this link: https://interestingengineering.com/innovation/china-plans-to-build-the-worlds-first-waterless-nuclear-reactor The article only contains one h2 entity and is less than 100 characters (« China's plans to curb its carbon emissions »), thus it is removed by the routine. However this heading has its importance in this context.

What should we do? Maybe we could remove the length condition and replace it with something like "if it is similar to the article's title"?

Kdecherf avatar Aug 15 '22 12:08 Kdecherf

Similarity would make sense but then we would need to decide on the precise metric.

Another possible heuristic would be checking if the heading is the first element in the content. Then it would spuriously preserve the heading in the case of <summary> <heading> <actual content> but that would resolve itself if we drop the summary (which is orthogonal to this). And under-removal is probably better than over-removal anyway.

jtojnar avatar Aug 15 '22 13:08 jtojnar

Another possible heuristic would be checking if the heading is the first element in the content

Just checking if the first child of the content is h*, right?

j0k3r avatar Sep 29 '22 12:09 j0k3r

Right, that is what I meant.

jtojnar avatar Sep 29 '22 14:09 jtojnar

I'm trying to resume work on this PR.

I've made a small check on entries hosted on my instance, using a query like select id, substring(content FOR 250) from wallabag_entry where substring(content FOR 15) ilike '<h_>%' limit 5;.

It seems that there are several cases where the content begins with a legitimate heading entity, for example:

  • https://futurism.com/the-byte/china-ai-prosecutor-crimes (h2 as a subtitle)
  • https://www.theregister.com/2019/10/23/ai_dataset_imagenet_consent/ (h2 as a subtitle)
  • https://affordance.framasoft.org/2022/03/par-dela-like-colere/ (h2)

What should we do?

Kdecherf avatar Apr 07 '23 17:04 Kdecherf

It seems that there are several cases where the content begins with a legitimate heading entity, for example:

Would not those articles still begin with h1 pre-cleanup? In that case, the second heuristic would only remove the h1 since it is the first heading in the content.

The third one would not be matched by the first heuristic.

Ideally, we would combine both.

We should add a test suite for all these cases so that we can better discuss what is happening.

jtojnar avatar Apr 12 '23 04:04 jtojnar