Missing h1 heading if <header> outside of <article>

Open chrisgoddard opened this issue 7 months ago • 2 comments

I'm consistently finding that the extracted article content omits the main h1 heading when it sits in a <header> element outside of the <article>.

A common example is The Washington Post (WaPo), e.g. https://www.washingtonpost.com/dc-md-va/2024/05/14/maryland-democratic-senate-primary/

The extracted content begins at "Prince George’s County Executive Angela D. Alsobrooks..." - it misses the h1 as well as the subheading right below it.

I've been trying to preprocess the HTML (basically moving the h1 element into the <article>), but I can't get it working. Going through the code, I can't quite figure out why exactly it's being filtered out in the first place.
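For reference, here is a minimal sketch of the kind of preprocessing described above, assuming lxml is available (trafilatura already depends on it). The helper name `move_h1_into_article` is hypothetical and not part of trafilatura. It moves only the first h1, not the subheading, and it is not confirmed to fix the extraction.

```python
from lxml import html


def move_h1_into_article(raw_html: str) -> str:
    """Move the first <h1> to the start of the first <article>.

    Hypothetical helper for illustration only. The input is returned
    unchanged if either element is missing or if the <h1> is already
    inside the <article>.
    """
    tree = html.fromstring(raw_html)
    h1 = next(tree.iter("h1"), None)
    article = next(tree.iter("article"), None)
    if h1 is None or article is None:
        return raw_html
    if any(ancestor is article for ancestor in h1.iterancestors()):
        return raw_html

    # Detach the heading; drop_tree() leaves any trailing text in the
    # original location, so clear the copy still attached to the element.
    h1.drop_tree()
    h1.tail = None

    # Re-attach the heading as the first child of the article.
    article.insert(0, h1)
    return html.tostring(tree, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<html><body><header><h1>Title</h1></header>"
        "<article><p>Body text.</p></article></body></html>"
    )
    print(move_h1_into_article(sample))
```

The returned string can then be passed to `trafilatura.extract()` in place of the original HTML to check whether the heading survives extraction.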

Any thoughts?

chrisgoddard, Jul 11 '24 00:07