trafilatura
Missing h1 heading if <header> outside of <article>
I'm consistently finding that the extracted article content leaves out the main h1 heading when it sits in a <header> element outside of the <article>.
A common example is WaPo (The Washington Post), e.g. https://www.washingtonpost.com/dc-md-va/2024/05/14/maryland-democratic-senate-primary/
The extracted content begins at "Prince George’s County Executive Angela D. Alsobrooks..." - it misses the h1 as well as the subheading right below it.
I've been trying to do some preprocessing of the HTML (basically moving the h1 element into the
Any thoughts?
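
For illustration, the kind of preprocessing I mean could look roughly like the sketch below. It is only a sketch under assumptions: it treats the first <article> element as the target, the helper name `move_h1_into_article` is made up for this example, and it uses regular expressions purely to stay dependency-free (a real HTML parser would be more robust on messy markup).

```python
import re

# Assumption: the page has at most one relevant <h1> and one <article>.
H1_RE = re.compile(r"<h1\b[^>]*>.*?</h1>", re.IGNORECASE | re.DOTALL)
ARTICLE_OPEN_RE = re.compile(r"<article\b[^>]*>", re.IGNORECASE)


def move_h1_into_article(html: str) -> str:
    """Move the first <h1> to the start of the first <article> (hypothetical helper)."""
    h1 = H1_RE.search(html)
    article = ARTICLE_OPEN_RE.search(html)
    if h1 is None or article is None:
        return html  # nothing to move, or nowhere to move it to
    if h1.start() > article.end():
        return html  # the h1 already comes after the <article> opening tag
    # Cut the h1 out of its original position.
    without_h1 = html[:h1.start()] + html[h1.end():]
    # Offsets shifted after the cut, so locate the <article> tag again.
    article = ARTICLE_OPEN_RE.search(without_h1)
    insert_at = article.end()
    return without_h1[:insert_at] + h1.group(0) + without_h1[insert_at:]


sample = (
    "<html><body><header><h1>Title</h1></header>"
    "<article><p>Body text.</p></article></body></html>"
)
moved = move_h1_into_article(sample)
```

The rewritten HTML string would then be passed to `trafilatura.extract()` in place of the original download. This does not handle the subheading below the h1, which would need the same treatment.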