trafilatura
Link proportion heuristic fails for link paragraph
Articles on the Fox News website contain links to other articles in the middle of the text. These links all follow the same pattern: p > a > strong > u (plus all-caps text). The link filter should catch this, but the paragraph evades detection:
<p>
<a href="https://www.foxnews.com/apps-products?pid=AppArticleLink" target="_blank">
<strong><u>CLICK TO GET THE FOX NEWS APP</u></strong>
</a>
</p>
Found on fundus-evaluation.
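For illustration, here is a minimal sketch of what a link proportion check would report for this paragraph. It is not trafilatura's actual implementation: `LinkTextCounter` and `link_text_ratio` are hypothetical names, and the sketch uses only the standard library. It measures the share of a fragment's visible text that sits inside `<a>` elements, regardless of how deeply the text is nested.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Hypothetical helper: counts text characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while the parser is inside an <a> element
        self.link_chars = 0     # characters of text found inside links
        self.total_chars = 0    # characters of text found anywhere

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only nodes such as the newlines between tags.
        length = len(data.strip())
        self.total_chars += length
        if self.anchor_depth > 0:
            # Text nested in <strong><u>...</u></strong> still counts as link text.
            self.link_chars += length


def link_text_ratio(html_fragment):
    """Return the fraction of visible text that is link text (0.0 if there is no text)."""
    counter = LinkTextCounter()
    counter.feed(html_fragment)
    counter.close()
    if counter.total_chars == 0:
        return 0.0
    return counter.link_chars / counter.total_chars


SNIPPET = """<p>
<a href="https://www.foxnews.com/apps-products?pid=AppArticleLink" target="_blank">
<strong><u>CLICK TO GET THE FOX NEWS APP</u></strong>
</a>
</p>"""

ratio = link_text_ratio(SNIPPET)
print(f"link text ratio: {ratio:.2f}")
```

By this measure, all of the paragraph's text is link text, so a check based on the text proportion alone would be expected to flag it. That suggests the miss is related to how the nested `strong` and `u` elements or the surrounding whitespace are handled, though that is an assumption and not something confirmed here.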