trafilatura
anchor issue
It seems that sometimes a link without an href is ignored. Consider the sample HTML below:
$ cat anchor.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
</head>
<body>
<h1>FOO.</h1>
<p><strong>FOO!</strong></p>
<p>BE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - <a href="http://peyueomdqxfjxtpg.onion">http://peyueomdqxfjxtpg.onion</a> Please bookmark us.</p>
<h1>The quick brown fox jumps over the lazy dog 1</h1>
<a>The quick brown fox jumps over the lazy dog 2</a>
<h1><a>The quick brown fox jumps over the lazy dog 3</a></h1>
The quick brown fox jumps over the lazy dog 4
Lorem ipsum
</body></html>
I would expect four occurrences of "The quick brown fox jumps over the lazy dog", numbered 1, 2, 3 and 4. But #3 is missing:
$ trafilatura --json --links <anchor.html
{"title": "FOO.", "author": null, "hostname": null, "date": null, "categories": "", "tags": "",
"fingerprint": "O0AByIFzTc/NCqx2cgJPXyjnK3s=", "id": null, "license": null,
"raw-text": "FOO. FOO! BE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - http://peyueomdqxfjxtpg.onion Please bookmark us. The quick brown fox jumps over the lazy dog 1 The quick brown fox jumps over the lazy dog 2 The quick brown fox jumps over the lazy dog 4 Lorem ipsum",
"source": null, "source-hostname": null, "excerpt": null,
"text": "FOO.\nFOO!\nBE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - http://peyueomdqxfjxtpg.onion Please bookmark us.\nThe quick brown fox jumps over the lazy dog 1\nThe quick brown fox jumps over the lazy dog 2\nThe quick brown fox jumps over the lazy dog 4\nLorem ipsum",
"comments": ""}
@pieterhartel There was a small issue here, which I fixed; the rest can be explained by the orphan text at the bottom. If you write <p>The quick brown fox jumps over the lazy dog 4</p> then you will see it in the output.
The reason is that trailing titles at the bottom of articles are discarded during extraction, which improves extraction quality in general. In this particular case it does not work, but overall it is not a bug IMO.
@adbar wrote "The reason is that trailing titles at the bottom of articles are discarded during extraction".
I don't think that this is the case. There is text following the <h1> element in the example. I installed the latest version of trafilatura, added more text after the <h1> in the example, and the title still does not show up.
I get your point, but the last title in your example is followed by orphan text without a tag, so the last tag seen by the parser is <h1>.
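To make the "orphan text" point concrete, here is a minimal sketch using only Python's standard html.parser module. It is not how trafilatura works internally; the class OrphanTextFinder and the function find_orphans are hypothetical names made up for this illustration. It only shows that, in a trimmed copy of the sample, the text after the last <h1> sits directly inside <body> and that the last element closed before it is an <h1>.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (kept short; an assumption for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class OrphanTextFinder(HTMLParser):
    """Hypothetical helper: collect text that sits directly inside <body>."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags
        self.last_closed = None  # most recently closed tag
        self.orphans = []        # (text, last closed tag) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass
        self.last_closed = tag

    def handle_data(self, data):
        text = " ".join(data.split())
        # Text is "orphan" when its direct parent is <body>.
        if text and self.stack and self.stack[-1] == "body":
            self.orphans.append((text, self.last_closed))


# Trimmed version of anchor.html from this issue.
SAMPLE = """<html><head></head><body>
<h1>The quick brown fox jumps over the lazy dog 1</h1>
<a>The quick brown fox jumps over the lazy dog 2</a>
<h1><a>The quick brown fox jumps over the lazy dog 3</a></h1>
The quick brown fox jumps over the lazy dog 4
Lorem ipsum
</body></html>"""


def find_orphans(html):
    finder = OrphanTextFinder()
    finder.feed(html)
    finder.close()
    return finder.orphans


for text, previous_tag in find_orphans(SAMPLE):
    print(f"orphan text after </{previous_tag}>: {text}")
```

Wrapping such text in <p> gives it a parent element of its own, so it is no longer reported as orphan by this sketch.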
I don't know how common this is, but there are definitely pages where there is valuable text in h1 blocks lower in the page. An example I ran into is https://mywellself.ca/about-us.
Is there any workaround to extract the info in these multiple h1 blocks and include it?
@chakravir Trafilatura tries to work in a generic way, and there is little room for customization.
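One possible workaround outside trafilatura itself is to post-process: collect the <h1> texts separately and check which ones are absent from the extracted text. The sketch below uses only Python's standard html.parser; HeadingCollector and missing_headings are hypothetical names for this illustration, not part of trafilatura, and the plain substring check is a simplifying assumption. The extracted_text argument stands for the "text" field of the JSON output shown earlier in this thread.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Hypothetical helper: gather the text content of every <h1>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside an <h1>
        self.parts = []     # text fragments of the current heading
        self.headings = []  # finished headings, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.depth += 1
            if self.depth == 1:
                self.parts = []

    def handle_endtag(self, tag):
        if tag == "h1" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join fragments and normalize whitespace.
                text = " ".join("".join(self.parts).split())
                if text:
                    self.headings.append(text)

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def missing_headings(html, extracted_text):
    """Return <h1> texts that do not occur in extracted_text (naive substring check)."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return [h for h in collector.headings if h not in extracted_text]


if __name__ == "__main__":
    html = (
        "<body><h1>FOO.</h1>"
        "<h1>The quick brown fox jumps over the lazy dog 1</h1>"
        "<h1><a>The quick brown fox jumps over the lazy dog 3</a></h1>"
        "The quick brown fox jumps over the lazy dog 4</body>"
    )
    extracted = (
        "FOO.\nThe quick brown fox jumps over the lazy dog 1\n"
        "The quick brown fox jumps over the lazy dog 4"
    )
    print(missing_headings(html, extracted))
```

The headings returned this way still have to be merged back into the output by hand, and their original position in the page is lost, so this only helps when the heading text itself is what matters.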