anchor issue

Open pieterhartel opened this issue 3 years ago • 5 comments

It seems that a link without an href is sometimes ignored. Consider the sample HTML below:

$ cat anchor.html 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
</head>
<body>

<h1>FOO.</h1>
<p><strong>FOO!</strong></p>
<p>BE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - <a href="http://peyueomdqxfjxtpg.onion">http://peyueomdqxfjxtpg.onion</a> Please bookmark us.</p>

    <h1>The quick brown fox jumps over the lazy dog  1</h1>
    <a>The quick brown fox jumps over the lazy dog  2</a>
    <h1><a>The quick brown fox jumps over the lazy dog  3</a></h1>
    The quick brown fox jumps over the lazy dog  4
Lorem ipsum
</body></html>

I would expect four occurrences of "The quick brown fox jumps over the lazy dog", numbered 1, 2, 3 and 4. But #3 is missing:

$ trafilatura --json --links <anchor.html 
{"title": "FOO.", "author": null, "hostname": null, "date": null, "categories": "", "tags": "",
"fingerprint": "O0AByIFzTc/NCqx2cgJPXyjnK3s=", "id": null, "license": null,
"raw-text": "FOO. FOO! BE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - http://peyueomdqxfjxtpg.onion Please bookmark us. The quick brown fox jumps over the lazy dog 1 The quick brown fox jumps over the lazy dog 2 The quick brown fox jumps over the lazy dog 4 Lorem ipsum",
"source": null, "source-hostname": null, "excerpt": null,
"text": "FOO.\nFOO!\nBE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - http://peyueomdqxfjxtpg.onion Please bookmark us.\nThe quick brown fox jumps over the lazy dog 1\nThe quick brown fox jumps over the lazy dog 2\nThe quick brown fox jumps over the lazy dog 4\nLorem ipsum",
"comments": ""}

pieterhartel • Nov 18 '21 12:11

@pieterhartel There was a small issue here, which I fixed; the rest can be explained by the orphan text at the bottom. If you write <p>The quick brown fox jumps over the lazy dog 4</p>, then you will see it in the output.

The reason is that trailing titles at the bottom of articles are discarded during extraction, which improves extraction quality. In this particular case it does not work, but in general it is not a bug IMO.

adbar • Jan 28 '22 17:01

@adbar wrote "The reason is that trailing titles at the bottom of articles are discarded during extraction". I don't think that this is the case: there is text following the <h1> element in the example. I installed the latest version of trafilatura and added more text after the <h1> in the example, and the title still does not show up.

pieterhartel • Jan 30 '22 15:01

I get your point, but the last title in your example is followed by orphan text without a tag, so the last tag seen by the parser is <h1>.

adbar • Feb 21 '22 12:02
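
To make the "orphan text" point above concrete, here is a minimal sketch that uses only Python's standard-library html.parser, not trafilatura's own code. It reports text sitting directly under <body> together with the last tag that was closed before it. The class name OrphanTextFinder and the shortened sample are made up for illustration, and the simple tag stack assumes markup without void elements such as <br>.

```python
from html.parser import HTMLParser


class OrphanTextFinder(HTMLParser):
    """Hypothetical helper: finds text that is a direct child of <body>."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags
        self.last_closed = None  # most recently closed tag
        self.orphans = []        # (last closed tag, orphan text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop until the matching opening tag has been removed.
            while self.stack and self.stack.pop() != tag:
                pass
            self.last_closed = tag

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and self.stack and self.stack[-1] == "body":
            self.orphans.append((self.last_closed, text))


# Shortened version of the sample from the first comment.
SAMPLE = """<html><body>
<h1>The quick brown fox jumps over the lazy dog  1</h1>
<a>The quick brown fox jumps over the lazy dog  2</a>
<h1><a>The quick brown fox jumps over the lazy dog  3</a></h1>
The quick brown fox jumps over the lazy dog  4
Lorem ipsum
</body></html>"""

finder = OrphanTextFinder()
finder.feed(SAMPLE)
finder.close()
orphans = finder.orphans
print(orphans)
```

In this sample the only orphan text is the run starting with "... 4", and the last tag closed before it is the <h1> holding heading 3, which is the situation described in the comment above.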

I don't know how common this is, but there are definitely pages with valuable text in h1 blocks lower in the page. An example I ran into is https://mywellself.ca/about-us.

Is there any workaround to extract the text in these multiple h1 blocks and include it?

chakravir • Sep 11 '23 02:09

@chakravir Trafilatura tries to work in a generic way, so there is only limited potential for customization.

adbar • Oct 10 '23 12:10
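
One possible workaround for the h1 question above, sketched entirely outside trafilatura: collect the heading text yourself with Python's standard-library html.parser and keep it alongside whatever the extractor returns. HeadingCollector and collect_headings are hypothetical names, and this is not a trafilatura feature; it also makes no attempt to tell article headings from navigation or footer headings.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Hypothetical helper: gathers the text of every <h1> (or another tag)."""

    def __init__(self, tag="h1"):
        super().__init__()
        self.tag = tag
        self.depth = 0      # > 0 while inside the target tag
        self.parts = []     # text fragments of the current heading
        self.headings = []  # finished headings, in document order

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join fragments (e.g. from nested <a>) and normalize spaces.
                text = " ".join("".join(self.parts).split())
                if text:
                    self.headings.append(text)
                self.parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def collect_headings(html, tag="h1"):
    collector = HeadingCollector(tag)
    collector.feed(html)
    collector.close()
    return collector.headings


html = "<body><h1>First</h1><p>Body text</p><h1><a>Second  title</a></h1>trailing</body>"
headings = collect_headings(html)
print(headings)
```

The resulting list could then be compared with the extracted text, and any heading missing from it appended or stored separately, depending on what the downstream use needs.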