Adrien Barbaresi
I'm not sure what happens here, but this is odd indeed. Note that you can use a [web archive](https://archive.is/SS9w9) to reproduce the errors later. In general, duplicated elements can...
@fortyfourforty The integrated deduplication does prevent identical text segments on the same page.
It's not a bug in itself, but I agree things could be improved. Do you want to work on a PR?
@naktinis You wrote code targeting tables, maybe you are also interested.
@chitralverma I won't implement this now, but I'm open to reviewing a pull request if you're interested.
Thanks for the detailed description, it seems to be a bug indeed.
@masylum I need more context to reproduce the bug. The following HTML is not enough: there is no HTML code in the output. Could you please try to make the...
I can indeed reproduce the bug. Images are not my priority, the corresponding code mostly consists of a series of contributions and it's not perfect. Let's see if someone can...
For further reference: see also https://github.com/adbar/trafilatura/issues/662.
@naktinis Yes, there is a reason: text within div (and nothing else) is generally undesirable. It is always a tradeoff between precision and recall. The easiest way I see is...