Adrien Barbaresi

Results: 412 comments by Adrien Barbaresi

@hugoobauer Are you still working on it or should I close the PR for now?

Thanks for your feedback. There is still an ambiguity, as the output is not officially Markdown but TXT with Markdown elements, depending on the configuration. So yes, it can be...

The page `https://kickstarter.mycaptain.in/privacy-policy` puts content in `` tags, which is rarely seen (`` tags are expected). The page `https://www.shopify.com/legal/privacy` uses several `` tags within a `` frame which confuses the...

Good point, I know this kind of problem. There are two different libraries performing the requests, depending on whether the machine has pycurl or not. It would mean finding a...
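As a rough illustration of the point above (the request backend depends on whether pycurl is installed), here is a minimal sketch of that kind of detection. The function name `pick_backend` and the fallback name `urllib3` are assumptions made for the example; this is not the project's actual code.

```python
import importlib.util

# Assumed name of the library used when pycurl is missing (illustrative only).
FALLBACK_BACKEND = "urllib3"


def pick_backend(prefer_pycurl: bool = True) -> str:
    """Return the name of the HTTP backend that would be used.

    Hypothetical helper: it only checks whether pycurl is importable
    on this machine and otherwise reports the assumed fallback.
    """
    has_pycurl = importlib.util.find_spec("pycurl") is not None
    if prefer_pycurl and has_pycurl:
        return "pycurl"
    return FALLBACK_BACKEND


if __name__ == "__main__":
    # The result differs between machines, which is why behaviour
    # can be hard to reproduce across environments.
    print("Requests would go through:", pick_backend())
```

Logging the chosen backend in a bug report makes it easier to tell whether a difference in behaviour comes from the request library rather than from the extraction itself.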

Hi @krstp, indeed, the extraction algorithms fail to capture the text, not sure why. I'll leave the thread open to see if we can find a solution.

I can reproduce the issue; the parser seems to have trouble with non-standard HTML code.

Good point, the code here could definitely be improved to add further metadata: https://github.com/adbar/trafilatura/blob/123414cae5f927e743f5eced2cd43b81a65fc43c/trafilatura/xml.py#L41

Thanks for the information, I've been watching nh3 closely and could use it and/or re-implement what I needed from the cleaner (nothing critical).

Hi @Jufik, I cannot reproduce the bug, which platform are you using?

There are sometimes problems with LXML on M1/M2 platforms. Installing trafilatura (and thus lxml) with brew could help. We could also sanitize the output as you say.