Adrien Barbaresi comments

Results: 412 comments of Adrien Barbaresi

Merge multiple nodes returned by XPath

@hugoobauer Are you still working on it or should I close the PR for now?

TXT output doesn't produce markdown-compliant paragraphs

Thanks for your feedback. There is still an ambiguity, as the output is not officially Markdown but TXT with Markdown elements, depending on the configuration. So yes, it can be...

Entire/majority content of these 2 sites being missed out

The page `https://kickstarter.mycaptain.in/privacy-policy` puts content in `` tags, which is rarely seen (`` tags are expected). The page `https://www.shopify.com/legal/privacy` uses several `` tags within a `` frame which confuses the...

save cookies on redirect

Good point, I know this kind of problem. There are two different libraries performing the requests, depending on whether the machine has pycurl or not. It would mean finding a...
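The situation described in this comment can be sketched as follows. This is only an illustration, not trafilatura's actual code: `pick_backend` is a hypothetical helper, and the fallback name `urllib3` is an assumption, since the comment does not name the second library. The sketch shows only the import check that decides which library would perform the request.

```python
import importlib.util


def pick_backend(preferred: str = "pycurl", fallback: str = "urllib3") -> str:
    """Return the name of the HTTP backend to use on this machine.

    Hypothetical helper: both names are illustrative defaults.
    """
    # find_spec returns None when the top-level module is not installed
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


# Prints "pycurl" if it is installed, otherwise the assumed fallback name
print(pick_backend())
```

Any cookie handling on redirect would have to behave the same way in both branches, which is what makes the problem awkward.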

Here is an interesting example... any tips?

Hi @krstp, indeed, the extraction algorithms fail to capture the text, and I am not sure why. I'll leave the thread open to see if we can find a solution.

Returns horribly bad result for MSN page

I can reproduce the issue; the parser seems to have trouble with non-standard HTML code.

Is it possible to get the metadata with markdown format?

Good point, the code here could definitely be improved to add further metadata: https://github.com/adbar/trafilatura/blob/123414cae5f927e743f5eced2cd43b81a65fc43c/trafilatura/xml.py#L41

Consider switching from lxml's clean_html for enhanced security (and possibly performance)

Thanks for the information, I've been watching nh3 closely and could use it and/or re-implement what I needed from the cleaner (nothing critical).

XML Parsing breaks on valid HTML

Hi @Jufik, I cannot reproduce the bug. Which platform are you using?

XML Parsing breaks on valid HTML

There are sometimes problems with lxml on M1/M2 platforms. Installing trafilatura (and thus lxml) with brew could help. We could also sanitize the output as you say.