Adrien Barbaresi comments

Results: 412 comments by Adrien Barbaresi

XML Parsing breaks on valid HTML

This ongoing PR takes a different approach to document sanitizing. It should also solve this problem, although I can't replicate it.

XML Parsing breaks on valid HTML

@Jufik Is the problem solved?

Corrupted Markdown output when TXT+formatting

Hi @clach04, thanks for your feedback. First, I think you could simplify the test:

```
wget -O wget_output.html http://www.pcgamer.com/2012/08/09/an-illusionist-in-skyrim-part-1/
cat wget_output.html | trafilatura --formatting
```

Then there are two different...

Proxy support to Trafilatura

Trafilatura's download utilities should stay simple in order not to confuse users. There are lots of alternatives and downloading at scale is a different challenge altogether. A worst case solution...
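As a rough illustration of the "alternatives" route, the download can be handled outside Trafilatura and only the resulting HTML handed over for extraction. The sketch below uses the standard library's `urllib.request.ProxyHandler`; the helper names `make_proxy_opener` and `fetch_html` and the proxy address in the usage note are hypothetical and are not part of Trafilatura.

```python
import urllib.request


def make_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that routes HTTP and HTTPS requests through proxy_url.

    Hypothetical helper for illustration, not a Trafilatura function.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_html(url: str, opener: urllib.request.OpenerDirector,
               timeout: float = 30.0) -> str:
    """Download a page through the given opener and decode it to text."""
    with opener.open(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

With an opener built as `make_proxy_opener("http://127.0.0.1:8080")` (a placeholder address), the string returned by `fetch_html` can then be passed to `trafilatura.extract()`, which accepts already-downloaded HTML.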

`included_images` failed when trying to extract images in a table

Hi @ChangyaoTian, thanks for your feedback, it appears there is an issue with table processing here. Images are not my priority but I'll leave the thread open.

`included_images` failed when trying to extract images in a table

@drunkpig Would you be interested in drafting a pull request?

`included_images` failed when trying to extract images in a table

@drunkpig Please go ahead then.

Empty h1 blocks non-empty h2

Hi @pieterhartel, these are corner cases, but it is indeed a metadata extraction problem.

author metadata field is null for YouTube videos

Hi @basilioss, I can reproduce the issue. I assume an additional XPath expression is needed to target author names on YouTube.
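To make the idea concrete, here is a minimal sketch of what such an expression could look like. It assumes the channel name is exposed through schema.org microdata (a `link` element with `itemprop="name"` inside a `span` with `itemprop="author"`); the sample markup, the `AUTHOR_XPATH` constant and the `extract_author` helper are illustrative assumptions, not Trafilatura's actual rules. The standard library's `xml.etree.ElementTree` is used to keep the example self-contained, so the sample has to be well-formed XML; real pages would need an HTML parser such as lxml.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup: real YouTube pages may differ and change often.
SAMPLE = """<html><body>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<link itemprop="url" href="https://www.youtube.com/@example"/>
<link itemprop="name" content="Example Channel"/>
</span>
</body></html>"""

# Hypothetical expression targeting the author name in the microdata block.
AUTHOR_XPATH = ".//span[@itemprop='author']/link[@itemprop='name']"


def extract_author(tree):
    """Return the author name found by AUTHOR_XPATH, or None if absent."""
    node = tree.find(AUTHOR_XPATH)
    if node is None:
        return None
    # An empty or missing content attribute is treated as "no author".
    return node.get("content") or None


author = extract_author(ET.fromstring(SAMPLE))
```

An expression this specific only covers one page layout, so it would have to sit alongside the more generic author expressions rather than replace them.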

author metadata field is null for YouTube videos

I regularly add XPath expressions to address metadata issues, e.g. #567. I tried to fix this issue, but YouTube extraction is too variable for a generic extractor like Trafilatura, it...