Adrien Barbaresi

Results: 412 comments by Adrien Barbaresi

This ongoing PR takes a different approach to document sanitizing. It should also solve this problem, although I can't reproduce it.

@Jufik Is the problem solved?

Hi @clach04, thanks for your feedback. First, I think you could simplify the test:

```
wget -O wget_output.html http://www.pcgamer.com/2012/08/09/an-illusionist-in-skyrim-part-1/
cat wget_output.html | trafilatura --formatting
```

Then there are two different...

Trafilatura's download utilities should stay simple in order not to confuse users. There are lots of alternatives, and downloading at scale is a different challenge altogether. A worst-case solution...

Hi @ChangyaoTian, thanks for your feedback. It appears there is an issue with table processing here. Images are not my priority, but I'll leave the thread open.

@drunkpig Would you be interested in drafting a pull request?

Hi @pieterhartel, these are corner cases, but it is indeed a metadata extraction problem.

Hi @basilioss, I can reproduce the issue. I assume it's necessary to add an additional XPath expression to target author names on YouTube.
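
To illustrate what an XPath expression targeting an author name could look like, here is a minimal sketch. The markup, the `AUTHOR_XPATH` expression and the `find_author` helper are all hypothetical and are not taken from Trafilatura's code base or from an actual YouTube page. The sketch uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, whereas Trafilatura itself relies on lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified and well-formed markup; real pages vary a lot.
SNIPPET = """
<div id="watch">
  <span itemprop="author">
    <link itemprop="url" href="https://www.example.org/channel" />
    <link itemprop="name" content="Example Channel" />
  </span>
</div>
"""

# Assumed expression: the "name" link nested inside the "author" span.
AUTHOR_XPATH = ".//span[@itemprop='author']/link[@itemprop='name']"


def find_author(markup):
    """Return the author name found by AUTHOR_XPATH, or None if absent."""
    tree = ET.fromstring(markup.strip())
    node = tree.find(AUTHOR_XPATH)
    if node is None:
        return None
    return node.get("content")


print(find_author(SNIPPET))
```

A single expression like this only covers one page layout, so pages that vary a lot would likely need several such expressions.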

I regularly add XPath expressions to address metadata issues, e.g. #567. I tried to fix this issue, but YouTube extraction is too variable for a generic extractor like Trafilatura, it...