Adrien Barbaresi
This ongoing PR adopts a different approach to document sanitizing. It should also solve this problem, although I can't replicate it.
@Jufik Is the problem solved?
Hi @clach04, thanks for your feedback. First, I think you could simplify the test:

```
wget -O wget_output.html http://www.pcgamer.com/2012/08/09/an-illusionist-in-skyrim-part-1/
cat wget_output.html | trafilatura --formatting
```

Then there are two different...
Trafilatura's download utilities should stay simple in order not to confuse users. There are lots of alternatives and downloading at scale is a different challenge altogether. A worst case solution...
Hi @ChangyaoTian, thanks for your feedback. It appears there is an issue with table processing here. Images are not my priority, but I'll leave the thread open.
@drunkpig Would you be interested in drafting a pull request?
@drunkpig Please go ahead then.
Hi @pieterhartel, these are corner cases, but it is indeed a metadata extraction problem.
Hi @basilioss, I can reproduce the issue. I assume it's necessary to add an additional XPath expression to target author names on YouTube.
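A minimal sketch of what such an additional expression could look like, using only the Python standard library. The sample markup, the `AUTHOR_XPATH` expression and the `extract_author` helper are assumptions for illustration, not code taken from Trafilatura, and real YouTube pages are neither well-formed XML nor stable in structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a video page.
# Real pages are not well-formed XML and would need an HTML parser.
SAMPLE = """
<html>
  <body>
    <div id="watch">
      <span itemprop="author">
        <link itemprop="url" href="https://www.youtube.com/@example"/>
        <link itemprop="name" content="Example Channel"/>
      </span>
    </div>
  </body>
</html>
"""

# Assumed expression: the name is stored in the "content" attribute
# of a <link> nested inside the element marked as author.
AUTHOR_XPATH = ".//span[@itemprop='author']/link[@itemprop='name']"


def extract_author(markup):
    """Return the author name matched by AUTHOR_XPATH, or None if absent."""
    tree = ET.fromstring(markup.strip())
    node = tree.find(AUTHOR_XPATH)
    if node is None:
        return None
    name = node.get("content", "").strip()
    return name or None


print(extract_author(SAMPLE))
```

The helper returns `None` rather than raising when nothing matches, so a generic extractor could fall back to its other expressions.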
I regularly add XPath expressions to address metadata issues, e.g. #567. I tried to fix this issue, but YouTube extraction is too variable for a generic extractor like Trafilatura, it...