Adrien Barbaresi
@naktinis Do you have anything new to report on this, and is it still an issue?
The approach looks good! The Readability port was already using lxml; I just simplified it and integrated it directly since it wasn't maintained anymore.
Generating bindings means additional maintenance work, but if someone wants to publish a Python package for go-readability I'd be happy to test it.
I lack the time to check the issue right now. You can provide a list of XPath expressions to the extraction function (`prune_xpath` parameter).
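To illustrate what pruning by XPath means, here is a minimal sketch using only the Python standard library. It is not Trafilatura's implementation (Trafilatura is built on lxml, which supports full XPath 1.0); `prune` is a hypothetical helper, the sample document is made up, and `xml.etree.ElementTree` only understands a limited XPath subset and well-formed markup.

```python
import xml.etree.ElementTree as ET


def prune(root, xpaths):
    """Remove every element matched by any of the given path expressions.

    Hypothetical helper for illustration only. Note that removing an
    element with ElementTree also drops its tail text.
    """
    for xpath in xpaths:
        # ElementTree has no parent pointers, so build a child -> parent map.
        # It is rebuilt for each expression because the tree changes.
        parents = {child: parent for parent in root.iter() for child in parent}
        for node in root.findall(xpath):
            parent = parents.get(node)
            if parent is not None:
                parent.remove(node)
    return root


# Made-up sample document: an ad block and a footer around the main text.
doc = ET.fromstring(
    "<html><body>"
    "<div class='ads'>Buy now</div>"
    "<p>Main text.</p>"
    "<footer>Site footer</footer>"
    "</body></html>"
)

pruned = prune(doc, [".//div[@class='ads']", ".//footer"])
remaining_text = "".join(pruned.itertext())
```

With the real library, the list of expressions would be passed through the `prune_xpath` parameter mentioned above instead of a helper like this one.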
I can answer questions on issues but not provide code snippets. If it is not in the documentation, just look at the tests.
Hi @drFerg, definitely, Trafilatura supports custom user-agent settings, courlan could also do so. The config file approach could be replicated here. Are you interested in drafting a pull request?
Hi @thsunkid, thanks for the detailed report and the example. We're talking about a web page which is very large (> 8Mb) and contains a lot of similar elements. This...
Hi @SamComber, the server response is not used by Htmldate, the header is the one in the HTML document where a meta tag is set to a very late date,...
Hi @TheCutestCat, when dates are found using HTML markup you get the time zone; when they are extracted from free text, regexes are applied. The regular expressions don't include time...
@georgedorn Thanks for your feedback, the documentation could indeed be extended.