Adrien Barbaresi
Hi @majcl, I have not run the tests for a while; the last time I checked, they worked. 1. Certain packages need to be updated or certain lines changed over...
Hi @MTB-nsartor, thanks for your feedback! 1. The command-line interface keeps track of failed downloads. I didn't include it in the Python library so far but as you say it's...
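Until something like this is part of the Python library, failed downloads can be tracked on the caller's side. The following is only a minimal sketch: `fetch_all` and its `fetch` parameter are hypothetical names, not part of trafilatura, and `fetch` stands for any download function that returns `None` or raises when a page cannot be retrieved.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Download every URL and keep a list of the ones that failed.

    `fetch` is a hypothetical stand-in for the real download function;
    a return value of None or a raised exception counts as a failure.
    """
    results: Dict[str, str] = {}
    failed: List[str] = []
    for url in urls:
        try:
            content = fetch(url)
        except Exception:
            # Treat any error raised by the downloader as a failed download.
            content = None
        if content is None:
            failed.append(url)
        else:
            results[url] = content
    return results, failed
```

The `failed` list can then be retried later or written to a file, depending on what the calling code needs.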
The second part (the actual bug) is now fixed, I'll release a new version of the underlying library ([courlan](https://github.com/adbar/courlan)) soon. Yes, feel free to draft a PR for the `downloads._handle_response`...
This could also happen by calling the lxml method `.make_links_absolute`:

```
if include_links:
    tree.make_links_absolute(url, resolve_base_href=False)
```
Hi @hyshandler, thanks for your feedback. I cannot reproduce the bug; maybe your version of `certifi` isn't up to date. Regardless of this particular webpage, it could make sense to use a...
Hi @kianwilcox, the package mainly focuses on text and tries to get rid of unnecessary sections. It could be that the central images are wrongly considered to be off-limits or...
Extraction involves cleaning the HTML and converting elements. Maybe the images get discarded here: https://github.com/adbar/trafilatura/blob/123414cae5f927e743f5eced2cd43b81a65fc43c/trafilatura/htmlprocessing.py#L48 Or maybe they are not handled correctly by this part of the code:...
There are definitely missing images, see discussion above.
I may use different heuristics but this is open for debate.
My bad, I answered question 2. Title extraction roughly works as you described in your list, with two additional steps: - First and foremost HTML meta or JSON information is...