Adrien Barbaresi comments

Results: 412 comments of Adrien Barbaresi

Few issues with tests.

Hi @majcl, I have not run the tests for a while; the last time I checked, they worked. 1. Certain packages need to be updated or certain lines changed over...

Option to remove unreachable pages and pages not strictly in the same domain

Hi @MTB-nsartor, thanks for your feedback! 1. The command-line interface keeps track of failed downloads. I haven't included it in the Python library so far, but as you say it's...

Option to remove unreachable pages and pages not strictly in the same domain

The second part (the actual bug) is now fixed; I'll release a new version of the underlying library ([courlan](https://github.com/adbar/courlan)) soon. Yes, feel free to draft a PR for the `downloads._handle_response`...

Check URLs passed to courlan functions `extract_links` and `fix_relative_urls`

This could also happen by calling the lxml method `.make_links_absolute`:

```
if include_links:
    tree.make_links_absolute(url, resolve_base_href=False)
```
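The kind of check suggested in the issue title could look roughly like the sketch below. It uses only the standard library's `urllib.parse`; the helper names `is_absolute_http_url` and `resolve_link` are hypothetical and are not part of courlan, trafilatura, or lxml. The idea is simply to refuse to resolve relative links when the base URL is missing or malformed.

```python
from urllib.parse import urljoin, urlsplit


def is_absolute_http_url(url):
    """Return True if url is a string with an http(s) scheme and a host.

    Hypothetical helper for illustration, not a courlan function.
    """
    if not isinstance(url, str):
        return False
    parts = urlsplit(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def resolve_link(base_url, link):
    """Resolve link against base_url, or return None if the base is unusable.

    Hypothetical helper for illustration: the base URL is validated first
    so that a missing or malformed value is caught before any joining.
    """
    if not is_absolute_http_url(base_url):
        return None
    return urljoin(base_url, link)
```

A caller could apply the same guard before invoking a link-rewriting step, skipping it whenever `resolve_link` would return `None`.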

probe_alternative_homepage no_ssl arg from fetch_url

Hi @hyshandler, thanks for your feedback. I cannot reproduce the bug; maybe your version of `certifi` isn't up to date. Regardless of this particular webpage, it could make sense to use a...

Image markdown not included during processing

Hi @kianwilcox, the package mainly focuses on text and tries to get rid of unnecessary sections. It could be that the central images are wrongly considered to be off-limits or...

Image markdown not included during processing

The extraction involves cleaning the HTML and converting elements; maybe the images get discarded here: https://github.com/adbar/trafilatura/blob/123414cae5f927e743f5eced2cd43b81a65fc43c/trafilatura/htmlprocessing.py#L48 Or maybe they are not handled correctly by this part of the code:...

