Adrien Barbaresi

412 comments by Adrien Barbaresi

My bad: the bug occurs when Trafilatura is used from Python, whereas the CLI suppresses the error.

That's correct, there is something wrong with relative link processing here.

Google is blacklisted by the underlying courlan package. This can be bypassed by passing the `strict=False` parameter to the `extract_links()` function in the spider module.

@cjgalvin There might be a problem with the `urllib3` dependency on this page. Try installing the optional `pycurl` package (which Trafilatura supports seamlessly); it is often better and faster.

It is debatable whether titles are part of the main content, and they are not always included in benchmarks. That being said, the main title should also be present in the...

Thanks. This happens because no space is inserted between the element's text and the text that follows directly after it without being nested in an element (its tail, in LXML terms).
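To make the text/tail distinction concrete, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree`, which follows the same text/tail model as LXML, and a made-up HTML fragment; it is not taken from the page discussed in the issue and does not reproduce Trafilatura's own code.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: "before asking." comes right after </b>,
# so it is stored as the tail of <b>, not as its text.
root = ET.fromstring("<p>Read the <b>manual</b>before asking.</p>")
bold = root.find("b")

print(repr(bold.text))  # text inside the element
print(repr(bold.tail))  # text directly after the closing tag

# Concatenating text and tail as-is adds no space, so the words run together.
joined = "".join(root.itertext())
print(joined)
```

The fused words in `joined` are the kind of output described above: nothing separates an element's text from its tail unless the extractor adds a separator itself.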

Yes, I also think it's about defining tags (like ``) for which it is beneficial to insert a space.
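One way to picture such a tag list is the sketch below. The tag set and the helper names (`SPACE_AFTER`, `text_with_spaces`, `normalized`) are assumptions made up for illustration; which tags actually belong in the list is the open question, and this is not how Trafilatura implements it.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag set: elements after which a space is inserted.
SPACE_AFTER = {"li", "td", "br"}

def text_with_spaces(elem):
    """Collect an element's text, adding a space after selected child tags."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(text_with_spaces(child))
        if child.tag in SPACE_AFTER:
            parts.append(" ")  # separator between the child and its tail
        parts.append(child.tail or "")
    return "".join(parts)

def normalized(elem):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text_with_spaces(elem).split())

items = ET.fromstring("<ul><li>one</li><li>two</li>three</ul>")
inline = ET.fromstring("<p>a<b>b</b>c</p>")

print(normalized(items))   # list items are separated
print(normalized(inline))  # inline markup is left untouched
```

Keeping the list explicit means inline elements such as `<b>` do not split words, while block-like elements get a separator.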

@alroythalus I just tested the GitHub example and the links are in the XML output. Here is a small example: ``` To remove content or information you have publicly posted,...

Hi @Psynbiotik, thanks for the detailed bug report. Trafilatura is geared towards real-world cases, and synthetic examples do not always work well. The problem here is that the text is...

Then it would be helpful to isolate the problem so that I can reproduce it. In your example, both examples are linked to one another.