Adrien Barbaresi comments

Results: 412 comments of Adrien Barbaresi

ValueError in xml

My bad: the bug occurs when Trafilatura is used from Python, whereas the CLI suppresses the error.

Crawler doesn't extract any links from Google Cloud documentation website

That's correct, there is something wrong with relative link processing here.

Crawler doesn't extract any links from Google Cloud documentation website

Google is blacklisted by the underlying courlan package. This can be bypassed by passing the `strict=False` parameter to the `extract_links()` function in the spider module.

Crawler doesn't extract any links from Google Cloud documentation website

@cjgalvin There might be a problem with the `urllib3` dependency on this page. Try installing the optional `pycurl` package, which Trafilatura supports seamlessly; it is often better and faster.

Missing h1 heading if <header> outside of <article>

It is debatable whether titles are part of the main content; they are not always included in benchmarks. That being said, the main title should also be present in the...

Investigate spacing in element tails

Thanks, this is because no space is inserted between the element text and what follows directly after (without being nested in an element, a tail in LXML).
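To illustrate the text/tail distinction described above, here is a minimal sketch using the standard library's `xml.etree.ElementTree`, which follows the same `text`/`tail` model as lxml. The fragment is made up for illustration, and the space-joining at the end is only one possible workaround, not Trafilatura's actual code.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: "world" follows </b> directly, so it is stored
# as the tail of <b> rather than as the text of any element.
root = ET.fromstring("<p><b>Hello</b>world</p>")
bold = root[0]

print(repr(bold.text))  # text inside <b>
print(repr(bold.tail))  # text directly after </b>

# Naive concatenation glues element text and tail together.
joined = "".join(root.itertext())
print(joined)

# Joining the non-empty pieces with a space keeps the words apart.
spaced = " ".join(piece.strip() for piece in root.itertext() if piece.strip())
print(spaced)
```

Inserting a space unconditionally would be wrong for inline markup inside a word, which is why the choice of tags matters.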

Investigate spacing in element tails

Yes, I also think it's about defining tags (like ``) for which it is beneficial to insert a space.

links/urls are not appearing using extract

@alroythalus I just tested the GitHub example and the links are in the XML output, here is a small example: ``` To remove content or information you have publicly posted,...

Faulty extraction for very short documents

Hi @Psynbiotik, thanks for the detailed bug report. Trafilatura is geared towards real-world cases, and synthetic examples do not always work well. The problem here is that the text is...

Faulty extraction for very short documents

Then it would be interesting to isolate the problem so that I can reproduce it. In your example, both cases are linked to one another.