trafilatura
Include links and Include formatting do not work together properly
version: 1.7.0.
Please see Problem 3 below for the main issue I am reporting. The first two problems are included only to check that I haven't completely misunderstood how the library is supposed to work. Apologies for the messy report: it seems that any small change to the input completely changes the output.
Starting with the code:
from trafilatura import extract

html = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>
This is the title of the page
</h1>
<p>
This is a paragraph, and it contains a <em><a href="https://www.example.com/"> bolded link to some page</a>, some additional bolded text</em> and some text that is not bolded.
</p>
</div>
</body>
</html>
"""
result = extract(
html,
output_format="md",
include_links=False,
include_formatting=False,
)
print(result)
I get results as expected:
This is a paragraph, and it contains a bolded link to some page, some additional bolded text and some text that is not bolded.
Problem 1
Setting include_links=True does not change this output at all. I would expect the link to be included as a Markdown link, but maybe I am misunderstanding what include_links does.
Problem 2
Setting include_formatting=True does not change the output either.
Problem 3 (main issue)
Changing the opening tag to <div class="content"> changes the behavior above: include_links and include_formatting now seem to work on their own, but the paragraph is always duplicated (see output below).
More importantly, if both include_formatting=True and include_links=True are set, all the bold text jumps to the end of the paragraph and the links are ignored.
Here is the code with changes applied to highlight the main issue I am reporting:
html = """
<!DOCTYPE html>
<html>
<body>
<div class="content">
<h1>
This is the title of the page
</h1>
<p>
This is a paragraph, and it contains a <em><a href="https://www.example.com/"> bolded link to some page</a>, some additional bolded text</em> and some text that is not bolded.
</p>
</div>
</body>
</html>
"""
result = extract(
html,
output_format="md",
include_links=True,
include_formatting=True,
)
print(result)
Output:
# This is the title of the page
This is a paragraph, and it contains a
* and some text that is not bolded.
bolded link to some page, some additional bolded text*
This is a paragraph, and it contains a
* and some text that is not bolded.
bolded link to some page, some additional bolded text*
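The duplication in this output can be confirmed mechanically. The following is a minimal sketch using only the standard library; the helper name repeated_lines is hypothetical, and the sample text is the output copied from this report, not a fresh run of the library.

```python
from collections import Counter


def repeated_lines(text: str) -> dict[str, int]:
    """Return each non-empty line that occurs more than once, with its count."""
    counts = Counter(line.strip() for line in text.splitlines() if line.strip())
    return {line: n for line, n in counts.items() if n > 1}


# Output copied verbatim from the report (not regenerated here).
reported_output = """\
# This is the title of the page
This is a paragraph, and it contains a
* and some text that is not bolded.
bolded link to some page, some additional bolded text*
This is a paragraph, and it contains a
* and some text that is not bolded.
bolded link to some page, some additional bolded text*
"""

duplicates = repeated_lines(reported_output)
for line, n in duplicates.items():
    print(f"{n}x {line}")
```

Every paragraph line is reported twice while the title line is not, which matches the duplication described in Problem 3.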
Additional note: the behavior shown in this output seems to occur only if there is no space between <em> and <a. When a space is added, links and formatting are completely ignored.
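To make the cases in this report easier to compare, here is a rough sketch that runs every combination of include_links and include_formatting against both HTML variants (with and without a space between <em> and <a). The helpers make_html, run_matrix and main are hypothetical names introduced for illustration; the extract keyword arguments are the ones used in this report, and output_format="md" is copied from it as-is, so whether that value is accepted may depend on the installed version.

```python
from itertools import product


def make_html(space_before_link: bool) -> str:
    """Build the test page from the report, optionally with a space between <em> and <a>."""
    gap = " " if space_before_link else ""
    return f"""
<!DOCTYPE html>
<html>
<body>
<div class="content">
<h1>
This is the title of the page
</h1>
<p>
This is a paragraph, and it contains a <em>{gap}<a href="https://www.example.com/"> bolded link to some page</a>, some additional bolded text</em> and some text that is not bolded.
</p>
</div>
</body>
</html>
"""


def run_matrix(extract_fn, html: str) -> dict:
    """Call extract_fn once per (include_links, include_formatting) combination."""
    results = {}
    for links, formatting in product([False, True], repeat=2):
        results[(links, formatting)] = extract_fn(
            html,
            output_format="md",  # value copied from the report
            include_links=links,
            include_formatting=formatting,
        )
    return results


def main() -> None:
    # Imported lazily so the helpers above can be used without trafilatura installed.
    from trafilatura import extract

    for space in (False, True):
        print(f"=== space between <em> and <a>: {space} ===")
        for (links, formatting), text in run_matrix(extract, make_html(space)).items():
            print(f"--- include_links={links}, include_formatting={formatting} ---")
            print(text)


# Call main() with trafilatura installed to print all eight outputs.
```

Passing the extraction function in as a parameter keeps the comparison loop independent of the library, so the same loop could be reused to compare versions.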
Hi @ibestvina, this is a known issue. I'm not primarily working with these options and added them after feature requests, so the interaction between options can be patchy at times. I'm open to accepting PRs on the topic.
This is indeed a big issue, as anything with a link is not scraped, which leaves out a lot of the page. Are there any PRs on this that we could help complete? This is critical for a scraper.
@mertdeveci5 There are no PRs at the moment, as it's not my main focus and nobody else seems to be contributing on this. Do you need both formatting and links? Links alone work fine, and that would be the critical function for a scraper, e.g. in an SEO context (where Trafilatura is used).
Links themselves. To give you the full context: I tried to scrape jam.dev/careers. Trafilatura can scrape everything except the links at the bottom, where the actual job postings are listed. I tried it with a lot of websites, but for half of them it did not work. I couldn't figure out whether I am doing something wrong.
This is another issue then, not a problem between extraction options but (probably) a case where the extractor misses the relevant section of the page.
edit: see #518