trafilatura Include links and Include formatting do not work together properly

version: 1.7.0.

Please see Problem 3 below for the main issue I am reporting. The first two problems are included only to make sure I haven't completely misunderstood how the library is supposed to work. Sorry for the messy issue: it seems that any small change I make to the input completely changes the output.

Starting with the code:

from trafilatura import extract

html = """
<!DOCTYPE html>
<html>
<body>
	<div>
		<h1>
			This is the title of the page
		</h1>
		<p>
			This is a paragraph, and it contains a <em><a href="https://www.example.com/"> bolded link to some page</a>, some additional bolded text</em> and some text that is not bolded.
		</p>
	</div>
</body>
</html>
"""
    
result = extract(
    html,
    output_format="md",
    include_links=False,
    include_formatting=False,
)
print(result)

I get results as expected:

This is a paragraph, and it contains a bolded link to some page, some additional bolded text and some text that is not bolded.

Problem 1: Setting include_links=True does not change this output at all. I would expect the link to be rendered as a Markdown link, i.e. [text](url), but maybe I am misunderstanding what include_links does.

Problem 2: Setting include_formatting=True does not change the output either.

Problem 3 (main issue): Changing the <div> to <div class="content"> changes the behavior above. Now include_links and include_formatting each seem to work on their own, but the paragraph is always duplicated (see output below).

More importantly, if both include_formatting=True and include_links=True are set, all the bold text jumps to the end of the paragraph and the links are ignored.

Here is the code with changes applied to highlight the main issue I am reporting:

from trafilatura import extract

html = """
<!DOCTYPE html>
<html>
<body>
	<div class="content">
		<h1>
			This is the title of the page
		</h1>
		<p>
			This is a paragraph, and it contains a <em><a href="https://www.example.com/"> bolded link to some page</a>, some additional bolded text</em> and some text that is not bolded.
		</p>
	</div>
</body>
</html>
"""
    
result = extract(
    html,
    output_format="md",
    include_links=True,
    include_formatting=True,
)
print(result)

Output:

# This is the title of the page
This is a paragraph, and it contains a
* and some text that is not bolded.
bolded link to some page, some additional bolded text*
This is a paragraph, and it contains a
* and some text that is not bolded.
bolded link to some page, some additional bolded text*

Additional note: this seems to happen only when there is no space between the <em> and <a> tags. When a space is added, links and formatting are completely ignored.
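
To make the comparison easier to repeat, here is a minimal sketch that runs every combination of the two flags over the same input. The helper name run_option_matrix is hypothetical (it is not part of trafilatura), and extract_fn is assumed to accept the same keyword arguments as the extract() calls above.

```python
from itertools import product


def run_option_matrix(extract_fn, html, output_format="md"):
    """Call extract_fn once per (include_links, include_formatting) pair.

    extract_fn is assumed to take the same keyword arguments as the
    extract() calls in this report. run_option_matrix is a hypothetical
    helper for illustration, not a trafilatura function.
    """
    results = {}
    # product((False, True), repeat=2) yields all four flag combinations.
    for include_links, include_formatting in product((False, True), repeat=2):
        results[(include_links, include_formatting)] = extract_fn(
            html,
            output_format=output_format,
            include_links=include_links,
            include_formatting=include_formatting,
        )
    return results
```

With trafilatura installed, calling run_option_matrix(extract, html) and printing each value next to its key should show the four outputs side by side, including the duplicated paragraph when both flags are True.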

Feb 21 '24 13:02 ibestvina

Hi @ibestvina, this is a known issue. I'm not primarily working with these options and added them after feature requests, so the interaction between the options can be patchy at times. I'm open to accepting PRs on the topic.

Feb 21 '24 16:02 adbar

This is indeed a big issue, as anything with a link is not scraped, which leaves out a lot of the page. Are there any PRs on this that we can help complete? This is critical for a scraper.

Feb 29 '24 14:02 mertdeveci5

@mertdeveci5 There are no PRs at the moment, as it's not my main focus and nobody else seems to be contributing on this. Do you need both formatting and links? Links alone work fine, and that would be the critical function for a scraper, e.g. in an SEO context (where Trafilatura is used).

Mar 01 '24 12:03 adbar

Just the links. To give you the full context: I tried to scrape jam.dev/careers

Trafilatura can scrape everything except the links at the bottom, where the actual job postings are listed. I tried it with a lot of websites, but for half of them it did not work. I couldn't figure out whether I am doing something wrong.
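
A rough way to quantify which links go missing, as a stdlib-only sketch: collect the href values from the raw HTML and compare them with the links found in the extracted text. The names HrefCollector and missing_links are hypothetical, and the regular expression assumes that links in the extracted text are written as [label](url); relative URLs that get rewritten during extraction would show up as missing.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Assumption: links in the extracted text look like [label](url).
MD_LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def missing_links(raw_html, extracted_text):
    """Return hrefs present in raw_html but absent from extracted_text."""
    collector = HrefCollector()
    collector.feed(raw_html)
    # extract() may return None, so fall back to an empty string.
    kept = set(MD_LINK.findall(extracted_text or ""))
    return [href for href in collector.hrefs if href not in kept]
```

Passing the downloaded page and the result of extract(..., include_links=True) to missing_links would list the URLs that did not make it into the output.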

Mar 01 '24 16:03 mertdeveci5

This is another issue then, not a problem between extraction options but (probably) a case where the extractor misses the relevant section of the page.

edit: see #518

Mar 01 '24 17:03 adbar
