
`include_images` changes text extraction

Open carschno opened this issue 4 years ago • 5 comments

Trafilatura version: 1.2.0

I have noticed that adding the include_images=True argument to trafilatura.extract() changes the output text.

To reproduce it:

import trafilatura
from trafilatura import fetch_url

url = "https://www.tropenmuseum.nl/nl/zien-en-doen/bisjpalen-restauratie"

In [43]: trafilatura.bare_extraction(fetch_url(url), include_images=False)
Out[43]: 
{'title': 'Bisjpalen restauratie',
 'author': None,
 'url': 'https://www.tropenmuseum.nl/nl/zien-en-doen/bisjpalen-restauratie',
 'hostname': 'tropenmuseum.nl',
 'description': 'Een bijzondere restauratie van twaalf gigantische bisjpalen in de monumentale Lichthal.',
 'sitename': 'Tropenmuseum in Amsterdam',
 'date': '2022-01-01',
 'categories': [],
 'tags': [],
 'fingerprint': None,
 'id': None,
 'license': None,
 'body': None,
 'comments': '',
 'commentsbody': None,
 'raw_text': None,
 'text': 'Deze rituele palen uit de Indonesische provincie Papoea maken onderdeel uit van de wereldberoemde Nieuw-Guinea collectie van het museum. De bisjpalen collectie is bijzonder, omdat deze in het land van herkomst niet bewaard worden: de palen worden normaliter na afloop van de ceremonie in het moeras achtergelaten om weg te rotten\nBijzondere restauratie\nTropenmuseum restaureerde twaalf gigantische bisjpalen in de monumentale Lichthal.\nHerkomst\nBisjfeest\nBisjpalen zijn boomstammen waarin met het houtsnijwerk overleden dorpsgenoten zijn uitgebeeld. De palen worden gebruikt om de doden te eren tijdens een ‘bisjfeest’. Halverwege de vorige eeuw ontstond de angst dat het ritueel zou uitsterven, waardoor Tropenmuseum en Wereldmuseum besloten om grootscheeps bisjpalen aan te kopen. In Nederland bevindt zich nu de grootste collectie bisjpalen ter wereld\nRestauratie\nDe rituele palen zijn in de loop der jaren door stof, vocht en insecten aangetast. De restauratoren reinigen heel voorzichtig het kwetsbare oppervlak van de palen en zetten daarna de verf opnieuw vast. Tijdens de restauratie zijn de bisjpalen van dichtbij te bekijken. Bezoekers kunnen via een beeldscherm meekijken met het beeld van de microscoop; een uitgelezen kans om een museumobject met andere ogen te ervaren!'}

In [44]: trafilatura.bare_extraction(fetch_url(url), include_images=True)
Out[44]: 
{'title': 'Bisjpalen restauratie',
 'author': None,
 'url': 'https://www.tropenmuseum.nl/nl/zien-en-doen/bisjpalen-restauratie',
 'hostname': 'tropenmuseum.nl',
 'description': 'Een bijzondere restauratie van twaalf gigantische bisjpalen in de monumentale Lichthal.',
 'sitename': 'Tropenmuseum in Amsterdam',
 'date': '2022-01-01',
 'categories': [],
 'tags': [],
 'fingerprint': None,
 'id': None,
 'license': None,
 'body': None,
 'comments': '',
 'commentsbody': None,
 'raw_text': None,
 'text': 'Deze rituele palen uit de Indonesische provincie Papoea maken onderdeel uit van de wereldberoemde Nieuw-Guinea collectie van het museum. De bisjpalen collectie is bijzonder, omdat deze in het land van herkomst niet bewaard worden: de palen worden normaliter na afloop van de ceremonie in het moeras achtergelaten om weg te rotten\n/sites/default/files/styles/hero/public/bisjpalen.jpg?h=e442ce2f&itok=Jaeowj5G Tropenmuseum. Bisjpalen restauratie.'}

Note that the value for text is different. When images are included, the text stops shortly after the first (in this case: only) image.

This may be related to #51, but no exception is raised here.

Apr 12 '22 09:04 carschno
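
The point where the two `text` values diverge can be located with a small helper. This is a minimal sketch using only the standard library; `first_divergence` is a hypothetical name, and the two strings are shortened stand-ins for the outputs in the report, not the full extraction results.

```python
from os.path import commonprefix


def first_divergence(a: str, b: str) -> int:
    """Return the length of the longest common prefix of a and b."""
    # commonprefix compares character by character, so it works on plain strings.
    return len(commonprefix([a, b]))


# Shortened stand-ins for the two 'text' values in the report.
without_images = "om weg te rotten\nBijzondere restauratie\nTropenmuseum restaureerde"
with_images = "om weg te rotten\n/sites/default/files/styles/hero/public/bisjpalen.jpg"

idx = first_divergence(without_images, with_images)
print(repr(without_images[idx:idx + 25]))  # what follows without images
print(repr(with_images[idx:idx + 25]))     # what follows with images
```

With the full outputs in place of the stand-ins, the same call would show that both texts agree up to the end of the first paragraph and differ from the image reference onward.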

Hi @carschno, I can reproduce the bug. Extraction with images isn't my priority but I'll try to look into it.

Apr 12 '22 12:04 adbar

@adbar Thanks! In case you have a pointer to the potentially relevant piece of the code, I might be able to investigate myself and create a PR (depending on how deeply the issue is rooted).

I understand that this behaviour is definitely not expected, right?

Apr 12 '22 12:04 carschno

No, it isn't expected, but the cause looks quite convoluted. The backup algorithm (an internal fork of readability-lxml, but identical here) triggers the error:

  • Without images, the backup algorithm is used and everything is fine (that's the case I'm evaluating).
  • With images, the heuristics of the backup algorithm don't work the same way, and the HTML sections around the images are (I guess) discarded. That's logical, since images are often associated with undesirable content; I assume it's an unfortunate borderline case here.

If you want to look at the code, here are the sections concerned:

https://github.com/adbar/trafilatura/blob/146506a2a18e5ca99ee8cd9a5779cc9f137697aa/trafilatura/external.py#L34

https://github.com/adbar/trafilatura/blob/master/trafilatura/readability_lxml.py

You could maybe look into what happens to img elements in the latter.

Apr 12 '22 12:04 adbar

Digging deeper into this error, this part of the HTML looks suspicious to me, in particular the | symbols in the srcset attributes of the three last sources:

      <div class="field field--name-field-hero-image field--type-image field--label-hidden field__items">
              <div class="field__item">    <picture>
                  <source srcset="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen.webp 1x" media="screen and (max-width: 767px)" type="image/webp">
              <source srcset="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen_002.webp 1x" media="screen and (min-width: 768px) and (max-width: 992px)" type="image/webp">
              <source srcset="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen_002.webp 1x" media="screen and (min-width: 993px)" type="image/webp">
              <source srcset="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen.jpg 1x" media="screen and (max-width: 767px)" type="image/jpeg">
              <source srcset="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen_002.jpg 1x" media="screen and (min-width: 768px) and (max-width: 992px)" type="image/jpeg">
              <source srcset="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen_002.jpg 1x" media="screen and (min-width: 993px)" type="image/jpeg">
                  <img src="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen_002.jpg" alt="Tropenmuseum. Bisjpalen restauratie. " typeof="foaf:Image">

However, this is visible to me only when I save the page locally. When the page is parsed in the browser (Firefox in my case), this part looks as follows in the Web Developer console:


                  <source srcset="/sites/default/files/styles/hero_mobile/public/bisjpalen.webp?h=c7551848&amp;itok=ajmSsyac 1x" media="screen and (max-width: 767px)" type="image/webp">
              <source srcset="/sites/default/files/styles/hero/public/bisjpalen.webp?h=e442ce2f&amp;itok=Jaeowj5G 1x" media="screen and (min-width: 768px) and (max-width: 992px)" type="image/webp">
              <source srcset="/sites/default/files/styles/hero/public/bisjpalen.webp?h=e442ce2f&amp;itok=Jaeowj5G 1x" media="screen and (min-width: 993px)" type="image/webp">
              <source srcset="/sites/default/files/styles/hero_mobile/public/bisjpalen.jpg?h=c7551848&amp;itok=ajmSsyac 1x" media="screen and (max-width: 767px)" type="image/jpeg">
              <source srcset="/sites/default/files/styles/hero/public/bisjpalen.jpg?h=e442ce2f&amp;itok=Jaeowj5G 1x" media="screen and (min-width: 768px) and (max-width: 992px)" type="image/jpeg">
              <source srcset="/sites/default/files/styles/hero/public/bisjpalen.jpg?h=e442ce2f&amp;itok=Jaeowj5G 1x" media="screen and (min-width: 993px)" type="image/jpeg">
                  <img src="/sites/default/files/styles/hero/public/bisjpalen.jpg?h=e442ce2f&amp;itok=Jaeowj5G" alt="Tropenmuseum. Bisjpalen restauratie. " typeof="foaf:Image">

I am not very familiar with how this JavaScript/HTML parsing works, but my guess is that Trafilatura (or the underlying XML parser) tries to parse the plain HTML code and fails when it hits the | symbols, or something similar.

Does that make any sense at all?

Apr 22 '22 12:04 carschno
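
One possible reading of the difference between the two snippets is that, in the saved copy, the image paths point to a local folder whose name contains the bar. The sketch below only decodes the percent-escapes with the standard library. That the folder was created by the browser when saving the page is an assumption, not something confirmed in this thread.

```python
from urllib.parse import unquote

# Path copied from the locally saved HTML quoted in the comment.
saved_src = (
    "Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/"
    "bisjpalen.webp"
)

# Decode %20 and friends, then split the folder from the file name.
decoded = unquote(saved_src)
folder, _, filename = decoded.rpartition("/")

print(folder)    # the bar sits in the folder name, not in the file name
print(filename)
```

If that reading holds, the bars would be absent from the HTML that `fetch_url` downloads, which matches the paths shown by the browser console.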

I could be wrong, but I don't see any line in the code that could be affected by that. The vertical bars are between quotation marks, so they are part of the image source just like any other symbol.

Apr 22 '22 14:04 adbar
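
The point about quoted attribute values can be checked in isolation. The following is a minimal sketch that uses Python's standard-library `html.parser`, which is not the parser Trafilatura itself relies on, so it illustrates the general HTML behaviour rather than Trafilatura's code path. `SrcsetCollector` is a hypothetical helper name, and the snippet is a shortened version of the saved HTML quoted earlier in the thread.

```python
from html.parser import HTMLParser


class SrcsetCollector(HTMLParser):
    """Collect srcset/src attribute values from <source> and <img> tags."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if tag in ("source", "img") and name in ("srcset", "src"):
                self.values.append(value)


# Shortened version of the saved HTML, with the bars inside quoted attributes.
snippet = (
    "<picture>"
    '<source srcset="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20'
    'Amsterdam_files/bisjpalen.webp 1x" type="image/webp">'
    '<img src="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20'
    'Amsterdam_files/bisjpalen_002.jpg" alt="Tropenmuseum. Bisjpalen restauratie. ">'
    "</picture>"
)

collector = SrcsetCollector()
collector.feed(snippet)
collector.close()

for value in collector.values:
    print(value)  # each value still contains the bar, unchanged
```

Both attribute values come back intact, bar included, which is consistent with the bars being treated like any other character inside the quotes.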