xml extraction leads to <graphic> tags in the wrong place.

Open joschu opened this issue 3 years ago • 4 comments

import trafilatura

downloaded = trafilatura.fetch_url('https://en.wikipedia.org/wiki/Laplace_distribution')
ex = trafilatura.extract(downloaded, output_format='xml', include_images=True)
print(ex)

In several places, I see a <graphic>...</graphic> region in the wrong place. For example, trafilatura produces

    <p>where <graphic> is the generalized exponential integral function </graphic></p>

from image
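
One way to locate the affected spots in a larger output is to look for <graphic> elements that wrap text. The following is a minimal sketch using only the standard library; it assumes the XML output is well-formed, uses no namespaces, and that a correctly placed <graphic> carries its data in attributes and has no text content. The helper name find_nonempty_graphics and the sample string are hypothetical, with the sample modelled on the fragment quoted above.

```python
import xml.etree.ElementTree as ET


def find_nonempty_graphics(xml_text):
    """Return the text wrapped by each <graphic> element that is not empty."""
    root = ET.fromstring(xml_text)
    hits = []
    for graphic in root.iter("graphic"):
        # itertext() yields text inside the element, not the tail after it
        text = "".join(graphic.itertext()).strip()
        if text:
            hits.append(text)
    return hits


# Hypothetical sample built around the fragment reported in this issue
sample = (
    "<doc><p>where <graphic> is the generalized exponential "
    "integral function </graphic></p></doc>"
)
print(find_nonempty_graphics(sample))
```

Running the same helper on the string returned by trafilatura.extract(..., output_format='xml') should list every paragraph fragment that ended up inside a <graphic> element, under the assumptions stated above.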

joschu avatar Jan 08 '22 00:01 joschu

Hi @joschu, thanks for your feedback. I can reproduce the bug.

The HTML syntax behind Wikipedia's formulas is quite complex and it seems to confuse the extractor. This is how E_{n}() is coded:

<span class="mwe-math-mathml-inline mwe-math-mathml-a11y" style="display: none;"><math xmlns="http://www.w3.org/1998/Math/MathML" alttext="{\displaystyle E_{n}()}">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle scriptlevel="0" displaystyle="true">
        <msub>
          <mi>E</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>n</mi>
          </mrow>
        </msub>
        <mo stretchy="false">(</mo>
        <mo stretchy="false">)</mo>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle E_{n}()}</annotation>
  </semantics>
</math></span><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/293d17c59c25748cbad9dbc20020c1649a98313e" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -0.838ex; width:4.743ex; height:2.843ex;" alt="{\displaystyle E_{n}()}">
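
For illustration, the TeX source of such a formula can be read from the alt attribute of the fallback image. The sketch below is not part of trafilatura; it uses only the standard library, and the class name MathAltCollector as well as the choice to match on the class prefix mwe-math-fallback-image are assumptions based on the snippet above.

```python
from html.parser import HTMLParser


class MathAltCollector(HTMLParser):
    """Collect the alt text of Wikipedia math fallback images."""

    def __init__(self):
        super().__init__()
        self.formulas = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        # Assumption: math fallback images share this class prefix
        if any(c.startswith("mwe-math-fallback-image") for c in classes):
            self.formulas.append(attributes.get("alt") or "")


# The <img> element from the snippet above, with the style attribute omitted
snippet = (
    r'<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/'
    r'293d17c59c25748cbad9dbc20020c1649a98313e" '
    r'class="mwe-math-fallback-image-inline" aria-hidden="true" '
    r'alt="{\displaystyle E_{n}()}">'
)
collector = MathAltCollector()
collector.feed(snippet)
print(collector.formulas)
```

A pre-processing step along these lines could, in principle, swap each fallback image for its TeX string before extraction, but whether that interacts well with the extractor is untested here.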

adbar avatar Jan 10 '22 13:01 adbar

@adbar Is there any progress on this issue? Based on my testing, the alt text of Wikipedia formulas still cannot be parsed and kept in the output.

szhengac avatar Aug 15 '23 20:08 szhengac

@szhengac Nothing new at the moment, but the alt issue you're mentioning is only loosely related to this one. Can you give an example?

adbar avatar Aug 16 '23 10:08 adbar

@adbar If you run extract on the Wikipedia page https://en.wikipedia.org/wiki/Condorcet_paradox, you will find that the math equations are missing, e.g., the one near "Using the central limit theorem, we show that".

szhengac avatar Aug 16 '23 17:08 szhengac