trafilatura xml extraction leads to <graphic> tags in the wrong place.

Open joschu opened this issue 3 years ago • 4 comments

import trafilatura

# Fetch the page and extract it as XML, keeping images
downloaded = trafilatura.fetch_url('https://en.wikipedia.org/wiki/Laplace_distribution')
ex = trafilatura.extract(downloaded, output_format='xml', include_images=True)
print(ex)

In several places, I see a <graphic>...</graphic> region in the wrong place. For example, trafilatura produces

    <p>where <graphic> is the generalized exponential integral function </graphic></p>

from
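To locate every affected spot rather than reading the output by eye, a small check along these lines might help. It is only a sketch: it assumes that a correctly placed graphic element is empty, so any graphic that wraps text is treated as suspicious. The helper name `graphics_with_text` is made up for illustration, and the sample string just wraps the line quoted above in a root element.

```python
import xml.etree.ElementTree as ET


def graphics_with_text(xml_output):
    """Return the text wrapped by each <graphic> element that is not empty.

    Assumption: an image element should carry no text of its own, so
    any text found inside it has probably been misplaced.
    """
    root = ET.fromstring(xml_output)
    suspicious = []
    for graphic in root.iter("graphic"):
        text = (graphic.text or "").strip()
        if text:
            suspicious.append(text)
    return suspicious


# The paragraph from the report, wrapped in a root element for parsing.
sample = (
    "<doc><p>where <graphic> is the generalized exponential "
    "integral function </graphic></p></doc>"
)
print(graphics_with_text(sample))
```

Passing the `ex` string from the script above to `graphics_with_text` should list the same kind of hits for the whole page, provided the output parses as XML.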

Jan 08 '22 00:01 joschu

Hi @joschu, thanks for your feedback. I can reproduce the bug.

The HTML syntax behind Wikipedia's formulas is quite complex and it seems to confuse the extractor. This is how E_{n}() is coded:

<span class="mwe-math-mathml-inline mwe-math-mathml-a11y" style="display: none;"><math xmlns="http://www.w3.org/1998/Math/MathML" alttext="{\displaystyle E_{n}()}">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle scriptlevel="0" displaystyle="true">
        <msub>
          <mi>E</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mi>n</mi>
          </mrow>
        </msub>
        <mo stretchy="false">(</mo>
        <mo stretchy="false">)</mo>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle E_{n}()}</annotation>
  </semantics>
</math></span><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/293d17c59c25748cbad9dbc20020c1649a98313e" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -0.838ex; width:4.743ex; height:2.843ex;" alt="{\displaystyle E_{n}()}">
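Note that the TeX source appears three times in this markup: in the alttext attribute of the math element, in the annotation element, and in the alt attribute of the fallback image. As a rough illustration of how the formula could be recovered from the raw HTML, the sketch below collects the content of the TeX annotation elements with Python's standard `html.parser`. The class and function names are hypothetical, and nothing here reflects how trafilatura itself processes the page.

```python
from html.parser import HTMLParser


class TexAnnotationCollector(HTMLParser):
    """Collect the text of <annotation encoding="application/x-tex"> elements."""

    def __init__(self):
        super().__init__()
        self.formulas = []
        self._chunks = None  # list of text pieces while inside an annotation

    def handle_starttag(self, tag, attrs):
        if tag == "annotation" and dict(attrs).get("encoding") == "application/x-tex":
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "annotation" and self._chunks is not None:
            self.formulas.append("".join(self._chunks).strip())
            self._chunks = None


def collect_tex(html):
    parser = TexAnnotationCollector()
    parser.feed(html)
    parser.close()
    return parser.formulas


# Shortened version of the markup shown above.
snippet = (
    '<span style="display: none;"><math alttext="{\\displaystyle E_{n}()}">'
    "<semantics><mrow><mi>E</mi></mrow>"
    '<annotation encoding="application/x-tex">{\\displaystyle E_{n}()}</annotation>'
    "</semantics></math></span>"
)
print(collect_tex(snippet))
```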

Jan 10 '22 13:01 adbar

@adbar Is there any progress on this issue? Based on my testing, the alt text of Wikipedia formulas still cannot be parsed and kept in the output.

Aug 15 '23 20:08 szhengac

@szhengac Nothing new at the moment, but the alt issue you're mentioning is only loosely related to this one. Can you give an example?

Aug 16 '23 10:08 adbar

@adbar If you run extract on the Wikipedia page https://en.wikipedia.org/wiki/Condorcet_paradox, you will find that the math equations are missing, e.g. the one near "Using the central limit theorem, we show that".
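Until the extractor handles this, one possible workaround is to rewrite the HTML before extraction so that each formula image is replaced by its alt text. The sketch below does this with the standard library only. It assumes that formula images carry a class starting with `mwe-math-fallback-image`, as the inline variant quoted earlier in this thread does; the names `MathAltInliner` and `inline_math_alt` are made up for illustration and are not part of trafilatura.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: every formula image has a class with this prefix.
MATH_IMG_PREFIX = "mwe-math-fallback-image"


class MathAltInliner(HTMLParser):
    """Re-emit HTML, replacing formula fallback images with their alt text."""

    def __init__(self):
        # Leave character references undecoded so they are copied through.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        is_math = any(c.startswith(MATH_IMG_PREFIX) for c in classes)
        if tag == "img" and is_math and attr.get("alt"):
            # Drop the image and keep its (re-escaped) alt text instead.
            self.parts.append(escape(attr["alt"], quote=False))
        else:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # The raw text of a self-closing tag already ends with "/>".
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def inline_math_alt(html):
    parser = MathAltInliner()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

The result can then be passed on, for example `trafilatura.extract(inline_math_alt(downloaded), output_format='xml')`. The formulas would show up as raw TeX strings in the text, and no image element would be produced for them, so this trades the images for readable formula source.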

Aug 16 '23 17:08 szhengac
