trafilatura
XML extraction leads to <graphic> tags in the wrong place.
import trafilatura
downloaded = trafilatura.fetch_url('https://en.wikipedia.org/wiki/Laplace_distribution')
ex = trafilatura.extract(downloaded, output_format='xml', include_images=True)
print(ex)
In several places, I see a <graphic>...</graphic> region in the wrong place.
For example, trafilatura produces
<p>where <graphic> is the generalized exponential integral function </graphic></p>
from
Hi @joschu, thanks for your feedback. I can reproduce the bug.
The HTML syntax behind Wikipedia's formulas is quite complex and it seems to confuse the extractor. This is how E_{n}() is coded:
<span class="mwe-math-mathml-inline mwe-math-mathml-a11y" style="display: none;"><math xmlns="http://www.w3.org/1998/Math/MathML" alttext="{\displaystyle E_{n}()}">
<semantics>
<mrow class="MJX-TeXAtom-ORD">
<mstyle scriptlevel="0" displaystyle="true">
<msub>
<mi>E</mi>
<mrow class="MJX-TeXAtom-ORD">
<mi>n</mi>
</mrow>
</msub>
<mo stretchy="false">(</mo>
<mo stretchy="false">)</mo>
</mstyle>
</mrow>
<annotation encoding="application/x-tex">{\displaystyle E_{n}()}</annotation>
</semantics>
</math></span><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/293d17c59c25748cbad9dbc20020c1649a98313e" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -0.838ex; width:4.743ex; height:2.843ex;" alt="{\displaystyle E_{n}()}">
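To illustrate where the formula source sits in this markup, here is a minimal sketch that pulls the alt text out of the fallback images using only Python's standard library. It is not part of trafilatura and not a fix for the extractor: the names MathAltCollector and collect_math_alts are hypothetical, and matching on the class prefix "mwe-math-fallback-image" is an assumption based on the snippet above.

```python
from html.parser import HTMLParser


class MathAltCollector(HTMLParser):
    """Hypothetical helper: collect alt text from formula fallback images."""

    def __init__(self):
        super().__init__()
        self.formulas = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        # Assumption: formula images carry a class starting with this prefix,
        # as in the Wikipedia snippet quoted in this thread.
        if any(c.startswith("mwe-math-fallback-image") for c in classes):
            alt = attr_map.get("alt")
            if alt:
                self.formulas.append(alt)


def collect_math_alts(html):
    """Return the alt strings of all formula fallback images in `html`."""
    parser = MathAltCollector()
    parser.feed(html)
    parser.close()
    return parser.formulas


# Shortened version of the markup above, embedded in the sentence from the report.
sample = (
    '<p>where <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/'
    '293d17c59c25748cbad9dbc20020c1649a98313e" '
    'class="mwe-math-fallback-image-inline" aria-hidden="true" '
    'alt="{\\displaystyle E_{n}()}"> is the generalized exponential '
    'integral function</p>'
)

print(collect_math_alts(sample))
```

The LaTeX source is duplicated in the alt attribute of the img element and in the annotation element inside the hidden MathML span, so either one could in principle serve as the text to keep.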
@adbar Is there any progress on this issue? Based on my testing, the alt text from Wikipedia formulas still cannot be parsed and kept in the output.
@szhengac Nothing new at the moment, but the alt issue you're mentioning is only loosely related to this one. Can you give an example?
@adbar If you run extract on the wiki page https://en.wikipedia.org/wiki/Condorcet_paradox, you will find that the math equations are missing, e.g., the one near "Using the central limit theorem, we show that".