trafilatura XML Parsing breaks on valid HTML

XML Parsing breaks on valid HTML

Open Jufik opened this issue 9 months ago • 5 comments

URL: https://fastapi.tiangolo.com

Versions:

Trafilatura: 1.6.2
Python: 3.10.13

When running trafilatura --output-format xml --URL https://fastapi.tiangolo.com/, this error shows up:

~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/xml.py:239: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
  if not parent:
ERROR: Char 0x0 out of allowed range, line 1, column 2 (<string>, line 1)
Traceback (most recent call last):
  File "~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/cli_utils.py", line 397, in examine
    result = extract(htmlstring, url=url, no_fallback=args.fast,
  File "~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/core.py", line 1107, in extract
    return determine_returnstring(document, output_format, include_formatting, tei_validation)
  File "~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/core.py", line 815, in determine_returnstring
    returnstring = control_xml_output(output, output_format, tei_validation, document)
  File "~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/xml.py", line 122, in control_xml_output
    output_tree = fromstring(control_string, CONTROL_PARSER)
  File "src/lxml/etree.pyx", line 3257, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2
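
The FutureWarning at the top of the log is separate from the crash: it comes from truth-testing an element (`if not parent:`), whose result in lxml depends on whether the element has children. Below is a minimal sketch of the two explicit tests the warning recommends. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, and the `describe` helper is hypothetical, not part of trafilatura.

```python
import xml.etree.ElementTree as ET


def describe(elem):
    """Classify an element without relying on its truth value (hypothetical helper)."""
    if elem is None:
        # The lookup found nothing at all.
        return "missing"
    if len(elem) == 0:
        # The element exists but has no child elements.
        return "present, no children"
    return "present, has children"


root = ET.fromstring("<doc><p>text</p><empty/></doc>")
print(describe(root))                # present, has children
print(describe(root.find("empty")))  # present, no children
print(describe(root.find("nope")))   # missing
```

Separating the two questions ("does it exist?" and "does it have children?") avoids the ambiguity that the warning is about.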

Outputting txt and json works.

At this point in the code, the data has already gone through sanitization, which should remove 0x00. On top of that:

>>> import requests
>>> r = requests.get("https://fastapi.tiangolo.com/")
>>> b'\x00' in r.content
False
>>> set(map(chr, r.content))
{'u', ']', 'v', '|', 'ê', '¬', '%', '-', '\x9e', '¡', '.', 'N', 'è', '\x81', '¯', 'ï', '{', '·', '¿', '°', 't', 'Ã', '@', 'z', '\x96', 'd', '7', '\x80', "'", '\x9a', 'X', ')', 'r', 'y', 'g', 'S', 'Q', ':', '\x97', '\x8e', '\x83', 'e', 'n', 'f', 'b', '\x8b', '9', '\x98', '_', 'º', ';', '\x8c', 'j', '8', 'C', 'L', '+', '\x9f', 'A', 'o', 'á', '\x87', '<', 'I', 'O', '(', '4', '¥', '/', '&', 'E', 'q', '\n', '5', 'c', 'W', '\x89', '"', 'R', 'a', 'x', '¼', 'Ñ', '?', '>', '=', '´', '¨', 'ð', 'M', '¸', 'ì', '0', 'F', '»', 'l', 'K', 'Ð', 'm', 'w', '¹', 'p', '§', 'P', '±', 'ª', '!', '\x8f', ',', '}', 'i', 'T', 'æ', 'í', 'D', '6', 'h', '$', '\x8d', '1', ' ', 'â', '\x95', '2', '[', '\x99', 's', 'U', '\xad', 'Y', 'k', '\x9c', 'J', 'µ', 'V', '#', 'G', 'B', 'Z', '3', '*', 'H'}
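
The session above inspects raw bytes, so each entry in the set is a byte value rather than a decoded character. A complementary check, sketched here under the assumption that the decoded text is what eventually reaches the XML parser, lists every code point that the XML 1.0 `Char` production forbids. The name `find_invalid_xml_chars` is hypothetical and not part of trafilatura or lxml.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return the sorted code points in `text` that XML 1.0 does not allow."""
    return sorted({ord(ch) for ch in INVALID_XML_CHARS.findall(text)})


# NUL (0x00) and vertical tab (0x0B) are both forbidden; tab and newline are fine.
print(find_invalid_xml_chars("ok\x00text\x0b\t\n"))  # [0, 11]
```

Running it on the decoded response (for example `find_invalid_xml_chars(r.text)`) would show whether a forbidden character is present in the input at all, or is only introduced later in the pipeline.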

I don't know the implementation well enough to point out where this character comes from. Any pointers on how to contribute are welcome.

Nov 08 '23 08:11 Jufik

Hi @Jufik, I cannot reproduce the bug. Which platform are you using?

Nov 08 '23 12:11 adbar

On an Apple M2 Max. I've dug around, and it seems that bypassing the sanitize call in xml.control_xml_output fixes the issue locally.

Nov 08 '23 15:11 Jufik

There are sometimes problems with LXML on M1/M2 platforms. Installing trafilatura (and thus lxml) with brew could help.

We could also sanitize the output as you say.
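
As a rough illustration of what sanitizing the output could look like, the sketch below strips characters that XML 1.0 forbids from the serialized string right before it is parsed. This is not trafilatura's code; `parse_sanitized` is a hypothetical helper, and the standard library parser stands in for lxml.

```python
import re
import xml.etree.ElementTree as ET

# Characters outside the XML 1.0 "Char" production (NUL, most control characters, etc.).
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def parse_sanitized(xml_string):
    """Drop characters XML 1.0 forbids, then parse the cleaned string."""
    cleaned = INVALID_XML_CHARS.sub("", xml_string)
    return ET.fromstring(cleaned)


# The NUL character would make the parser reject the raw string.
doc = parse_sanitized("<doc><p>Hello\x00 world</p></doc>")
print(doc.find("p").text)  # Hello world
```

Filtering at this last step guards the parser regardless of where an invalid character was introduced upstream.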

Nov 08 '23 15:11 adbar

This ongoing PR adopts a different approach to document sanitizing. It should also solve this problem, although I can't replicate it.

Nov 10 '23 13:11 adbar

@Jufik Is the problem solved?

Jan 26 '24 12:01 adbar

@adbar just made a test, works like a charm with 1.7.0!

Feb 04 '24 10:02 Jufik
