CDATA element fails to parse when element contains £ symbol

Open tomtaylor opened this issue 1 year ago • 6 comments

We have an XML file which has been failing to parse since we switched from calling File.stream!(path, [:compressed, :trim_bom]) to File.stream!(path, [:compressed, :trim_bom], 32_768), that is, from reading line by line to reading in 32,768-byte chunks. It returns the following error:

{:error, %Saxy.ParseError{reason: {:token, :"]]"}, binary: <<10, 60, 100, 101, 115, 99, 114, 105, 112, 116, 105, 111, 110, 62, 60, 33, 91, 67, 68, 65, 84, 65, 91, 60, 112, 62, 60, 115, 116, 114, 111, 110, 103, 62, 83, 65, 80, 32, 124, 32, 46, 78, 69, 84, 32, 124, ...>>, position: 92}}

The file is littered with empty CDATA elements. I wonder if one of those is aligning with the start/end of a buffer? I can provide the full XML file if useful - it's 114MB and I'd prefer not to provide it publicly.

tomtaylor avatar Sep 18 '23 16:09 tomtaylor

I've managed to pull together a minimal Elixir script and sample XML that reproduces this error: https://gist.github.com/tomtaylor/2220e932140611e44318921040be18fe

I don't think it's specifically related to streaming, but it does seem to be about a chunk aligning with a CDATA tag.

The test case uses Saxy.Partial and throws the following error:

{:error, %Saxy.ParseError{reason: {:token, :"]]"}, binary: <<10, 60, 115, 97, 108, 97, 114, 121, 84, 111, 62, 60, 33, 91, 67, 68, 65, 84, 65, 91, 194>>, position: 20}}

I believe the file is valid. I've run xmllint --valid --noout sample.xml against it and it looks fine, apart from the missing DTD, which I don't think Saxy cares about.

Let me know if I can give you any more information. Thanks!

tomtaylor avatar Nov 20 '23 21:11 tomtaylor

Looking again with fresh eyes this morning, I can see that it fails when the chunk passed to Saxy.Partial.parse doesn't contain the full CDATA closing sequence (]]>), only its first character (]). In my example, it received the following chunks:

  • ed><![CD
  • ATA[2023
  • -11-15 0
  • 2:16:59] <- this blows up

CDATA delimiters might be one of the few multi-character tokens in XML, so I imagine the streaming parser is getting tripped up when it sees only part of one. Does that seem plausible?
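
To see where the boundaries fall, here is a minimal sketch that cuts a binary into fixed-size byte chunks without regard to token or character boundaries. ChunkDemo is a hypothetical helper written for illustration, not part of Saxy, and the trailing `]>` in the sample input is an assumed continuation of the chunks listed above.

```elixir
defmodule ChunkDemo do
  # Hypothetical helper: splits `binary` into chunks of at most `size` bytes.
  # Boundaries are purely byte-based, so they can land inside a token.
  def chunks(binary, size) when is_binary(binary) and is_integer(size) and size > 0 do
    do_chunks(binary, size, [])
  end

  defp do_chunks(<<>>, _size, acc), do: Enum.reverse(acc)

  defp do_chunks(binary, size, acc) when byte_size(binary) <= size do
    Enum.reverse([binary | acc])
  end

  defp do_chunks(binary, size, acc) do
    <<chunk::binary-size(size), rest::binary>> = binary
    do_chunks(rest, size, [chunk | acc])
  end
end

# With 8-byte chunks the closing delimiter is split across two chunks:
# ["ed><![CD", "ATA[2023", "-11-15 0", "2:16:59]", "]>"]
ChunkDemo.chunks("ed><![CDATA[2023-11-15 02:16:59]]>", 8)
```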

tomtaylor avatar Nov 21 '23 12:11 tomtaylor

@qcam any thoughts on this? You should have a full reproducible example linked above, but let me know if I can provide any more context.

tomtaylor avatar Jan 15 '24 17:01 tomtaylor

Hi again @qcam - is there anything else we can do to help with this issue? I've poked around the code base to see if there's an obvious place to fix, but it's eluding me. There's a minimal reproducible example in the post above: https://gist.github.com/tomtaylor/2220e932140611e44318921040be18fe

tomtaylor avatar Jul 22 '24 07:07 tomtaylor

I've made a bit of progress on this. It fails when Saxy.Parser.Builder.element_cdata receives a buffer ending in the lone byte 194 (0xC2), which is the first byte of a two-byte UTF-8 sequence rather than a complete character. That byte is matched by neither the is_ascii guard nor <<codepoint::utf8>>.

e.g.

buffer = <<194>>

case buffer do
  <<codepoint::utf8>> <> rest -> dbg(codepoint)
end

This will raise a CaseClauseError.
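
For comparison, a small sketch of how the same kind of match treats an ASCII byte, a complete two-byte sequence, and a lone lead byte. Utf8Probe is a hypothetical module used only for illustration; it is not how Saxy structures its clauses.

```elixir
defmodule Utf8Probe do
  # Hypothetical helper: reports how the head of a buffer matches.
  def classify(<<>>), do: :empty

  def classify(<<codepoint::utf8, _rest::binary>>) when codepoint <= 0x7F do
    {:ascii, codepoint}
  end

  def classify(<<codepoint::utf8, _rest::binary>>), do: {:utf8, codepoint}

  # Reached when the buffer starts with bytes that do not form a complete,
  # valid UTF-8 sequence, such as a lone lead byte.
  def classify(_other), do: :no_match
end

Utf8Probe.classify("a")            # {:ascii, 97}
Utf8Probe.classify(<<0xC2, 0xA3>>) # {:utf8, 163}
Utf8Probe.classify(<<0xC2>>)       # :no_match
```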

tomtaylor avatar Aug 04 '24 13:08 tomtaylor

OK, I think I fully understand this now. The £ symbol is encoded as <<0xC2, 0xA3>>. When parsing the file, if a chunk of data cuts off at 0xC2, the parser chokes on it, because the lone byte is neither an ASCII character (below 128) nor a complete UTF-8 codepoint. I've added a PR in #133 which I think fixes this.
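
One way to work around this on the caller's side is to hold back an incomplete trailing UTF-8 sequence and prepend it to the next chunk. The sketch below illustrates that idea only. It is not necessarily what the PR does, and Utf8Boundary, split_tail/1 and realign/1 are hypothetical names that do not exist in Saxy.

```elixir
defmodule Utf8Boundary do
  # Hypothetical helper: splits `chunk` into {complete, pending}, where
  # `pending` is a trailing incomplete UTF-8 sequence (possibly empty).
  def split_tail(chunk) when is_binary(chunk) do
    size = byte_size(chunk)

    # A UTF-8 sequence is at most 4 bytes, so an incomplete tail is 1 to 3 bytes.
    cut =
      Enum.find(1..min(3, size)//1, 0, fn n ->
        incomplete?(binary_part(chunk, size - n, n))
      end)

    {binary_part(chunk, 0, size - cut), binary_part(chunk, size - cut, cut)}
  end

  # Re-chunks a list of binaries so no chunk ends mid-character.
  def realign(chunks) do
    {acc, pending} =
      Enum.reduce(chunks, {[], <<>>}, fn chunk, {acc, pending} ->
        {complete, rest} = split_tail(pending <> chunk)
        {[complete | acc], rest}
      end)

    # Bytes still pending at the end are emitted as-is so nothing is dropped.
    Enum.reverse([pending | acc]) |> Enum.reject(&(&1 == <<>>))
  end

  # A lead byte followed by fewer continuation bytes than it requires.
  defp incomplete?(<<0b110::3, _::5>>), do: true
  defp incomplete?(<<0b1110::4, _::4>>), do: true
  defp incomplete?(<<0b1110::4, _::4, 0b10::2, _::6>>), do: true
  defp incomplete?(<<0b11110::5, _::3>>), do: true
  defp incomplete?(<<0b11110::5, _::3, 0b10::2, _::6>>), do: true
  defp incomplete?(<<0b11110::5, _::3, 0b10::2, _::6, 0b10::2, _::6>>), do: true
  defp incomplete?(_), do: false
end

# The lone 0xC2 is held back and rejoined with 0xA3 from the next chunk.
Utf8Boundary.realign([<<"<![CDATA[", 0xC2>>, <<0xA3, "100]]>">>])
```

Each realigned chunk could then be passed to Saxy.Partial.parse in place of the raw chunk, at the cost of one extra binary concatenation per chunk.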

tomtaylor avatar Aug 05 '24 09:08 tomtaylor