saxy
CDATA element fails to parse when element contains £ symbol
We have an XML file which is failing to parse since we switched from calling File.stream!(path, [:compressed, :trim_bom]) to File.stream!(path, [:compressed, :trim_bom], 32_768). It throws the following error:
{:error, %Saxy.ParseError{reason: {:token, :"]]"}, binary: <<10, 60, 100, 101, 115, 99, 114, 105, 112, 116, 105, 111, 110, 62, 60, 33, 91, 67, 68, 65, 84, 65, 91, 60, 112, 62, 60, 115, 116, 114, 111, 110, 103, 62, 83, 65, 80, 32, 124, 32, 46, 78, 69, 84, 32, 124, ...>>, position: 92}}
The file is littered with empty CDATA elements. I wonder if one of those is aligning with the start/end of a buffer? I can provide the full XML file if useful - it's 114MB and I'd prefer not to provide it publicly.
I've managed to pull together a minimal Elixir script and sample XML that reproduces this error: https://gist.github.com/tomtaylor/2220e932140611e44318921040be18fe
I don't think it's specifically related to streaming, but it does seem to be about a chunk aligning with a CDATA tag.
The test case uses Saxy.Partial and throws the following error:
{:error, %Saxy.ParseError{reason: {:token, :"]]"}, binary: <<10, 60, 115, 97, 108, 97, 114, 121, 84, 111, 62, 60, 33, 91, 67, 68, 65, 84, 65, 91, 194>>, position: 20}}
I believe the file is valid. I've run xmllint --valid --noout sample.xml against it and it looks fine, apart from the missing DTD, which I don't think Saxy cares about.
Let me know if I can give you any more information. Thanks!
Looking again with fresh eyes this morning I can see that it fails when the chunk passed to Saxy.Partial.parse doesn't contain the full CDATA close element (]]), only the first character of one (]). In my example, it received the following chunks:
- ed><![CD
- ATA[2023
- -11-15 0
- 2:16:59] <- this blows up
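To make the chunk boundaries above easier to see, here is a minimal sketch in plain Elixir that cuts a binary into fixed-size byte chunks, the way a fixed-size read buffer would. The module name ChunkDemo is hypothetical, and the trailing "]>" in the sample input is an assumption about what follows the last chunk listed above. It does not call Saxy at all.

```elixir
defmodule ChunkDemo do
  # Splits a binary into pieces of at most `size` bytes, ignoring character
  # and token boundaries, much like a fixed-size read buffer would.
  def chunk_bytes(binary, size) when is_binary(binary) and size > 0 do
    do_chunk(binary, size, [])
  end

  defp do_chunk(<<>>, _size, acc), do: Enum.reverse(acc)

  defp do_chunk(binary, size, acc) when byte_size(binary) <= size do
    Enum.reverse([binary | acc])
  end

  defp do_chunk(binary, size, acc) do
    <<chunk::binary-size(size), rest::binary>> = binary
    do_chunk(rest, size, [chunk | acc])
  end
end

# Assumed sample: the text after "2:16:59]" is a guess at the rest of the CDATA close.
"ed><![CDATA[2023-11-15 02:16:59]]>"
|> ChunkDemo.chunk_bytes(8)
|> IO.inspect()
# ["ed><![CD", "ATA[2023", "-11-15 0", "2:16:59]", "]>"]
```

The fourth chunk ends with a single ], so the closing token is split across two chunks.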
CDATA sections might be one of the few places in XML where a delimiter is a multi-character token, so I imagine the streaming parser is getting tripped up when it only sees part of that token. Does that seem plausible?
@qcam any thoughts on this? You should have a full reproducible example linked above, but let me know if I can provide any more context.
Hi again @qcam - is there anything else we can do to help with this issue? I've poked around the code base to see if there's an obvious place to fix, but it's eluding me. There's a minimal reproducible example in the post above: https://gist.github.com/tomtaylor/2220e932140611e44318921040be18fe
I've made a bit of progress on this. This is failing when Saxy.Parser.Builder.element_cdata is left with the lone byte 194 (0xC2), which I first read as a non-breaking space character. That byte isn't matched by the is_ascii guard, nor by <<codepoint::utf8>>.
For example:
buffer = <<194>>
case buffer do
  <<codepoint::utf8>> <> rest -> dbg(codepoint)
end
This will raise a CaseClauseError.
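A small sketch of the same match with a fallback clause, to show the difference between a lone lead byte and a complete sequence. Utf8Probe is a hypothetical name used only for illustration; it is not part of Saxy.

```elixir
defmodule Utf8Probe do
  # Returns {:ok, codepoint, rest} when the buffer starts with a complete
  # UTF-8 sequence, or :no_match when it does not.
  def first_codepoint(<<codepoint::utf8, rest::binary>>), do: {:ok, codepoint, rest}
  def first_codepoint(_buffer), do: :no_match
end

# A lone 194 (0xC2) is only the first byte of a two-byte sequence.
IO.inspect(Utf8Probe.first_codepoint(<<194>>))
# :no_match

# 194 followed by 163 (0xA3) is the complete encoding of £ (U+00A3).
IO.inspect(Utf8Probe.first_codepoint(<<194, 163>>))
# {:ok, 163, ""}
```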
OK, I think I fully understand this now. The £ symbol is encoded in UTF-8 as <<0xC2, 0xA3>>. When parsing the file, if a chunk of data cuts off right after 0xC2, the parser chokes on it, because that lone byte is neither an ASCII character (0-127) nor a complete UTF-8 codepoint. I've opened a PR, #133, which I think fixes this.
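For anyone stuck on an affected version, one possible caller-side workaround is to re-chunk the stream so that no chunk ends in the middle of a UTF-8 sequence. The sketch below is an assumption-laden illustration, not what the PR does: Utf8Boundary, split/1 and rechunk/1 are hypothetical names, and it assumes the input is otherwise valid UTF-8. It only handles split characters, not a CDATA close token split across chunks.

```elixir
defmodule Utf8Boundary do
  # Splits `chunk` into {complete, pending}, where `pending` is a trailing,
  # incomplete UTF-8 sequence (0 to 3 bytes) to prepend to the next chunk.
  def split(chunk) when is_binary(chunk) do
    size = byte_size(chunk)
    keep = size - pending_bytes(chunk, size, 1)
    <<complete::binary-size(keep), pending::binary>> = chunk
    {complete, pending}
  end

  # Re-chunks an enumerable of binaries so each emitted chunk ends on a
  # character boundary. Bytes still pending when the input ends are dropped.
  def rechunk(stream) do
    Stream.transform(stream, <<>>, fn chunk, pending ->
      {complete, new_pending} = split(pending <> chunk)
      if complete == <<>>, do: {[], new_pending}, else: {[complete], new_pending}
    end)
  end

  # Walk back over at most 3 bytes looking for the lead byte of the last sequence.
  defp pending_bytes(_chunk, size, back) when back > 3 or back > size, do: 0

  defp pending_bytes(chunk, size, back) do
    byte = :binary.at(chunk, size - back)

    cond do
      # Continuation byte (10xxxxxx): keep walking back.
      byte in 0x80..0xBF -> pending_bytes(chunk, size, back + 1)
      # Lead byte whose sequence needs more bytes than we have: hold it back.
      sequence_length(byte) > back -> back
      true -> 0
    end
  end

  defp sequence_length(byte) when byte in 0xC0..0xDF, do: 2
  defp sequence_length(byte) when byte in 0xE0..0xEF, do: 3
  defp sequence_length(byte) when byte in 0xF0..0xF7, do: 4
  defp sequence_length(_byte), do: 1
end

# The lone 0xC2 is held back until the 0xA3 that completes £ arrives.
[<<"<![CDATA[", 0xC2>>, <<0xA3, "5]]>">>]
|> Utf8Boundary.rechunk()
|> Enum.to_list()
|> IO.inspect()
# ["<![CDATA[", "£5]]>"]
```

In principle the result of rechunk/1 could be fed chunk by chunk to the parser in place of the raw File.stream! output, at the cost of one extra binary concatenation per chunk.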