Handle CDATA with UTF-8 characters when partial parsing

Open tomtaylor opened this issue 1 year ago • 1 comment

Follows on from #122.

In partial mode, a UTF-8 encoded character can be split across two chunks. When this happens for a character such as £, which is encoded as <<0xC2, 0xA3>>, the lone 0xC2 at the end of the first chunk is neither an ASCII character (<= 127) nor a match for the <<codepoint::utf8>> clause, so Saxy throws a parser error.
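To make the failure concrete, here is a minimal sketch of the matching behaviour. The `SplitDemo` module and its `classify/1` function are hypothetical stand-ins for the two clauses described above, not Saxy's actual parser code; they only assume that one clause accepts ASCII bytes and the other accepts a complete UTF-8 code point.

```elixir
defmodule SplitDemo do
  # Hypothetical classifier mirroring the two clauses described above.
  # Clause 1: a single ASCII byte.
  def classify(<<byte, _rest::binary>>) when byte <= 127, do: :ascii
  # Clause 2: a complete UTF-8 encoded code point.
  def classify(<<_codepoint::utf8, _rest::binary>>), do: :utf8
  # Anything else, including a lone lead byte, matches neither clause.
  def classify(_other), do: :no_match
end

pound = <<0xC2, 0xA3>>            # the two bytes that encode "£"
<<lead, continuation>> = pound    # simulate a chunk boundary between them

whole = SplitDemo.classify(pound)                    # matches the utf8 clause
first_half = SplitDemo.classify(<<lead>>)            # matches neither clause
second_half = SplitDemo.classify(<<continuation>>)   # matches neither clause
```

Both halves of the split character fall through to the catch-all clause, which is the situation that surfaces as a parse error in partial mode.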

This fixes that by parsing every byte inside a CDATA element as-is, regardless of its code point. It drops the optimisation of consuming a whole UTF-8 character at once, but I suspect that optimisation is only a minor performance improvement for most documents.
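The byte-wise approach can be sketched roughly as follows. This is an illustration under assumptions, not the code in this change: `CDataSketch.scan/2`, its return tuples, and the handling of a split `]]>` terminator are all hypothetical.

```elixir
defmodule CDataSketch do
  # Hypothetical byte-wise CDATA scanner; not Saxy's actual implementation.

  # The terminator ends the section; return the content and the remaining input.
  def scan(<<"]]>", rest::binary>>, acc), do: {:done, acc, rest}

  # Out of input. A trailing "]" or "]]" might be the start of a terminator
  # split across chunks, so hand it back to be prepended to the next chunk.
  def scan(tail, acc) when tail in ["", "]", "]]"], do: {:more, acc, tail}

  # Any other byte is content, whether or not it forms a full code point.
  def scan(<<byte, rest::binary>>, acc), do: scan(rest, <<acc::binary, byte>>)
end

# First chunk ends in the middle of "£" (0xC2 without its 0xA3).
{:more, acc, pending} = CDataSketch.scan(<<"price: ", 0xC2>>, "")

# Second chunk supplies the continuation byte and the terminator.
{:done, content, rest} = CDataSketch.scan(pending <> <<0xA3, "5]]>">>, acc)
```

Because no clause inspects code points, the split lead byte is simply appended to the accumulator, and the character is whole again once the next chunk arrives.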

@qcam is this a more prevalent issue than my use case? I can see why matching on UTF-8 codepoint and swallowing the whole character is a nice optimisation, but I wonder if it might cause issues in other places when partial parsing.

tomtaylor, Aug 05 '24 08:08

@qcam any thoughts on this?

tomtaylor, Aug 13 '24 06:08