quick-xml icon indicating copy to clipboard operation
quick-xml copied to clipboard

Support for embedded raw binary

Open elrnv opened this issue 1 year ago • 2 comments

There are use cases where embedding binary data into an XML document is convenient. Although not officially supported by the XML spec there are valid and popular uses that exist in the wild (e.g. VTK). Binary VTK files are quite common, in fact the <AppendedData> tag (which can contain binary data) is the default when writing vtk files, and there are plenty of files like this in the wild.

In the past quick-xml was able to support binary data, but this has changed and in the current state requires some non-trivial changes for adding support for being able to write and read uninterpreted binary data.

My first attempt didn't really address the issue correctly since binary data can have valid '<' and '>' tags, which can trip up the parser, meaning that at least reading must be augmented from the lowest level (probably as part of Reader).

As discussed previously the implementation for this should work as an additional feature.

Reading

I think it is not difficult to add a customization point for the reader in quick-xml without being too invasive by adding a "context" field (supplied externally) to Reader that can call ask how many bytes should be read by feeding it context about what has already been read.

Writing

All types in the serialize part of the code is bound by fmt::Write. I haven't found a way to maintain that bound everywhere and also support for writing binary files to an optional or custom io::Write type. To enable this I think we have to switch back to io::Write. I understand that there are some performance concerns about having to write to a buffer if io::Write is used, but I don't see another way of adding support for binary writing without io::Write. I can really use some help from the maintainers to address this point.

elrnv avatar Jul 16 '23 05:07 elrnv

It would be valuable, if you could link or attach some examples of files and provide links to the explanation, how VTK parser should understand that some byte is not a markup, but a part of a binary data.

I found an archive with some files in https://github.com/elrnv/vtkio/issues/27 and fight now I suspect the following algorithm for resolving ambiguity:

  • when read opening tag with a special mark (<AppendedData encoding="raw"> in an example) the parser turned into special binary reading mode
  • in that mode any sequence of bytes are allowed until it contains the closing tag sequence (</AppendedData> in an example)
  • closing tag sequence should be interpreted in the document encoding (utf-8 in an example)

So I think we could work on this issue in a few steps:

  • Implement a way to read binary data. I think, it should be a special method(s) of a Reader that should be called after consuming Event::Start(r#"<AppendedData encoding="raw">"#) and fill a provided buffer with the data and return a Span. The special Event::Binary is unlikely to be needed at this point, because there is no way to generically understand that we should read binary data instead of a text data. But we could think about a callback that will take &BytesStart and return a parsing mode. I think, that this part could be available without feature flags
  • Think about API for a Writer. Actually, I think, that we need to split the current Event into reader::Event and writer::Event, because this could benefit for this and other cases
  • Think about serde API and implementation

Mingun avatar Jul 16 '23 17:07 Mingun

Thank you for the quick response!

It would be valuable, if you could link or attach some examples of files and provide links to the explanation, how VTK parser should understand that some byte is not a markup, but a part of a binary data.

Here are a few examples:

  • a file in the wild: https://github.com/elrnv/vtkio/files/11272859/fluid_1_91.zip
  • binary VTK exported with Paraview: https://github.com/elrnv/vtkio/blob/master/assets/hexahedron_binary.vtu
  • a binary VTK with compressed bytes: https://github.com/elrnv/vtkio/raw/master/assets/hexahedron_lz4.vtu

There doesn't seem to be any documentation on how to parse this kind of binary section, just on how to decode it. The official vtk file format is documented here The binary section is always prefixed with an underscore, and the first few bytes determine how large the binary section is.

  • when read opening tag with a special mark (<AppendedData encoding="raw"> in an example) the parser turned into special binary reading mode
  • in that mode any sequence of bytes are allowed until it contains the closing tag sequence (</AppendedData> in an example)
  • closing tag sequence should be interpreted in the document encoding (utf-8 in an example)

That makes sense to me, and should be robust enough to support practically all such VTK files with embedded binary blobs. Although, I think ideally this logic is best left in vtkio, here it seems we should just provide a way to do context dependent parsing that is general enough for this use case.

So I think we could work on this issue in a few steps ...

This makes sense to me, but for the writer, I am still in the dark as to how io::Writer should be supported throughout the serde modules.

It is easiest for me to prototype a complete solution for reading and writing and test it directly using the tests I already have in vtkio. I am happy to start a PR and try to separate the reading and writing changes though to make it easier to review.

elrnv avatar Jul 16 '23 18:07 elrnv

Probably reading binary data is possible today, if you get inner reader when you've ready to read binary and read from it. The possible drawback is that internal offsets in the Reader won't be updated, so you will need to track read bytes and correct result of .error_position() / .buffer_position() accordingly.

Mingun avatar Jul 11 '24 20:07 Mingun

Would this also work from the Serde API?

elrnv avatar Jul 13 '24 20:07 elrnv

No, serde requires more work. Suggested approach lefts to the user the decision where to stop reading binary data and return to reading XML events. In serde API you should somehow to tell the Reader these things. That is subject for design. I do not plan to design such API in the near time, so you can try to do that yourself.

Mingun avatar Jul 13 '24 20:07 Mingun

Just to get some clarification, am I correct in my understanding that this would also allow proper handling of embedded DSLs or de facto embedded DSLs (eg. HTML's <textarea>, <script>) as long as the apparent lack of a call to push back/un-read data is compensated for by disabling check_end_names and matching the closing tag as if the it was part of the "binary data" and the "binary data" was designed to be self-terminating?

(I know that, given a restricted HTML subset that lacks those "embedded DSL elements" but contains self-closing tags like <br>, such as a Pidgin chatlog, older versions of quick-xml can be used to parse it an order of magnitude more quickly than with html5ever by turning off paired tag enforcement and enforcing expected well-formedness rules manually.)

ssokolow avatar Aug 01 '24 02:08 ssokolow

Yes, you understand correctly. You even don't need to disable check_end_names because when you read from .stream(), you are effectively cut some data from seeing by XML parser. As long as not cutted content represent valid XML with proper nesting everything will work smoothly.

Mingun avatar Aug 01 '24 07:08 Mingun

Yes but, unless the cut content is length-prefixed, which embedded DSLs generally aren't, there'd need to be some way to "rewind" the stream to put the closing tag back into what the XML parser sees after the DSL parser has seen it. Hence, disabling check_end_names.

ssokolow avatar Aug 01 '24 07:08 ssokolow

While that may be needed in some cases, I think, that usually you shouldn't do that. .stream() returns an impl BufRead, so you can lookahead without consuming data. After all, it wraps the type that you pass to the from_reader method, so you can pass such a type which will able to buffer arbitrary amount of data without consuming them.

Mingun avatar Aug 01 '24 15:08 Mingun

Huh. You're right... and it looks like I'm not the only person who was confused by the naming and documentation of fill_buf() into thinking that BufRead had no peek().

(Sorry for the delayed response. It took me a couple of days to realize that I was remembering a response I'd intended to send instead of one I'd actually sent.)

ssokolow avatar Aug 03 '24 14:08 ssokolow