quick-xml
quick-xml copied to clipboard
Support for embedded raw binary
There are use cases where embedding binary data into an XML document is convenient. Although not officially supported by the XML spec there are valid and popular uses that exist in the wild (e.g. VTK). Binary VTK files are quite common, in fact the <AppendedData> tag (which can contain binary data) is the default when writing vtk files, and there are plenty of files like this in the wild.
In the past quick-xml
was able to support binary data, but this has changed and in the current state requires some non-trivial changes for adding support for being able to write and read uninterpreted binary data.
My first attempt didn't really address the issue correctly since binary data can have valid '<' and '>' tags, which can trip up the parser, meaning that at least reading must be augmented from the lowest level (probably as part of Reader
).
As discussed previously the implementation for this should work as an additional feature.
Reading
I think it is not difficult to add a customization point for the reader in quick-xml
without being too invasive by adding a "context" field (supplied externally) to Reader that can call ask how many bytes should be read by feeding it context about what has already been read.
Writing
All types in the serialize part of the code is bound by fmt::Write
. I haven't found a way to maintain that bound everywhere and also support for writing binary files to an optional or custom io::Write
type. To enable this I think we have to switch back to io::Write
. I understand that there are some performance concerns about having to write to a buffer if io::Write
is used, but I don't see another way of adding support for binary writing without io::Write
. I can really use some help from the maintainers to address this point.
It would be valuable, if you could link or attach some examples of files and provide links to the explanation, how VTK parser should understand that some byte is not a markup, but a part of a binary data.
I found an archive with some files in https://github.com/elrnv/vtkio/issues/27 and fight now I suspect the following algorithm for resolving ambiguity:
- when read opening tag with a special mark (
<AppendedData encoding="raw">
in an example) the parser turned into special binary reading mode - in that mode any sequence of bytes are allowed until it contains the closing tag sequence (
</AppendedData>
in an example) - closing tag sequence should be interpreted in the document encoding (
utf-8
in an example)
So I think we could work on this issue in a few steps:
- Implement a way to read binary data. I think, it should be a special method(s) of a
Reader
that should be called after consumingEvent::Start(r#"<AppendedData encoding="raw">"#)
and fill a provided buffer with the data and return aSpan
. The specialEvent::Binary
is unlikely to be needed at this point, because there is no way to generically understand that we should read binary data instead of a text data. But we could think about a callback that will take&BytesStart
and return a parsing mode. I think, that this part could be available without feature flags - Think about API for a
Writer
. Actually, I think, that we need to split the currentEvent
intoreader::Event
andwriter::Event
, because this could benefit for this and other cases - Think about serde API and implementation
Thank you for the quick response!
It would be valuable, if you could link or attach some examples of files and provide links to the explanation, how VTK parser should understand that some byte is not a markup, but a part of a binary data.
Here are a few examples:
- a file in the wild: https://github.com/elrnv/vtkio/files/11272859/fluid_1_91.zip
- binary VTK exported with Paraview: https://github.com/elrnv/vtkio/blob/master/assets/hexahedron_binary.vtu
- a binary VTK with compressed bytes: https://github.com/elrnv/vtkio/raw/master/assets/hexahedron_lz4.vtu
There doesn't seem to be any documentation on how to parse this kind of binary section, just on how to decode it. The official vtk file format is documented here The binary section is always prefixed with an underscore, and the first few bytes determine how large the binary section is.
- when read opening tag with a special mark (<AppendedData encoding="raw"> in an example) the parser turned into special binary reading mode
- in that mode any sequence of bytes are allowed until it contains the closing tag sequence (</AppendedData> in an example)
- closing tag sequence should be interpreted in the document encoding (utf-8 in an example)
That makes sense to me, and should be robust enough to support practically all such VTK files with embedded binary blobs. Although, I think ideally this logic is best left in vtkio
, here it seems we should just provide a way to do context dependent parsing that is general enough for this use case.
So I think we could work on this issue in a few steps ...
This makes sense to me, but for the writer, I am still in the dark as to how io::Writer
should be supported throughout the serde modules.
It is easiest for me to prototype a complete solution for reading and writing and test it directly using the tests I already have in vtkio
. I am happy to start a PR and try to separate the reading and writing changes though to make it easier to review.
Probably reading binary data is possible today, if you get inner reader when you've ready to read binary and read from it. The possible drawback is that internal offsets in the Reader
won't be updated, so you will need to track read bytes and correct result of .error_position()
/ .buffer_position()
accordingly.
Would this also work from the Serde API?
No, serde requires more work. Suggested approach lefts to the user the decision where to stop reading binary data and return to reading XML events. In serde API you should somehow to tell the Reader these things. That is subject for design. I do not plan to design such API in the near time, so you can try to do that yourself.
Just to get some clarification, am I correct in my understanding that this would also allow proper handling of embedded DSLs or de facto embedded DSLs (eg. HTML's <textarea>
, <script>
) as long as the apparent lack of a call to push back/un-read data is compensated for by disabling check_end_names
and matching the closing tag as if the it was part of the "binary data" and the "binary data" was designed to be self-terminating?
(I know that, given a restricted HTML subset that lacks those "embedded DSL elements" but contains self-closing tags like <br>
, such as a Pidgin chatlog, older versions of quick-xml can be used to parse it an order of magnitude more quickly than with html5ever by turning off paired tag enforcement and enforcing expected well-formedness rules manually.)
Yes, you understand correctly. You even don't need to disable check_end_names
because when you read from .stream()
, you are effectively cut some data from seeing by XML parser. As long as not cutted content represent valid XML with proper nesting everything will work smoothly.
Yes but, unless the cut content is length-prefixed, which embedded DSLs generally aren't, there'd need to be some way to "rewind" the stream to put the closing tag back into what the XML parser sees after the DSL parser has seen it. Hence, disabling check_end_names
.
While that may be needed in some cases, I think, that usually you shouldn't do that. .stream()
returns an impl BufRead
, so you can lookahead without consuming data. After all, it wraps the type that you pass to the from_reader
method, so you can pass such a type which will able to buffer arbitrary amount of data without consuming them.
Huh. You're right... and it looks like I'm not the only person who was confused by the naming and documentation of fill_buf()
into thinking that BufRead
had no peek()
.
(Sorry for the delayed response. It took me a couple of days to realize that I was remembering a response I'd intended to send instead of one I'd actually sent.)