Support for blosc-style byte transposition and additional compressors
There's been discussion around supporting additional compression modes offered by the blosc library, particularly zstd with blosc's byte transposition filter. Supporting the transposition would require new field(s) in the ASDF block header that describe the compression block size and the fact that the bytes were transposed. We would also need to add a new 4-byte compression code for zstd.
Could we just create a block prefix area for extra metadata that is implicit for that compression scheme? Does it have to be explicit in the block fields? This would allow much more flexible additions in the future without having to keep changing the definition of the block structure.
I liked @Cadair's suggestion from asdf-format/asdf#775 of integrating the numcodecs library instead of just blosc or zstd. My impression was that numcodecs uses a JSON-serializable metadata dict to describe how to decompress a binary blob. Could such a dict go in the YAML that describes an ASDF block, instead of the block prefix area (which I think means the binary prefix)? I was imagining a new 4-byte compression code like "ncdx" to indicate "new style" compression that would tell ASDF to look at the numcodecs info in the YAML.
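For concreteness, here is roughly what that metadata dict looks like in numcodecs (the exact keys shown are approximate and depend on the codec):

```python
# Rough illustration of the numcodecs metadata dict; exact fields vary by codec.
import numcodecs

codec = numcodecs.Blosc(cname="zstd", clevel=5, shuffle=numcodecs.Blosc.SHUFFLE)
print(codec.get_config())
# e.g. {'id': 'blosc', 'cname': 'zstd', 'clevel': 5, 'shuffle': 1, 'blocksize': 0}

# The dict is JSON-serializable and round-trips back into a codec instance:
same_codec = numcodecs.get_codec(codec.get_config())
```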
Also, would such a pattern be part of the ASDF standard, or is compression considered an implementation detail?
I agree that the equivalent of the numcodecs metadata dict should be stored that way (or allowing YAML as well). Whether it should go in the YAML header or be embedded in the binary block is something we haven't yet settled on (e.g., prefixing the binary data with the JSON metadata).
Great. I'm happy to help with implementation, too!
I'm concerned about the impact this will have on interoperability with ASDF readers in other languages. General support for numcodecs compression is convenient for Python developers, but someone working in another language would have to write their own glue code to translate the numcodecs metadata into tools available in their language. Given the large and evolving number of possible numcodecs configurations, it would not be feasible to support them all.
The limited list of compression modes in the current standard seems useful, because any ASDF reader that fully implements the standard will be guaranteed to be able to read the raw data. On the other hand, there are clearly cases where users need more exotic compression modes! I'm not sure how best to reconcile those two needs.
How about this:
- Continue to maintain a short list of official compression modes, and recommend that ASDF files use one of those modes for the sake of interoperability
- Add a new 4-byte compression code to indicate a custom compression mode ("cust" or similar) but not tie it to numcodecs specifically
- Create some kind of plugin API in the Python library to provide general support for custom compressors (a rough sketch follows this list)
- Create a numcodecs plugin
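To make the third bullet concrete, here is a hypothetical sketch of what such a plugin API might look like; the names (Compressor, register_compressor) are illustrative, not an existing asdf API:

```python
# Hypothetical plugin API sketch; Compressor and register_compressor are
# made-up names, not existing asdf interfaces.
import abc

class Compressor(abc.ABC):
    @abc.abstractmethod
    def claims(self, metadata: dict) -> bool:
        """Return True if this plugin recognizes a block's compression metadata."""

    @abc.abstractmethod
    def compress(self, data: bytes, metadata: dict) -> bytes:
        """Compress raw block data when writing a 'cust' block."""

    @abc.abstractmethod
    def decompress(self, data: bytes, metadata: dict) -> bytes:
        """Decompress data read from a 'cust' block."""

_compressors = []

def register_compressor(compressor: Compressor) -> None:
    """Make a compressor available for blocks marked with the 'cust' code."""
    _compressors.append(compressor)
```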
Yes, it should be made clear that whatever is added has to be reasonably easy to support in other languages. I agree that we shouldn't imply that everything in numcodecs is supported by ASDF. It can be used by the Python library for convenience, but it shouldn't set the standard. I tend to think the metadata it generates for language-neutral codecs is probably fairly generic, but we would have to check that.
To the extent that it is supported by blosc, it does seem more generic, since blosc is implemented in C. Still, that would present problems for languages that don't interface with C well. And for the purposes of archival support, the choice of compression technologies should be limited to those likely to be around for a long time.
@eslavich Any thoughts about where the metadata ought to go? If it goes in the YAML header, that would require extending the ndarray schema to allow such an object for any compression scheme. Right now all of that information is in the binary block, decoupled from the ndarray information. Putting it in the binary block preserves that decoupling.
The "cust" tag + plugin approach sounds reasonable to me. I wonder if the plugin API should be implemented as a thin wrapper to numcodec's "registry", since it provides this kind of functionality already: https://numcodecs.readthedocs.io/en/stable/registry.html#numcodecs.registry.register_codec
On the other hand, I agree that it would be nice if numcodecs were optional in the way that lz4 is now. But maybe that would be the case: unless a "cust" tag is seen, no need to invoke numcodecs. Still thinking this through...
Re metadata in the block vs the tree: I can see arguments for both. The ndarray objects in the tree are views over the block data, and we support multiple views per block, so it makes sense to centralize the compression metadata in the block itself. This also works well with our scheme to reference blocks in one ASDF file from another.
But, the block header is currently a small and fixed-length 54 bytes. The fixed length is a nice property but it seems like we'd need to give that up to support additional custom metadata. I don't think we'd want to reserve a fixed length metadata region since that would significantly increase the size of files that utilize many small blocks.
Does anything prevent a second-level, variable-length prefix? An ID in the block header could indicate that there is a prefix to the data in the data section, with the first word of that prefix giving its length. The prefix would hold the JSON/YAML string with the necessary information.
I do kind of like the idea of having the compression data in the tree, because it lets one skim the header and understand what kind of compression is involved in reading the file, giving an idea of how CPU-intensive reading the file is going to be, how much it's likely to inflate, etc. But those are minor convenience factors; there would be programmatic ways to determine that information if it were actually important.
> Does anything prevent a second-level, variable-length prefix? An ID in the block header could indicate that there is a prefix to the data in the data section, with the first word of that prefix giving its length. The prefix would hold the JSON/YAML string with the necessary information.
If we're going to go that far, we could make the entire block header a mini YAML document...
I suppose, but for most uses that seems like overkill, and inefficient in that YAML parsing would be needed, even for small arrays.
Another option is to require the custom compressor to be configured at the file level. Then, any blocks in the file with cust would have to be compressed with the same mode. Would that be too limiting?
Perhaps. One scheme is to have the file-level configuration include an ID that can be referenced in the block header, allowing multiple compression options within the same file. It's a bit brittle in that the ID is arbitrary, and if the block is separated from the YAML, it is rendered useless.
That's also true with only one compression scheme, though.
I think having all blocks be compressed with the same algorithm is okay, but compressors like blosc need to know the type width (e.g. 4 bytes for float, 8 for double) on a per-block basis. This information actually doesn't need to be recorded at the metadata level, because blosc encodes it in its own header, but users would need a way to control this on a per-block level during compression.
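As an illustration of that point, here is a rough sketch using the python-blosc package: the type width is supplied at compression time, but decompression doesn't need it because blosc records it in its own header.

```python
# Sketch: blosc's shuffle filter needs the element width (typesize) when
# compressing, but not when decompressing, since the width is stored in
# blosc's own header.
import blosc
import numpy as np

data = np.arange(1000, dtype=np.float64)

compressed = blosc.compress(
    data.tobytes(),
    typesize=data.dtype.itemsize,  # 8 bytes per float64 element
    cname="zstd",
    shuffle=blosc.SHUFFLE,
)

restored = np.frombuffer(blosc.decompress(compressed), dtype=np.float64)
```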
Of the options so far, I find the variable length block header field the most appealing. It's nice to let the plugin format the metadata however it wishes, and headers for blocks with non-custom compression would only gain 2 bytes.
It probably is the most robust as well.
Just so I understand, is the 2 bytes to indicate the length of the header (i.e. up to 64K)? Or something else?
> Just so I understand, is the 2 bytes to indicate the length of the header (i.e. up to 64K)? Or something else?
Yes, if I understand @perrygreenfield's idea correctly. For non-custom compressors it would always be 0x0000.
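To make the idea concrete, a reader might handle that field roughly like this (purely illustrative; the actual layout, byte order, and field width are all still up for discussion):

```python
# Illustrative sketch only: a hypothetical 2-byte metadata-length field
# following the block header, then that many bytes of compressor metadata.
# This is not the current ASDF block layout.
import json
import struct

def read_custom_metadata(fh):
    (meta_len,) = struct.unpack(">H", fh.read(2))  # assumed big-endian uint16
    if meta_len == 0:
        return None  # non-custom compression: no extra metadata present
    return json.loads(fh.read(meta_len).decode("utf-8"))
```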
If the library sees multiple compression plugins installed, how would it determine which one to use? Or would users only be permitted to install one at a time?
I suppose the plugin API could have a method that answers the question, "does this compression metadata belong to you?", and plugins would have to figure out for themselves a reasonable method of distinguishing their own metadata.
That's part of why I was leaning towards a wrapper to numcodecs as the plugin mechanism. It already has the codec ID mechanism to identify the compression algorithm for the data. And it doesn't even have to be an ID that numcodecs natively supports, since one can register arbitrary codecs.
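For example, registering a codec that numcodecs doesn't ship with only requires subclassing its Codec base class; "my-delta" and MyDeltaCodec below are made-up names:

```python
# Sketch: registering an arbitrary codec with numcodecs under a custom ID.
import numpy as np
from numcodecs.abc import Codec
from numcodecs import register_codec, get_codec

class MyDeltaCodec(Codec):
    codec_id = "my-delta"  # becomes the "id" field in the config dict

    def encode(self, buf):
        arr = np.frombuffer(buf, dtype="<i8")
        return np.diff(arr, prepend=np.int64(0)).tobytes()

    def decode(self, buf, out=None):
        return np.cumsum(np.frombuffer(buf, dtype="<i8")).tobytes()

register_codec(MyDeltaCodec)
codec = get_codec({"id": "my-delta"})  # resolved through the registry
```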
ASDF could reimplement equivalent functionality to avoid the numcodecs dependency, but I'm not sure it's worth it, especially if the "dependency" is only a requirement if a cust tag is found (much like lz4 is implemented in ASDF now).
Perhaps the right thing to do is define a clear mechanism for custom binary formats that aren't part of the standard. Nonstandard YAML tags are not a big obstacle, since their contents remain transparent; the same is not true for binary variants. So something in the YAML header would declare which optional binary convention is being used, signaling that software supporting that convention is needed to understand some or all of the binary contents. These compression options would go under that scheme. People who wish to use the format as an archival format would be discouraged from using them unless those binary options move into the standard itself, or unless it is understood within a domain that this is a standard format, etc. If this format is successful, it is going to have to accommodate such variants (IMHO).
I think the general direction of this discussion is good, and supporting custom compression is a good idea. Are there any compression types inside the larger numcodecs set which are worth singling out for inclusion in the standard? (The Python implementation could use numcodecs, but they would also need to be accessible to other languages.)
Hmmm, specifically Pickle, but making such a long laundry list a required part of the standard just seems unreasonable to me. "You don't meet the standard unless you support all of these" is too high a burden. On the other hand, I'm not against allowing additions to the compression options, but it shouldn't be understood that you must support every option to support ASDF. So a good way of documenting which binary options a file uses would help those trying to load it check whether their installed software supports those options.
Another aspect of doing that is being careful not to encode into the file implementation peculiarities that are specific to one language. Generally Python libraries are reasonable, but if they define attributes that are clearly Python-based, then the file should use a different representation, and the Python ASDF package would translate it into what the library understands. I haven't looked at all the JSON forms that numcodecs uses, and it may be that some of them ought to be translated a bit. Or maybe not; I don't know yet without looking.
I was thinking about that too. I've poked around a bit, and the only JSON field that numcodecs mandates is id, which is pretty uncontroversial. The rest of the JSON is determined by each compression algorithm, so in theory, they could bake in some Python-specific details. I haven't seen anything too bad, but one example might be Delta compression, where the dtype field uses Numpy-style <i4, <f8, etc: https://numcodecs.readthedocs.io/en/stable/delta.html
Not particularly damning, as those type codes make sense even outside of a Numpy context. But there could be worse examples lurking.
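For instance, the Delta codec's config looks roughly like this (output shown is approximate):

```python
# Rough illustration: the Delta codec's config uses NumPy-style dtype strings.
import numcodecs

delta = numcodecs.Delta(dtype="<i4")
print(delta.get_config())
# e.g. {'id': 'delta', 'dtype': '<i4', 'astype': '<i4'}
```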
@eslavich and I have been discussing this over the last day, and I think we are starting to converge on how we would like to go ahead with this sort of thing. The idea is along the lines of defining a unique URI for each binary transformation scheme (presumably compression, but perhaps other kinds can be conceived of; one might be encryption, though we don't yet know why one would encrypt a single binary element instead of the whole file). There would be a top-level ASDF attribute specifying any binary transformation requirements needed to interpret the binary blocks in the file (perhaps only one of many blocks needs them). These requirements would be a list of the URIs needed for the transformations. An ASDF loader could then check whether it has software corresponding to those URIs. If not, any binary block that uses one would be returned as an "untransformed" binary block (perhaps the user can find another way of decoding it). In other words, not having the software doesn't prevent the file from being opened, and all other contents remain available.
If the ASDF package does have software corresponding to that URI, it uses it to transform the block. There are still some details to work out, but I favor using a special code in the block header indicating that the binary data has a transform to apply (@eslavich points out that it may be a little more involved if we want to apply a series of transforms, so what I'm describing here may be too simple, or maybe not, now that I think about it). The next word in the block data would indicate the length of the internal header, in JSON or YAML format (TBD), which lists the URIs of the transforms to apply, in the order they are listed. The information associated with each URI is the metadata it will use to do the transform.
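A rough sketch of how a reader might process such an internal header (everything here is illustrative; the encoding, word size, and even whether it's JSON or YAML are still to be decided):

```python
# Illustrative only: parsing a hypothetical internal header that lists the
# transform URIs (with their metadata) to apply, in the order listed.
import json
import struct

# Hypothetical mapping from transform URIs to decoder callables supplied
# by whatever optional software the reader has installed.
DECODERS = {}

def apply_transforms(block_data: bytes) -> bytes:
    (header_len,) = struct.unpack(">I", block_data[:4])  # assumed 4-byte length word
    header = json.loads(block_data[4:4 + header_len])
    payload = block_data[4 + header_len:]

    for entry in header["transforms"]:
        decoder = DECODERS.get(entry["uri"])
        if decoder is None:
            # Unrecognized URI: hand back the untransformed bytes rather than fail.
            return payload
        payload = decoder(payload, entry.get("metadata", {}))
    return payload
```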
The object referencing the binary data is ignorant of the transform applied; it only uses the transformed data that is returned. E.g., numpy arrays can use any number of these custom transforms, perhaps different ones in different blocks, but are unaware of which were used.
We would like to keep these compression options out of the minimal standard requirements. This allows us, or anyone, to supply optional compression algorithms without requiring the standard to adopt them. If one becomes widely adopted, it can be endorsed by the standard, but as an option; that is for later, though. So the specific compression algorithms don't need to be in the standard V2.0, but the mechanism for supporting them does. This way we can supply support for nearly all of the numcodecs options.
Thoughts?
More work on the Python interface for specifying the compression to be applied when creating a binary entity is needed, but that should be fairly straightforward. I'll be taking off most of this afternoon...