cbor-js
cbor-js copied to clipboard
Pluggable encoders and decoders for tagged values
There is a draft RFC to be published soon which adds support for packed/typed arrays in CBOR, see the current version. As one of the original CBOR authors is also an author of this proposal I'm sure it will gain widespread adoption in encoders and decoders, especially because it is such a useful feature and will help greatly when big packed arrays need to be efficiently encoded and also decoded both in terms of space usage and speed.
I thought it would be good putting it on the radar for implementation. I think it will be rather straightforward to implement. I currently don't have time for it myself in the next 3-4 weeks but I certainly will help where I can, reviewing code etc.
Some notes for an implementation:
- I would only encode actual JS typed arrays as CBOR typed arrays. Scanning regular arrays and checking if they are suitable would be too time-consuming.
- For decoding you may argue that you really only want JSON compatible JS objects (without typed arrays), so there may be an option for decoding where you can switch to the typed-array mode which means that CBOR typed arrays are directly mapped to efficient JS typed arrays.
ATM cbor.js ignores all tagged vales and lets the user handle the tags. IANA lists many other tags which are not support by this library. Do we really want to implement them all? If not, who defines the "good" subset? I'm not against supporting tags, but I would see a more flexible solution for supporting tags than putting everything into the core code. Do you have any good ideas?
Ok, I understand your concerns and I agree it may make more sense to have a modular approach, especially since this is a JS implementation and the final code size should be minimal in regards to what someone actually needs in functionality. I would even say that the decoder should be separate from the encoder, since in certain cases you only need one of both. For example, you could create two new packages cbor-js-encode and cbor-js-decode, and the current cbor-js package becomes a meta/convenience package which just includes both as dependencies and reexports encode/decode.
As for the tags, I see that the decode() function has a tagger
argument which accepts the value and the tag number. So, let's assume I create my own package cbor-js-decode-typedarrays which supplies such a tagger function for the typed array tags (and which just returns the original value for unknown tags). It seems quite easy to use, just import both packages and invoke decode with the typed array tagger. So far so good, but what if other taggers should be used as well? How to combine them easily? I think to solve this, each tagger should advertise the tags it supports. Then you could accept an array of such taggers in decode(val, [tagger1,tagger2]) and internally create a map from tag number to tagger for efficient access. If many such taggers are used there is probably some overhead, so it may be good to also allow some pre-computed map as argument, like var taggers = Taggers([tagger1, tagger2])
and then decode(val, taggers)
. I think a tagger could then look something like:
var TypedArrayTagger = function(val, tag) {
return ... // as typed array of correct type
}
TypedArrayTagger.tags = [];
for (var i = 64; i <= 87; i++) {
TypedArrayTagger.tags.push(i);
}
I think I would be happy with that. And if someone at one point wants to create a super-cbor library, he can just import all taggers that exist, and proxy the decode function so that the taggers are supplied implicitly.
What about encoding? My personal use case doesn't involve encoding on JS side so I'm less interested in that. I think it's ok to leave that out for now as it looks like it can be cleanly added in the future if there is some need.
What do you think?
What would be the benefit of splitting the cbor package into meta+encode+decode?
What would you think about a decoder-registry:
var typedArrayTags = [];
for (var i = 64; i <= 87; i++) {
typedArrayTags.push(i);
}
CBOR.registerDecoder(function(val, tag {
return ... // as typed array of correct type
}, typedArrayTags);
an alternative call of the function could directly accept your TypedArrayTagger
function (and uses the tags
property):
CBOR.registerDecoder(TypedArrayTagger)
If the TypedArrayTagger
comes from it's own package it should allow easy code reuse:
var cbor = require('cbor');
cbor.registerDecoder(require('cbor-typedarraytagger'));
cbor.decode(...);
About the encoder: I thought about using an object for that:
var TaggedValue = CBOR.TaggedValue;
CBOR.encode({
normalValue: "example",
taggedValue: new TaggedValue(32, "http://example.com") // URI; see Section 2.4.4.3 in RFC7049
});
Splitting: If your web application only needs to decode, then why should it include the encoder as well? With separate packages you could only require what you need.
As for the registry, I'm slightly against this, as it means to have global state and complicates things like testing (if you register a decoder in one test, it will appear in another test if you don't clean it up somehow). It also prevents parallel testing of different versions of a decoder (for the same tags). But I absolutely see the value of being able to have a nice interface (decode(val)) without much effort. I guess an alternative would be to make the CBOR object a class:
var cbor = new require('cbor')();
var obj = cbor.decode(v);
Or with some taggers registered:
var cbor = new require('cbor')({
'decoders': [require('cbor-typedarraytagger'), ...]
});
var obj = cbor.decode(v);
For encoding, that's possible, but not very straightforward if I think of the many typed array tags. Then I would need some extra machinery which transforms the typed arrays into such TaggedValue objects. It would be nice to have something similar as for decoders, but it may be hard to do.
Splitting: IMHO the overhead of downloading 2 or 3 files compared to one when you need encoder and decoder is more relevant than the file size (4KB in the minified version).
Global state: I completely agree, but how often do you really encounter this problems in practice? IMHO it's more a theoretical problem, since real parallel testing requires more than one interpreter (since JS is singe threaded by design) and then you can have also two global objects. IMHO having an easy API is more important than supporting rare cases. And if you really need it you could always create a copy of the CBOR object before registering the decoders. We could also think about having a CBOR.registry
object with registerDecoder
function, which could than passed to the CBOR.encode
and CBOR.decode
functions as additional optional argument.
Encoder: Thanks for pointing that out. What do you think about registering the constructors:
CBOR.registerEncoder(Uint8Array, function(val) {
return {
tag: 64,
value: new Uint8Array(val.buffer) // this Uint8Array makes it a CBOR bytestring
};
});
Splitting: IMHO the overhead of downloading 2 or 3 files compared to one when you need encoder and decoder is more relevant than the file size (4KB in the minified version).
Right, what I had in mind is that there would be a build step for the meta library which then produces a single merged and minified file for each release. Given given the small size of 4kb... you're right, let's not overcomplicate things.
Global state: I completely agree, but how often do you really encounter this problems in practice? IMHO it's more a theoretical problem, since real parallel testing requires more than one interpreter (since JS is singe threaded by design) and then you can have also two global objects. IMHO having an easy API is more important than supporting rare cases. And if you really need it you could always create a copy of the CBOR object before registering the decoders. We could also think about having a |CBOR.registry|object with |registerDecoder| function, which could than passed to the |CBOR.encode| and |CBOR.decode| functions as additional optional argument.
What about the other way around? By default, the CBOR.encode and decode functions could use the global registry. And if you don't want that, you can supply your own registry as optional parameter. Maybe that's what you actually meant?
Encoder: Thanks for pointing that out. What do you think about registering the constructors:
Interesting idea. I think it's a good start. I can't think of any serious issues with that.
What about the other way around? By default, the CBOR.encode and decode functions could use the global registry. And if you don't want that, you can supply your own registry as optional parameter. Maybe that's what you actually meant?
Yes, the global registry should be used as default if not passes as argument.
Cool, so we came to a common agreement :) I will start creating a decoder package for the typed arrays soon.
I started writing a typedarray decoder, based on the decoder object format we discussed: https://github.com/neothemachine/cborjs-typedarrays-decoder It's not done yet and needs testing. When do you think you can integrate the registerDecoder function?
I think I have a working implementation for the decoder now. I found it strange to work directly with a function that has a tags property attached to it. It doesn't reflect any common model to do these things. Since we don't need state, I propose to change it to a simple object with tags
and decode
keys, where tags
is the array of supported tags, and decode
is the decode function.
Sorry for the long delay, but finally I did a first implementation of it in the registerDecoder branch. If you agree on the API, I'll add the missing test and publish a new version.
Cool. Looks good, just one thing missing. Remember we talked about default registries and optionally supplying a custom registry to encode() decode()? I think it is easy to extend your code for that. Basically rename the registries to defaultEncoderRegistry and defaultDecoderRegistry and then add another argument to encode() and decode(). I guess to make it consistent the register and unregister functions should also have a registry parameter which defaults to the default registries.
In decode(), what is the second parameter "simpleValue" good for? Looks like some fallback for unknown "additional information", but I don't see that this is defined in the CBOR spec really. I think this should be removed. Or am I missing something here?
I implemented the UNregister functions, for doing the testing stuff you wrote about. Can you tell me more about the use-case for different registries? To me it seams strange that the same tagged value can result in different decoded values in the same project.
CBOR does not use all possible values for simpleValues (e.g. true, false) for now, but a CBOR value could contain a (not yet) defined value. Instead of directly throwing an error, i call this simpleValue function, where users can throw their own value (if they want to).
Ok, let's leave it as it is. It's not a problem, probably just me not liking global stuff again :p
me neither, but thinking about an additional registry parameter seams even worse to me, but maybe i find a nice solution when finishing the code and documentation. Does it fit your requirements now? Do you think the two different functions are ok, or do we need a registerEncoderAndDecoder() too?
No I think the two functions are fine. I think in most cases you don't need both encoding and decoding anyway, and if someone needs it it's not a big deal. From my side it's fine.
Ping. How are you getting on here? Do you need some help?
Wondering the state of this. Has the tags supported been added?
If we have to encode / decode with custom tags, how to specify the tag handlers?