Schema frameworks for LDMS store for Kafka
There has been discussion on recent calls about which schema framework to use in the LDMS store that would write data to Kafka.
Kafka (via Confluent) comes with support for three schema frameworks: Avro, Protocol Buffers, and JSON Schema. Avro was the first to be supported and has been in the Kafka world the longest.
There are a number of pros and cons to each, but here are some differences that I think are worth considering:
Data types
Avro does not support unsigned integers; it only has signed "int" and "long" data types. While Avro supports aliases, I am not sure that they are useful to us in practice. For instance, one might create a named type "Unsigned64" that is effectively an 8-byte "fixed"/"bytes" value, but on the decoding side the type is still just an array of bytes, and existing Kafka pipeline code will not know how to interpret it.
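To make the limitation concrete, here is a minimal sketch (plain C, no Avro API) of the sign-cast round trip that a producer and a consumer would have to agree on if unsigned values were carried in Avro's signed "long":

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* A counter whose value exceeds INT64_MAX. */
    uint64_t raw = 0xFFFFFFFFFFFFFFF0ULL;

    /* Producer side: Avro "long" is signed 64-bit, so reinterpret the bits
     * (well defined on the two's-complement platforms LDMS runs on). */
    int64_t as_avro_long = (int64_t)raw;

    /* Consumer side: cast back to recover the original unsigned value.
     * Any consumer that does NOT do this sees a large negative number. */
    uint64_t recovered = (uint64_t)as_avro_long;

    printf("stored as long: %" PRId64 "\n", as_avro_long);
    printf("recovered:      %" PRIu64 "\n", recovered);
    return 0;
}
```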
JSON Schema may have a different issue with types. In JSON Schema, all numbers are just of type "number", an arbitrary-precision decimal value. While that certainly accommodates the values stored in C uint32 and uint64 variables, it is not clear that it gives a Kafka Connect sink enough information to know which type to use when storing a number in a database.
Protocol Buffers seem to have the greatest variety of standard types, including uint32 and uint64.
Schema evolution
All three of the schema frameworks support schema evolution. Protocol Buffers arguably does this best, since fields in its schemas are entirely optional. The newest Protocol Buffers schema can also deserialize messages that were written with older versions of the schema, as long as the changes in the newer schema are backwards compatible.
With Avro, the exact version of the schema that wrote a message is needed to deserialize it. Schema compatibility is achieved by following a set of rules. Since fields are not optional, new or deprecated fields must have default values in newer schemas to maintain compatibility.
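For illustration, here is a sketch of that rule using the avro-c library: the newer (reader) schema adds a field with a default value, so records written with the older schema can still be resolved. The record and field names below are invented for the example, and the actual schema-resolution step is omitted:

```c
#include <avro.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Older (writer) schema: two fields. */
    const char *v1 =
        "{\"type\":\"record\",\"name\":\"meminfo\",\"fields\":["
        "{\"name\":\"job_id\",\"type\":\"long\"},"
        "{\"name\":\"MemFree\",\"type\":\"long\"}]}";

    /* Newer (reader) schema: adds MemAvailable with a default, so data
     * written with v1 remains readable (backwards compatible). */
    const char *v2 =
        "{\"type\":\"record\",\"name\":\"meminfo\",\"fields\":["
        "{\"name\":\"job_id\",\"type\":\"long\"},"
        "{\"name\":\"MemFree\",\"type\":\"long\"},"
        "{\"name\":\"MemAvailable\",\"type\":\"long\",\"default\":0}]}";

    avro_schema_t writer_schema, reader_schema;
    avro_schema_error_t err;

    if (avro_schema_from_json(v1, (int32_t)strlen(v1), &writer_schema, &err) ||
        avro_schema_from_json(v2, (int32_t)strlen(v2), &reader_schema, &err)) {
        fprintf(stderr, "schema parse error: %s\n", avro_strerror());
        return 1;
    }

    /* Reading v1-encoded data with the v2 schema would use avro-c's
     * resolved reader/writer machinery; omitted here for brevity. */
    avro_schema_decref(writer_schema);
    avro_schema_decref(reader_schema);
    return 0;
}
```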
Supported languages
Protocol Buffers, at least in the standard libraries, lacks support for non-object-oriented languages like C. This may not be a deal breaker, because we can likely use C++ in the LDMS store that implements a Kafka producer.
Avro supports C (as well as C++ and others).
Run-Time Generated Schemas
LDMS schemas are generated at run time, and the schemas are influenced by configuration in various places in LDMS. For instance, both sampler and store configuration may change the contents of an LDMS schema.
One advantage of Avro is that no code generation is needed to create and employ schemas.
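For example, the avro-c API can assemble a record schema entirely at run time, much as a store plugin could after inspecting an LDMS set. This is only a sketch under that assumption; the schema and metric names are placeholders, and error handling and field-schema reference cleanup are omitted:

```c
#include <avro.h>

/* Build an Avro record schema at run time from a list of metric names,
 * the way a store plugin could after walking the metrics in an LDMS set. */
static avro_schema_t build_schema(const char *name,
                                  const char **metrics, int n)
{
    avro_schema_t rec = avro_schema_record(name, "ldms");
    for (int i = 0; i < n; i++)
        avro_schema_record_field_append(rec, metrics[i], avro_schema_long());
    return rec;
}

int main(void)
{
    const char *metrics[] = { "job_id", "MemFree", "MemAvailable" };
    avro_schema_t schema = build_schema("meminfo", metrics, 3);

    /* The schema could now be registered with a schema registry and used
     * to serialize values, with no generated code involved. */
    avro_schema_decref(schema);
    return 0;
}
```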
Protocol Buffers is aimed at schemas that are determined at compile time. The schema is described in a .proto file, and then the protoc compiler is used to generate, for instance, C++ code for the schema. It looks like the class generated by protoc will have method names that contain the names of the fields in the schema.
So despite its nice features, the compile-time nature of schemas in Protocol Buffers may make them incompatible with LDMS.
Binary size notes
The size of the LDMS binary representation of data versus other formats like Avro also came up in recent discussions. It is interesting to note that Avro and Protocol Buffers both use variable-length zig-zag encoding of numbers, which means that they can actually be much smaller in practice than the full LDMS metric set (meaning data plus metadata). But of course, LDMS only moves metadata over the wire infrequently, so that is not an apples-to-apples comparison.
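For reference, the zig-zag step that both Avro and Protocol Buffers apply before their variable-length (varint) encoding maps small-magnitude signed values to small unsigned values, so a metric that hovers near zero serializes in one or two bytes. A minimal sketch of the 64-bit version:

```c
#include <stdint.h>

/* Zig-zag encoding: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
 * Values near zero (positive or negative) become small unsigned values,
 * which the subsequent varint step then stores in only a few bytes. */
static uint64_t zigzag_encode64(int64_t n)
{
    return ((uint64_t)n << 1) ^ (uint64_t)(n >> 63);
}

static int64_t zigzag_decode64(uint64_t z)
{
    return (int64_t)(z >> 1) ^ -(int64_t)(z & 1);
}
```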
@morrone great summary, thanks Chris.
My current opinion is that Avro is our best option. Protocol Buffers would have been the most attractive to me, but since we would need to know our schemas at compile time with Protocol Buffers, that just won't work for us. JSON Schema seems to be even less specific about number types than Avro.
The main issue with Avro is going to be the lack of support for unsigned numbers. As a first pass, we would simply use signed types in Avro.
We might also be able to use the Avro "fixed" type to store unsigned values in a field with a fixed number of bytes. But I'm not sure that implementations would know how to interpret those values.
We could perhaps make our own logical types for unsigned values. In that case, the consumers of our data will almost certainly need to have customizations to be able to interpret our custom logical types. I don't know how much effort would be needed to teach, for instance, Kafka sink connectors how to convert our logical types into values that their sink understands.
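As a sketch of what such a custom annotation could look like, the field would be declared as an 8-byte "fixed" carrying a "logicalType" attribute. The logical type name "uint64" and the field names here are our own invention, not something existing connectors understand; per the Avro spec, implementations that do not recognize a logical type fall back to the underlying type:

```c
#include <avro.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* A record with one 8-byte fixed field annotated with a made-up
     * "uint64" logical type.  Consumers that do not recognize the
     * annotation see only raw fixed bytes, which is exactly the
     * interoperability concern discussed above. */
    const char *json =
        "{\"type\":\"record\",\"name\":\"meminfo\",\"fields\":["
        "{\"name\":\"MemTotal\",\"type\":"
        "{\"type\":\"fixed\",\"name\":\"Unsigned64\",\"size\":8,"
        "\"logicalType\":\"uint64\"}}]}";

    avro_schema_t schema;
    avro_schema_error_t err;
    if (avro_schema_from_json(json, (int32_t)strlen(json), &schema, &err)) {
        fprintf(stderr, "schema parse error: %s\n", avro_strerror());
        return 1;
    }
    avro_schema_decref(schema);
    return 0;
}
```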
We should avoid telling people to use incorrect range mappings that will bite them at end-user time.
I haven't found JSON float to work adequately for epoch_sec+usec or epoch_sec+nsec timestamps, nor JSON int to work for uint64_t, nor JSON float to handle bigger float/complex values (yes, bigger occurs in quad-precision science apps). These need to be quoted strings instead. Yes, the JSON spec is suitably vague, but implementations impose precision limits.
Invariably we end up needing mapping-information tuples, [src_name, [json/avro_type(s)], dest_name, dest_type, conversion_func_name], to satisfy both science users' expectations and the backend storage. In the case of quad or 80-bit precision, one is very tempted to just store the value as a string and never convert it, but eventually a user will be unhappy.
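Purely as an illustration of that tuple (every name below is hypothetical), such a mapping table could look like:

```c
/* One row of the mapping table described above; purely illustrative. */
struct metric_mapping {
    const char *src_name;    /* LDMS metric name */
    const char *wire_type;   /* JSON/Avro type used on the wire */
    const char *dest_name;   /* column/field name in the backend */
    const char *dest_type;   /* backend storage type */
    const char *conversion;  /* name of the conversion function */
};

static const struct metric_mapping mappings[] = {
    { "timestamp", "string", "ts",      "TIMESTAMP(9)", "str_to_ns_epoch" },
    { "MemFree",   "long",   "memfree", "BIGINT",       "identity"        },
};
```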
We have an Avro Kafka store now, so we can probably set this discussion aside.