apicurio-registry
Support avro single object encoding / resolve schema by fingerprint
Avro supports single-object encoding, where the payload is stored in a byte array that is prefixed with a schema fingerprint. This adds very little overhead while still keeping the schema information with the serialized bytes. The fingerprint is a long (64-bit) id that is derived directly from the schema using a hash, so it changes with every schema update.
When you deserialize such a byte array, you need to read the fingerprintId and use a SchemaStore (aka the registry) to load the exact schema version that was used to write the data.
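The read path described above can be sketched as follows. Per the Avro specification, a single-object message starts with the two marker bytes `C3 01`, followed by the 8-byte little-endian fingerprint and then the binary-encoded payload. `InMemorySchemaStore` and its methods are hypothetical stand-ins for the registry lookup, not an existing Avro or Apicurio API, and the fingerprint value in the example is arbitrary.

```python
import struct
from typing import Dict, Optional, Tuple

MARKER = b"\xc3\x01"   # single-object encoding marker, format version 1
HEADER_LEN = 10        # 2 marker bytes + 8 fingerprint bytes


def split_single_object(message: bytes) -> Tuple[int, bytes]:
    """Return (unsigned 64-bit fingerprint, Avro binary payload)."""
    if len(message) < HEADER_LEN or message[:2] != MARKER:
        raise ValueError("not an Avro single-object encoded message")
    (fingerprint,) = struct.unpack_from("<Q", message, 2)  # little-endian
    return fingerprint, message[HEADER_LEN:]


class InMemorySchemaStore:
    """Hypothetical schema store keyed by fingerprint (stand-in for a registry)."""

    def __init__(self) -> None:
        self._by_fingerprint: Dict[int, str] = {}

    def register(self, fingerprint: int, schema_json: str) -> None:
        self._by_fingerprint[fingerprint] = schema_json

    def find_by_fingerprint(self, fingerprint: int) -> Optional[str]:
        return self._by_fingerprint.get(fingerprint)


# Build a fake message with an arbitrary fingerprint and a one-byte payload.
example_fp = 0x0123456789ABCDEF
message = MARKER + struct.pack("<Q", example_fp) + b"\x02"

store = InMemorySchemaStore()
store.register(example_fp, '"int"')

fp, payload = split_single_object(message)
writer_schema = store.find_by_fingerprint(fp)
print(hex(fp), payload, writer_schema)
```

The only thing the reader learns from the message itself is the fingerprint, so the store has to support a direct lookup by that value.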
I want to use the Apicurio schema registry, but it is hard to query efficiently by fingerprintId. The fingerprint cannot be used as the artifactId, because it changes with every schema change and so would not allow versioning. The workaround would be to manually update the metadata properties with an additional field and use this for filtering.
Therefore, it would be a great addition to this registry if, whenever an Avro schema is created or updated, the fingerprintId could be calculated and stored in an easily accessible way.
see also this thread
@Apicurio/developers WDYT?
A question that jumps to my mind on this: when are you calculating the fingerprintId? We already fingerprint all artifacts added to the registry. We create two hashes, a content-hash and a canonical content-hash. Either one can be used to efficiently look up the schema content. However, it is not currently possible to customize the algorithm. Right now we're just using a SHA-256 (or maybe it's SHA-512, I would need to check) hash of the content and another hash of the canonicalized content.
If the hash we automatically generate is not useful (because you want to use the Avro one), perhaps the solution is to allow customizing the hashing algorithm we use...
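As a rough illustration of the difference between the two hashes mentioned above, the sketch below hashes the raw content and a canonicalized form of it. SHA-256 is an assumption (the comment above leaves the exact algorithm open), and the canonicalization here (parse the JSON, sort keys, strip whitespace) is a simple stand-in that may differ from what the registry actually does.

```python
import hashlib
import json


def content_hash(content: str) -> str:
    """Hash of the content exactly as submitted (assumed SHA-256)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def canonical_content_hash(content: str) -> str:
    """Hash of a canonicalized form; this canonicalization is a stand-in."""
    canonical = json.dumps(json.loads(content), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same schema, formatted in two different ways.
pretty = '{\n  "type": "record",\n  "name": "User",\n  "fields": []\n}'
compact = '{"type":"record","name":"User","fields":[]}'

print(content_hash(pretty) == content_hash(compact))
print(canonical_content_hash(pretty) == canonical_content_hash(compact))
```

Neither hash matches the 8-byte fingerprint written into a single-object message, which is why a reader that only has the Avro fingerprint cannot use them for lookup as they stand.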
Yes, that would be an option. I saw that the registry supports a kind of plugin mechanism, so it could be sufficient to provide a hashing plugin that uses the avro-default-fingerprint function. This should be used whenever a schema of type avro is created/updated. It must be possible to request the schema directly by this hash, so we would need a top-level id in the storage/persistence layer and in the query API.
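One possible shape for this idea, sketched as a small in-memory model: a hasher can be configured per artifact type, and every stored version is indexed by its hash so that it can be fetched directly. Everything here (`Registry`, `set_hasher`, `add_version`, `find_by_hash`) is hypothetical and is not the Apicurio plugin or storage API; `fake_avro_fingerprint` is only a placeholder for the real Avro fingerprint function.

```python
import hashlib
from typing import Callable, Dict, Optional, Tuple

Hasher = Callable[[bytes], str]


def sha256_hex(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def fake_avro_fingerprint(content: bytes) -> str:
    # Placeholder only: a real plugin would compute the Avro fingerprint.
    return "avro-" + hashlib.md5(content).hexdigest()[:16]


class Registry:
    """Hypothetical in-memory registry with a pluggable hash per artifact type."""

    def __init__(self, default_hasher: Hasher = sha256_hex) -> None:
        self._default = default_hasher
        self._hashers: Dict[str, Hasher] = {}
        self._versions: Dict[Tuple[str, int], bytes] = {}
        self._by_hash: Dict[str, Tuple[str, int]] = {}  # top-level lookup index

    def set_hasher(self, artifact_type: str, hasher: Hasher) -> None:
        self._hashers[artifact_type] = hasher

    def add_version(self, artifact_id: str, artifact_type: str, content: bytes) -> int:
        # The artifactId stays stable; only the version and the hash change.
        version = 1 + sum(1 for (aid, _) in self._versions if aid == artifact_id)
        self._versions[(artifact_id, version)] = content
        hasher = self._hashers.get(artifact_type, self._default)
        self._by_hash[hasher(content)] = (artifact_id, version)
        return version

    def find_by_hash(self, content_hash: str) -> Optional[Tuple[str, int]]:
        return self._by_hash.get(content_hash)


registry = Registry()
registry.set_hasher("AVRO", fake_avro_fingerprint)
v1 = registry.add_version("user", "AVRO", b'"int"')
v2 = registry.add_version("user", "AVRO", b'"long"')
print(registry.find_by_hash(fake_avro_fingerprint(b'"long"')))
```

Keeping the hash as a separate index, rather than as the artifactId, preserves versioning while still allowing a single direct lookup per fingerprint.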
There is also a custom metadata properties feature on artifacts and artifact versions, which could be used in a search query. It may not be very efficient to use the search endpoint, but it can be a workaround until we figure out something better.
I don't have a strong opinion, but the best solution is probably to allow customizing the hashing algorithm. I also agree with Jakub that adding the custom hash as a property could serve as a temporary workaround.