Using other string types besides UTF8.class

Open oleg-smith opened this issue 2 years ago • 19 comments

oleg-smith avatar Oct 17 '22 21:10 oleg-smith

@FelixGV could you please review

oleg-smith avatar Oct 20 '22 04:10 oleg-smith

I see this adds "defaultStringClass", but where is it used? Is it intended to be used by subclasses? Or in a follow-up PR?

radai-rosenblatt avatar Oct 20 '22 05:10 radai-rosenblatt

It's intended to be used in client code, like this:

FastSpecificDatumReader<T> reader = new FastSpecificDatumReader<>(
    writerSchema,
    readerSchema,
    new FastSerdeCache(String.class)
);

oleg-smith avatar Oct 20 '22 05:10 oleg-smith

@oleg-smith Can this be done by adding string annotations to the reader schema instead of updating the serde since fast-avro is trying to align with the standard avro implementation?

gaojieliu avatar Oct 20 '22 05:10 gaojieliu

@gaojieliu can you give an example of how to instruct the generator to use String.class this way?

oleg-smith avatar Oct 20 '22 05:10 oleg-smith

@oleg-smith If you want to use the Java String type for regular string fields, you can add an annotation like this:

"type": {
    "type": "string",
    "avro.java.string": "String"
}

If you want to use the Java String type for deserialized map keys:

"type": {
    "type": "map",
    "values": "long",
    "java-key-class": "java.lang.String", // required by specific deserializer
    "avro.java.string": "String" // required by generic deserializer
},

Of course, you need Avro 1.7+ to take advantage of these annotations.

gaojieliu avatar Oct 20 '22 16:10 gaojieliu

@gaojieliu I meant, how do I do it from Java code? I don't have access to the schema file.

oleg-smith avatar Nov 09 '22 16:11 oleg-smith

And how do I do it for all String fields at once?

oleg-smith avatar Nov 09 '22 17:11 oleg-smith

@oleg-smith The fast-avro class is generated from the schema string/file, so how come you don't have access to the schema file? Inside the application, before generating the fast classes, I think you could traverse the schema and add the annotation to all the string types, so that the fast-avro logic stays aligned with vanilla Avro. Maybe we could build a utility in fast-avro that adds annotations to all the string types of a schema.

@radai-rosenblatt What are your thoughts on this?

gaojieliu avatar Nov 09 '22 17:11 gaojieliu

Because it comes from an external jar as an already-generated Avro schema.

oleg-smith avatar Nov 09 '22 17:11 oleg-smith

Also, is there a way to make deserializer generation synchronous? It's a bit nondeterministic right now.

oleg-smith avatar Nov 09 '22 17:11 oleg-smith

@oleg-smith https://github.com/linkedin/avro-util/blob/master/avro-fastserde/src/main/java/com/linkedin/avro/fastserde/FastSerdeCache.java#L355 Here is the method to generate a fast specific class synchronously.

Regarding support for overriding the specific CharSequence implementation in fast-avro: I think a utility that adds the right string annotation could be the way to go. The annotated schema can be used by vanilla Avro as well, so if fast-avro has a bug, you can still fall back to vanilla Avro with the same outcome.

gaojieliu avatar Nov 09 '22 17:11 gaojieliu

Can this method be used in FastSerdeCache?

oleg-smith avatar Nov 09 '22 18:11 oleg-smith

yeah, it is a public method in FastSerdeCache.

gaojieliu avatar Nov 09 '22 18:11 gaojieliu

a config would still be needed - because some folks would like to ignore hints on schemas even if you plan on supporting those hints

radai-rosenblatt avatar Nov 09 '22 18:11 radai-rosenblatt

@gaojieliu I mean is there a way to build it and put it into the cache synchronously?

looks like currently it's only async

oleg-smith avatar Nov 09 '22 18:11 oleg-smith

to clarify what i posted above: even if fast-avro respected "logical types" (which those string hints are), some users (me) would like to be able to override them to get consistent behaviour across all schemas in my codebase at runtime.

so i think making this a config option to fast-avro at runtime (as originally suggested) is a better approach, and definitely needs to happen before fast-avro starts respecting logical types :-)

also - this means that an implementation is required, which is not part of this PR?
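To make the precedence argued for above concrete, here is a minimal sketch, assuming a runtime override should win over a schema hint, and that the default with neither present is Avro's Utf8. The class and method names are hypothetical and are not part of fast-avro; class names are returned as plain strings so the sketch has no Avro dependency.

```java
import java.util.Optional;

// Hypothetical helper, not fast-avro code: decides which Java class a
// deserialized Avro string should use.
final class StringTypeResolver {
    // Assumed default when nothing else is specified.
    static final String DEFAULT = "org.apache.avro.util.Utf8";

    // runtimeOverride: class name configured by the application, if any.
    // schemaHint: value of the "avro.java.string" property, or null if absent.
    static String resolve(Optional<String> runtimeOverride, String schemaHint) {
        if (runtimeOverride.isPresent()) {
            return runtimeOverride.get();   // config beats schema hints
        }
        if ("String".equals(schemaHint)) {
            return "java.lang.String";      // hint from the schema
        }
        return DEFAULT;                     // no override, no usable hint
    }
}
```

With this ordering, a codebase that sets the override gets the same string type for every schema, whether or not individual schemas carry hints.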

radai-rosenblatt avatar Nov 09 '22 19:11 radai-rosenblatt

I see. Such a method doesn't exist today. Feel free to add it to FastSerdeCache; it should be fairly straightforward.
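For illustration only, the difference between the asynchronous behaviour discussed above and a blocking "build and cache" call could look roughly like this. This is a generic sketch and not FastSerdeCache code: the class, its methods, and the idea of returning a fallback value while the build runs in the background are all assumptions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache, not part of fast-avro.
final class DeserializerCacheSketch<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<K, CompletableFuture<Void>> inFlight = new ConcurrentHashMap<>();
    private final Function<K, V> builder; // stands in for code generation + compilation
    private final V fallback;             // stands in for a slower, always-available implementation

    DeserializerCacheSketch(Function<K, V> builder, V fallback) {
        this.builder = builder;
        this.fallback = fallback;
    }

    // Async flavour: never blocks. Callers get the fallback until the
    // background build has finished, so early results depend on timing.
    V getAsync(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        // Start at most one background build per key.
        inFlight.computeIfAbsent(key, k ->
            CompletableFuture.runAsync(() -> cache.put(k, builder.apply(k))));
        return fallback;
    }

    // Sync flavour: blocks the caller until the value is built and cached.
    V getSync(K key) {
        return cache.computeIfAbsent(key, builder);
    }
}
```

A caller that needs deterministic behaviour would use the blocking variant once at startup and the cached value afterwards.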

gaojieliu avatar Nov 09 '22 19:11 gaojieliu

Regarding support for overriding the specific CharSequence implementation in fast-avro: I think a utility that adds the right string annotation could be the way to go. The annotated schema can be used by vanilla Avro as well, so if fast-avro has a bug, you can still fall back to vanilla Avro with the same outcome.

By this, do you mean that we would provide a "schema processing" utility, where the input is a schema containing string fields, and the output would be the same schema where all string fields were overridden to be of a type defined by the user?
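A rough sketch of what such a utility might look like, assuming the schema JSON has already been parsed into nested java.util.Map / java.util.List values (for example by a JSON library). StringAnnotator and its methods are hypothetical names, not an existing avro-util API. It returns an annotated copy, adds "avro.java.string": "String" to every string type, and also tags maps as in the map example earlier in this thread. It does not add "java-key-class".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical utility, not part of avro-util.
final class StringAnnotator {
    static final String PROP = "avro.java.string";

    // Returns an annotated copy of a parsed schema node; the input is not modified.
    static Object annotate(Object node) {
        if (node instanceof String) {
            if (!"string".equals(node)) {
                return node; // other primitive or a named-type reference
            }
            // Expand the bare "string" shorthand so it can carry the property.
            Map<String, Object> expanded = new LinkedHashMap<>();
            expanded.put("type", "string");
            expanded.put(PROP, "String");
            return expanded;
        }
        if (node instanceof List) { // union: annotate each branch
            List<Object> out = new ArrayList<>();
            for (Object branch : (List<?>) node) {
                out.add(annotate(branch));
            }
            return out;
        }
        if (node instanceof Map) {
            Map<String, Object> out = copy((Map<?, ?>) node);
            Object type = out.get("type");
            if ("string".equals(type) || "map".equals(type)) out.put(PROP, "String");
            if ("map".equals(type)) out.put("values", annotate(out.get("values")));
            if ("array".equals(type)) out.put("items", annotate(out.get("items")));
            if ("record".equals(type)) out.put("fields", annotateFields(out.get("fields")));
            return out;
        }
        return node;
    }

    // Each record field is an object whose "type" entry holds a schema.
    private static Object annotateFields(Object fields) {
        if (!(fields instanceof List)) {
            return fields;
        }
        List<Object> out = new ArrayList<>();
        for (Object f : (List<?>) fields) {
            if (f instanceof Map) {
                Map<String, Object> field = copy((Map<?, ?>) f);
                field.put("type", annotate(field.get("type")));
                out.add(field);
            } else {
                out.add(f);
            }
        }
        return out;
    }

    private static Map<String, Object> copy(Map<?, ?> in) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<?, ?> e : in.entrySet()) {
            out.put(String.valueOf(e.getKey()), e.getValue());
        }
        return out;
    }
}
```

The annotated structure would then be serialized back to JSON and parsed as a schema, so the same annotated schema could be handed to either fast-avro or vanilla Avro.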

FelixGV avatar Nov 09 '22 20:11 FelixGV