avro icon indicating copy to clipboard operation
avro copied to clipboard

AVRO-3646: serde for enum with mixed variants

Open martin-g opened this issue 2 years ago • 21 comments

What is the purpose of the change

  • Adds unit tests for AVRO-3646

Verifying this change

This change just adds unit tests

Documentation

  • Does this pull request introduce a new feature? no

martin-g avatar Oct 20 '22 09:10 martin-g

@LucasJavaudin Could you please review the new unit tests ? Thanks!

martin-g avatar Oct 20 '22 09:10 martin-g

Converted to Draft because as the JIRA ticket explains the produced Values cannot be represented using an Avro Schema. The serialization should produce a Value::Union that wraps each of the currently produced Values.

martin-g avatar Oct 21 '22 06:10 martin-g

@LucasJavaudin Could you please review the new unit tests ? Thanks!

Yes, this is good!

LucasJavaudin avatar Oct 21 '22 08:10 LucasJavaudin

Can I push to this Draft a test that do not pass but that I think should pass?

LucasJavaudin avatar Oct 21 '22 08:10 LucasJavaudin

Yes, this is good!

Is it ? IMO the values should always be wrapped in Value::Union, so that it is possible to validate it with a Schema. The union items should be:

  • Value::String for unit_variant
  • Value::Array for newtype_variant and tuple_variant
  • Value::Record for struct_variant

martin-g avatar Oct 21 '22 08:10 martin-g

Can I push to this Draft a test that do not pass but that I think should pass?

Only Avro project committers can push to this repository. You will have to open a PR against the branch for this PR.

martin-g avatar Oct 21 '22 08:10 martin-g

Is it ? IMO the values should always be wrapped in Value::Union, so that it is possible to validate it with a Schema. The union items should be:

Sorry, I meant that the test are useful and they currently pass but I agree that they should not pass.

* Value::String for unit_variant

* Value::Array for newtype_variant and tuple_variant

* Value::Record for struct_variant

For unit variant, shouldn't it be Value::Null?

LucasJavaudin avatar Oct 21 '22 08:10 LucasJavaudin

There might be more than one unit variants in the enum. I think it should be Value::String with the name of the variant, to be able to deserialize.

martin-g avatar Oct 21 '22 08:10 martin-g

Consider the following enum:

enum MyEnum {
    Val1,
    Val2,
    Val3(u64),
}

What should be its schema?

I can think of two options.

A record with two fields

  • Field "type" which is an enum with symbols "Val1", "Val2", "Val3".
  • Field "value" which is an union.

The union cannot be ["null", "null", "long"] (or ["string", "string", "long"]) because we cannot have the same type twice. Correct?

Can it be just ["null", "long"] or [enum(["Val1"]), enum(["Val2"]), "long"]? Neither seems intuitive to me.

An union

With the enum [enum(["Val1"]), enum(["Val2"]), "long"], we can know both the variant and the data (if any) but this requires to match the Val3(u64) variant to "long", without using the variant name. (Can we match variants by their order?)

LucasJavaudin avatar Oct 21 '22 09:10 LucasJavaudin

@LucasJavaudin I think the PR is ready for final review! I'd be thankful if you can provide more use cases / unit tests!

martin-g avatar Nov 08 '22 21:11 martin-g

The avro_derive tests fail ... I'll investigate soon!

martin-g avatar Nov 08 '22 21:11 martin-g

@martin-g Thank you for your work! This solves the issue I raised in AVRO-3646.

I pushed two new integration tests in #1949 that try to deserialize and serialize more complex enums. The tests do not pass for multiple reasons. I think that would improve greatly the library if we could (de)serialize arbitrary enums (and even derive a schema for them) but maybe this needs to much work?

Unfortunately, I don't have enough time myself to do more than adding tests and reviewing your changes.

LucasJavaudin avatar Nov 10 '22 09:11 LucasJavaudin

The avro_derive tests fail ...

The problem here is that another user (@jklamer, the contributor of the avro_derive crate) expects that

#[derive(Debug, Deserialize, Serialize, AvroSchema, Clone, PartialEq, Eq)]
enum MyEnum {
    Foo,
    Bar,
    Baz,
}

should map to this Avro schema:

{
  "name":"myenum",
  "type":{
    "type":"enum",
     "name":"MyEnum",
     "symbols":["Foo", "Bar", "Baz"]
   },
   "default":"Foo"
}

Which is really nice! But unfortunately does not work anymore with the new way of supporting complex Rust enums (ones with non-null tuple/newtype/struct variants).

The only way to still support the above I see is to introduce a second Serializer impl, e.g. UnitEnumSerializer and let the user decide which one to use. But the problem is that a Writer could use either the complex Serializer or the UnitEnumSerializer. There won't be a way to use one serializer for a given RecordField and another serializer for another field.

Also CC @spiegela

martin-g avatar Nov 11 '22 12:11 martin-g

Yeah, this conversation is split over a couple different PR's, threads, and e-mails. But I'll respond here. With avro_derive I was very intentional about only tackling cases that I could make work automatically with serde and the normal serializer for the best new user experience. Modeling a Rust enum as some Avro entity opened too many cans of worms to handle automatically, having something opt in would be the only way to go but would also make it harder to mix Rust enums and Avro enums clearly (as mentioned above), to say nothing of the ambiguity of the deserialization process.

My bias would be to create a Macro that creates an intermediate form that is unambiguously convertible between Rust Enum and an Avro compatible entity and do the conversion before serialization and after deserialization (don't try to make serde process work outside the avro spec) (and could point to macro in the AvroSchema derive error), but I personally haven't been able to think of a solution that works for every case. Rust's type system is just most expansive that Avro's, so the ambiguity is inevitable at some level.

That being said, I LOVE Rust enums, with a passion, and when I realized I couldn't automatically serde them... I started something that could. <-- me shaving the yak.

TLDR: I think it's impossible to do automatically without creating ambiguity that I would like to shield most users from. I think doing it as an opt in with clearly defined restrictions (and good error messages) would be a good to go for advanced users.

CC @spiegela CC @LucasJavaudin

jklamer avatar Nov 11 '22 18:11 jklamer

I wonder whether we can use https://serde.rs/enum-representations.html#untagged to give a hint to apache_avro (de)serialization process to keep the old behavior. I am looking into serde's source code to see how impls like apache_avro can know that this attribute is being used.

martin-g avatar Nov 11 '22 19:11 martin-g

Actually it doesn't need to be #serde(untagged). It could be even #avro(something). I miss Java annotations + reflection APIs :-)

martin-g avatar Nov 11 '22 20:11 martin-g

@martin-g hahaha agreed. And now with Java Sealed classes, a pretty clear Rust Enum 1-1 could be made too!

jklamer avatar Nov 11 '22 20:11 jklamer

Another user asked for the same functionality at https://github.com/serde-rs/serde/issues/2309

martin-g avatar Nov 14 '22 07:11 martin-g

To make sure I understand the status of change, is the remaining question here how the AvroSchema derive should ”choose” which enum format to use when constructing the schema?

If so, I was thinking that the AvroSchema derive could infer the correct behavior based on the variant characteristics:

  • All unit variants, create an enum schema as we do today.
  • 1+ non-unit variants with either named fields, or single-field unnamed fields, use the Record-style (as in this PR) as the schema.

That leaves the case of non-unit, multi-field, unnamed fields 😅. For those, we should return a compile error, I think, and let the dev implement AvroSchemaComponent themselves — otherwise, could add Record fields named 0, 1, …

spiegela avatar Nov 14 '22 13:11 spiegela

The actual problem in ser.rs and de.rs. This PR changes the way apache_avro serializes objects/instances into apache_avro::types::Value. For now I see no way to make the serde logic more dynamic, i.e. depending on the enum variants' types or on container/field's attribute to decide what kind of Value to construct.

The new serde logic is powerful enough to handle complex kinds of Rust enums (by using Schema::Record([Schema::Enum, Schema::Union])). But it does not support the old representation (Schema::Enum) because all variants are mapped to Value::Null and there could be at most one null in an Avro Union.

The logic what schema to derive looks like a simpler task (although I haven't tried it yet, so I might be wrong).

martin-g avatar Nov 14 '22 15:11 martin-g

Seems like AVRO-3695 should be addressed in this issue. (see https://github.com/apache/avro/pull/2035)

However I just tested it against Master + AVRO-3631 and it does not address the problem.

This is my last blocking issue for using Rust for our industry protocol, so hoping it can be looked at. @martin-g ?

I've created what is probably a 'Hack' fix that addresses the issue for me. But honestly I'm not sure what i'm doing here, so it likley breaks other stuff. (So I havn't made a PR, just added here for your info)

ser.rs. line # 260 or so

fn serialize_newtype_variant<T: ?Sized>(
        self,
        _enum_name: &'static str,
        index: u32,
        _variant: &'static str,
        value: &T,
    ) -> Result<Self::Ok, Self::Error>
    where
        T: Serialize,
    {
        let serialized = value.serialize(self)?;
        let _idx = match serialized {
            // assumes that the Null variant is always the first one, as in Avro's Union
            Value::Null => 0,
            _ => index,
        };
        // Ok(Value::Record(vec![
        //     ("type".to_owned(), Value::Enum(index, variant.to_owned())),
        //     ("value".to_owned(), Value::Union(idx, Box::new(serialized))),
        // ]))

        Ok(Value::Union(index, Box::new(serialized)))
    }

markfarnan avatar Mar 13 '23 19:03 markfarnan