avro
avro copied to clipboard
AVRO-3646: serde for enum with mixed variants
What is the purpose of the change
- Adds unit tests for AVRO-3646
Verifying this change
This change just adds unit tests
Documentation
- Does this pull request introduce a new feature? no
@LucasJavaudin Could you please review the new unit tests ? Thanks!
Converted to Draft because as the JIRA ticket explains the produced Value
s cannot be represented using an Avro Schema
.
The serialization should produce a Value::Union
that wraps each of the currently produced Value
s.
@LucasJavaudin Could you please review the new unit tests ? Thanks!
Yes, this is good!
Can I push to this Draft a test that do not pass but that I think should pass?
Yes, this is good!
Is it ?
IMO the values should always be wrapped in Value::Union
, so that it is possible to validate it with a Schema. The union items should be:
- Value::String for unit_variant
- Value::Array for newtype_variant and tuple_variant
- Value::Record for struct_variant
Can I push to this Draft a test that do not pass but that I think should pass?
Only Avro project committers can push to this repository. You will have to open a PR against the branch for this PR.
Is it ? IMO the values should always be wrapped in
Value::Union
, so that it is possible to validate it with a Schema. The union items should be:
Sorry, I meant that the test are useful and they currently pass but I agree that they should not pass.
* Value::String for unit_variant * Value::Array for newtype_variant and tuple_variant * Value::Record for struct_variant
For unit variant, shouldn't it be Value::Null
?
There might be more than one unit variants in the enum. I think it should be Value::String
with the name of the variant, to be able to deserialize.
Consider the following enum:
enum MyEnum {
Val1,
Val2,
Val3(u64),
}
What should be its schema?
I can think of two options.
A record with two fields
- Field "type" which is an enum with symbols "Val1", "Val2", "Val3".
- Field "value" which is an union.
The union cannot be ["null", "null", "long"]
(or ["string", "string", "long"]
) because we cannot have the same type twice. Correct?
Can it be just ["null", "long"]
or [enum(["Val1"]), enum(["Val2"]), "long"]
? Neither seems intuitive to me.
An union
With the enum [enum(["Val1"]), enum(["Val2"]), "long"]
, we can know both the variant and the data (if any) but this requires to match the Val3(u64)
variant to "long", without using the variant name. (Can we match variants by their order?)
@LucasJavaudin I think the PR is ready for final review! I'd be thankful if you can provide more use cases / unit tests!
The avro_derive tests fail ... I'll investigate soon!
@martin-g Thank you for your work! This solves the issue I raised in AVRO-3646.
I pushed two new integration tests in #1949 that try to deserialize and serialize more complex enums. The tests do not pass for multiple reasons. I think that would improve greatly the library if we could (de)serialize arbitrary enums (and even derive a schema for them) but maybe this needs to much work?
Unfortunately, I don't have enough time myself to do more than adding tests and reviewing your changes.
The avro_derive tests fail ...
The problem here is that another user (@jklamer, the contributor of the avro_derive crate) expects that
#[derive(Debug, Deserialize, Serialize, AvroSchema, Clone, PartialEq, Eq)]
enum MyEnum {
Foo,
Bar,
Baz,
}
should map to this Avro schema:
{
"name":"myenum",
"type":{
"type":"enum",
"name":"MyEnum",
"symbols":["Foo", "Bar", "Baz"]
},
"default":"Foo"
}
Which is really nice! But unfortunately does not work anymore with the new way of supporting complex Rust enums (ones with non-null tuple/newtype/struct variants).
The only way to still support the above I see is to introduce a second Serializer
impl, e.g. UnitEnumSerializer and let the user decide which one to use. But the problem is that a Writer
could use either the complex Serializer or the UnitEnumSerializer. There won't be a way to use one serializer for a given RecordField and another serializer for another field.
Also CC @spiegela
Yeah, this conversation is split over a couple different PR's, threads, and e-mails. But I'll respond here. With avro_derive
I was very intentional about only tackling cases that I could make work automatically with serde and the normal serializer for the best new user experience. Modeling a Rust enum as some Avro entity opened too many cans of worms to handle automatically, having something opt in would be the only way to go but would also make it harder to mix Rust enums and Avro enums clearly (as mentioned above), to say nothing of the ambiguity of the deserialization process.
My bias would be to create a Macro that creates an intermediate form that is unambiguously convertible between Rust Enum and an Avro compatible entity and do the conversion before serialization and after deserialization (don't try to make serde process work outside the avro spec) (and could point to macro in the AvroSchema derive error), but I personally haven't been able to think of a solution that works for every case. Rust's type system is just most expansive that Avro's, so the ambiguity is inevitable at some level.
That being said, I LOVE Rust enums, with a passion, and when I realized I couldn't automatically serde them... I started something that could. <-- me shaving the yak.
TLDR: I think it's impossible to do automatically without creating ambiguity that I would like to shield most users from. I think doing it as an opt in with clearly defined restrictions (and good error messages) would be a good to go for advanced users.
CC @spiegela CC @LucasJavaudin
I wonder whether we can use https://serde.rs/enum-representations.html#untagged to give a hint to apache_avro (de)serialization process to keep the old behavior. I am looking into serde's source code to see how impls like apache_avro can know that this attribute is being used.
Actually it doesn't need to be #serde(untagged)
. It could be even #avro(something)
.
I miss Java annotations + reflection APIs :-)
@martin-g hahaha agreed. And now with Java Sealed classes, a pretty clear Rust Enum 1-1 could be made too!
Another user asked for the same functionality at https://github.com/serde-rs/serde/issues/2309
To make sure I understand the status of change, is the remaining question here how the AvroSchema derive should ”choose” which enum format to use when constructing the schema?
If so, I was thinking that the AvroSchema derive could infer the correct behavior based on the variant characteristics:
- All unit variants, create an enum schema as we do today.
- 1+ non-unit variants with either named fields, or single-field unnamed fields, use the Record-style (as in this PR) as the schema.
That leaves the case of non-unit, multi-field, unnamed fields 😅. For those, we should return a compile error, I think, and let the dev implement AvroSchemaComponent
themselves — otherwise, could add Record fields named 0
, 1
, …
The actual problem in ser.rs and de.rs.
This PR changes the way apache_avro serializes objects/instances into apache_avro::types::Value
.
For now I see no way to make the serde logic more dynamic, i.e. depending on the enum variants' types or on container/field's attribute to decide what kind of Value
to construct.
The new serde logic is powerful enough to handle complex kinds of Rust enums (by using Schema::Record([Schema::Enum, Schema::Union])). But it does not support the old representation (Schema::Enum) because all variants are mapped to Value::Null and there could be at most one null
in an Avro Union.
The logic what schema to derive looks like a simpler task (although I haven't tried it yet, so I might be wrong).
Seems like AVRO-3695 should be addressed in this issue. (see https://github.com/apache/avro/pull/2035)
However I just tested it against Master + AVRO-3631 and it does not address the problem.
This is my last blocking issue for using Rust for our industry protocol, so hoping it can be looked at. @martin-g ?
I've created what is probably a 'Hack' fix that addresses the issue for me. But honestly I'm not sure what i'm doing here, so it likley breaks other stuff. (So I havn't made a PR, just added here for your info)
ser.rs. line # 260 or so
fn serialize_newtype_variant<T: ?Sized>(
self,
_enum_name: &'static str,
index: u32,
_variant: &'static str,
value: &T,
) -> Result<Self::Ok, Self::Error>
where
T: Serialize,
{
let serialized = value.serialize(self)?;
let _idx = match serialized {
// assumes that the Null variant is always the first one, as in Avro's Union
Value::Null => 0,
_ => index,
};
// Ok(Value::Record(vec![
// ("type".to_owned(), Value::Enum(index, variant.to_owned())),
// ("value".to_owned(), Value::Union(idx, Box::new(serialized))),
// ]))
Ok(Value::Union(index, Box::new(serialized)))
}