avro icon indicating copy to clipboard operation
avro copied to clipboard

AVRO-3631: [Rust] Add support for (de)serializing Rust byte arrays to Avro values

Open martin-g opened this issue 3 years ago • 8 comments

What is the purpose of the change

To make it possible to serialize Rust structs to Avro values (types::Value).

Verifying this change

This change added tests and can be verified as follows:

  • new unit and integration tests have been added

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? Rustdoc

martin-g avatar Oct 06 '22 08:10 martin-g

@martin-g Couple small conflicts when merging this with AVRO-3683.

In the absence of Serde-Bytes #28, Could this be made to use the new serde-bytes-array crate and included in the release ?

This issue with fixed is blocking use of AVRO on my side.

markfarnan avatar Mar 11 '23 13:03 markfarnan

This issue with fixed is blocking use of AVRO on my side.

If you only need deserialization ATM (because serialization it's not implemented yet in that rework), you may want to give a test to the redesign considered at https://issues.apache.org/jira/browse/AVRO-3714?focusedCommentId=17696183&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17696183, where that is supported (in addition to x10 perf improvements)

Ten0 avatar Mar 11 '23 18:03 Ten0

I need both serialization and deserialization, using avro-datum (no embeded schemas) for complex multifile schemas (see the recently merged AVRO-3683). Also fixed length arrays, and multi-value nullable unions. If the redesign can do all that, happy to look !

markfarnan avatar Mar 13 '23 13:03 markfarnan

I need both serialization and deserialization, using avro-datum (no embeded schemas) for complex multifile schemas (see the recently merged AVRO-3683). Also fixed length arrays, and multi-value nullable unions. If the redesign can do all that, happy to look !

Everything you mentioned should be supported, however only on the deserialization side for (serialization is still a WIP).

  • Multifile schemas because schema parsing uses that of apache-avro.
  • Fixed length arrays & multi-value nullable unions as described at https://docs.rs/serde_avro_fast/0.1.0/serde_avro_fast/schema/enum.SchemaNode.html#variant.Union
  • Efficient deserialization of byte arrays via serde_bytes (in particular you can deserialize directly to &[u8] with zero alloc.)

Since you have such a complex use-case, it would likely be of great help for AVRO-3714 if you could give it a try.

Ten0 avatar Mar 13 '23 14:03 Ten0

@martin-g I tested this along with release version of serde-bytes and serde-byte-array.
All works fine for my use cases for Fixed arrays. +1 for merge from me.

I noticed I had to use serde-bytes for vec but serde-byte-array for [u8; 16] Assume this is correct?

markfarnan avatar Mar 13 '23 19:03 markfarnan

noticed I had to use serde-bytes for vec<8> but serde-byte-array for [u8; 16] Assume this is correct?

It is a mess but yes :-/

martin-g avatar Mar 13 '23 19:03 martin-g

Its fine. Works.

markfarnan avatar Mar 13 '23 19:03 markfarnan

If the redesign can do all that, happy to look !

@markfarnan

Happy to say that I've just released it as serde_avro_fast 0.2.0 with serialization support (which was the only thing missing so that it would be in a state where it "can do all that"). (Also fixes basically all currently open issues of apache_avro and has x10 perf)

Please let me know if there's anything I can do to help with testing or if you encounter any issue! Looking forward to hearing from you :)

Ten0 avatar Apr 17 '23 18:04 Ten0

I had a quick look at this, is there more to it than simply using serde_bytes on struct fields ?

Something like this patch
diff --git a/lang/rust/Cargo.lock b/lang/rust/Cargo.lock
index 8fce3d635..572e7bdc0 100644
--- a/lang/rust/Cargo.lock
+++ b/lang/rust/Cargo.lock
@@ -92,6 +92,7 @@ dependencies = [
  "regex-lite",
  "rstest",
  "serde",
+ "serde_bytes",
  "serde_json",
  "serial_test",
  "sha2",
@@ -1175,6 +1176,15 @@ dependencies = [
  "serde_derive",
 ]
 
+[[package]]
+name = "serde_bytes"
+version = "0.11.15"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "387cc504cb06bb40a96c8e04e951fe01854cf6bc921053c954e4a606d9675c6a"
+dependencies = [
+ "serde",
+]
+
 [[package]]
 name = "serde_derive"
 version = "1.0.204"
diff --git a/lang/rust/Cargo.toml b/lang/rust/Cargo.toml
index a6fd2ad48..e6b8b5767 100644
--- a/lang/rust/Cargo.toml
+++ b/lang/rust/Cargo.toml
@@ -43,6 +43,7 @@ documentation = "https://docs.rs/apache-avro"
 [workspace.dependencies]
 log = { default-features = false, version = "0.4.22" }
 serde = { default-features = false, version = "1.0.204", features = ["derive"] }
+serde_bytes = { default-features = false, version = "0.11.15", features = ["std"] }
 serde_json = { default-features = false, version = "1.0.120", features = ["std"] }
 
 [profile.release.package.hello-wasm]
diff --git a/lang/rust/avro/Cargo.toml b/lang/rust/avro/Cargo.toml
index 27728bcf6..df5874d4e 100644
--- a/lang/rust/avro/Cargo.toml
+++ b/lang/rust/avro/Cargo.toml
@@ -64,6 +64,7 @@ log = { workspace = true }
 num-bigint = { default-features = false, version = "0.4.6", features = ["std", "serde"] }
 regex-lite = { default-features = false, version = "0.1.6", features = ["std", "string"] }
 serde = { workspace = true }
+serde_bytes = { workspace = true }
 serde_json = { workspace = true }
 snap = { default-features = false, version = "1.1.0", optional = true }
 strum = { default-features = false, version = "0.26.3" }
diff --git a/lang/rust/avro/src/lib.rs b/lang/rust/avro/src/lib.rs
index a68240600..3c31c1584 100644
--- a/lang/rust/avro/src/lib.rs
+++ b/lang/rust/avro/src/lib.rs
@@ -872,6 +872,7 @@ pub use reader::{
 };
 pub use schema::{AvroSchema, Schema};
 pub use ser::to_value;
+pub use serde_bytes;
 pub use util::{max_allocation_bytes, set_serde_human_readable};
 pub use uuid::Uuid;
 pub use writer::{
diff --git a/lang/rust/avro/tests/avro-3631.rs b/lang/rust/avro/tests/avro-3631.rs
new file mode 100644
index 000000000..caa361eaf
--- /dev/null
+++ b/lang/rust/avro/tests/avro-3631.rs
@@ -0,0 +1,108 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use apache_avro::types::Value;
+use apache_avro::{from_value, serde_bytes, to_value, Schema};
+use serde::{Deserialize, Serialize};
+
+#[test]
+fn avro_3631_test_serialize_fixed_fields() {
+    #[derive(Debug, Serialize, Deserialize)]
+    struct TestStructFixedField<'a> {
+        #[serde(with = "serde_bytes")]
+        bytes_field: &'a [u8],
+
+        #[serde(with = "serde_bytes")]
+        vec_field: Vec<u8>,
+
+        #[serde(with = "serde_bytes")]
+        fixed_field: [u8; 6],
+    }
+
+    let test = TestStructFixedField {
+        bytes_field: &[1, 2, 3],
+        fixed_field: [1; 6],
+        vec_field: vec![2, 3, 4],
+    };
+    let value: Value = to_value(test).unwrap();
+    let schema = Schema::parse_str(
+        r#"
+            {
+                "type": "record",
+                "name": "TestStructFixedField",
+                "fields": [
+                    {
+                        "name": "bytes_field",
+                        "type": "bytes"
+                    },
+                    {
+                        "name": "vec_field",
+                        "type": "bytes"
+                    },
+                    {
+                        "name": "fixed_field",
+                        "type": {
+                            "name": "fixed_field",
+                            "type": "fixed",
+                            "size": 6
+                        }
+                    }
+                ]
+            }
+            "#,
+    )
+    .unwrap();
+    assert!(value.validate(&schema));
+}
+
+#[test]
+fn avro_3631_deserialize_value_to_struct_with_byte_types() {
+    #[derive(Debug, Deserialize, PartialEq)]
+    struct TestStructFixedField {
+        /// borrowed slices are not supported !
+        // #[serde(with = "serde_bytes", borrow)]
+        // slice_field: &'a [u8],
+
+        /// borrowed byte arrays are not supported !
+        // #[serde(with = "serde_byte", borrow)]
+        // borrowed_fixed_field: &'a [u8; 16],
+
+        #[serde(with = "serde_bytes")]
+        vec_field: Vec<u8>,
+
+        #[serde(with = "serde_bytes")]
+        fixed_field: [u8; 6],
+    }
+
+    let expected = TestStructFixedField {
+        // slice_field: &[1, 11, 111],
+        // borrowed_fixed_field: &[2; 16],
+        vec_field: vec![3, 33],
+        fixed_field: [1; 6],
+    };
+
+    let value = Value::Record(vec![
+        // ("slice_field".to_owned(), ??? ) // needs something like Value::BytesSlice(&[u8])
+        // ("borrowed_fixed_field".to_owned(), ??? ) // needs something like Value::FixedSlice(&[u8; 6])
+        ("vec_field".to_owned(), Value::Bytes(vec![3, 33])),
+        (
+            "fixed_field".to_owned(),
+            Value::Fixed(6, Vec::from(expected.fixed_field)),
+        ),
+    ]);
+    assert_eq!(expected, from_value(&value).unwrap());
+}

lerouxrgd avatar Jul 16 '24 15:07 lerouxrgd

@lerouxrgd Please send the patch as a Pull Request, so you get your credits! Otherwise please let me know and I will apply it manually!

martin-g avatar Jul 17 '24 11:07 martin-g

@martin-g You can apply it manually ! I basically took your unit tests and just used serde_bytes on the appropriate fields so you should have the credits =)

lerouxrgd avatar Jul 17 '24 12:07 lerouxrgd

@lerouxrgd 0.17 is very close! I may not have time to apply it before the cut! If you have time to send it as a PR I will find time to review and merge it!

martin-g avatar Jul 17 '24 13:07 martin-g

@martin-g I have prepared GH-3027 for this !

lerouxrgd avatar Jul 18 '24 11:07 lerouxrgd