serde_arrow icon indicating copy to clipboard operation
serde_arrow copied to clipboard

Export OuterSequenceDeserializer.

Open ryzhyk opened this issue 1 year ago • 4 comments

I would like to be able to use OuterSequenceDeserializer directly rather than via fn from_record_batch(). The reason is that we have our own version of the serde::Deserialize trait that provides some additional flexibility.

@chmp , I don't think the PR can be merged as is, since it exports the Mut type that is meant to be internal, but I wanted to see if you'd be open to this change in principle before implementing a cleaner solution.

ryzhyk avatar May 08 '24 17:05 ryzhyk

Hey again :).

Given the fact that I completely rewrote the internals 3 times by now in rather rapid succession, I am hesitant to make them part of the public API. Could explain a bit more what your use case is? And if the code is public, point me to it?

chmp avatar May 08 '24 18:05 chmp

Hey again :).

Given the fact that I completely rewrote the internals 3 times by now in rather rapid succession, I am hesitant to make them part of the public API. Could explain a bit more what your use case is? And if the code is public, point me to it?

So let's say we have our own timestamp type that represents time in microseconds internally. We would like to be able to deserialize this type from a whole zoo of different fomats (parquet, avro, debezium json, etc.), which all use different representations of time. In fact, even just focusing on Parquet, time can be encoded as secs, millis, usecs, or nanosecs, as you know :) To support this, we defined a version of the serde::Deserialize trait that takes additional metadata that tells each type how to deserialize itself. This custom trait still expects an instance of serde::Deserializer to extract the actual values from.

So as you can see, we cannot use from_record_batch directly, since we don't have a serde::Deserialize implementation it could use, but we could use OuterSequenceDeserializer directly instead if it was exported.

I understand you reluctance to expose the internals of the crate though. We can just keep maintaining a fork until we find a solution that works well on both sides.

ryzhyk avatar May 11 '24 20:05 ryzhyk

Ah. Yeah. I ran into a similar issues, these pesky time units :).

Would any Deserializer work for you? Or does it have to be the OuterSequenceDeserializer specifically?

I think exporting a Deserializer would be cool, as it mirrors other crates and offers a bit more flexibility. I would like to hide the implementation details though. Most likely by using an opaque wrapper around the OuterSequenceDeserializer that only exposes the Deserializer impl. Something like:

struct Deserializer(...);

impl serde::Deserializer for Deserializer { ... }

pub fn build_deserializer(fields: Vec<FieldRef>) -> Result<Deserializer> { ... }

chmp avatar May 12 '24 18:05 chmp

This sounds perfect. We just really need a handle to anything that implements the Deserializer trait. I will update my PR along these lines.

ryzhyk avatar May 12 '24 18:05 ryzhyk

Just a heads up. As I really like the idea, I started work on this feature here: https://github.com/chmp/serde_arrow/pull/172

chmp avatar May 26 '24 19:05 chmp

Nice! Sorry, it was taking me too long to get around to finishing this.

ryzhyk avatar May 27 '24 00:05 ryzhyk

Could you maybe skim the current PR, whether it would work for your use case? (I don't see issues, but maybe I misunderstood your requirements).

chmp avatar May 27 '24 16:05 chmp