serde icon indicating copy to clipboard operation
serde copied to clipboard

Consider forwarding visit_bytes to visit_seq

Open RReverser opened this issue 6 years ago • 3 comments

If a self-describing target encodes byte blobs separately, then it might want to call visit_bytes / visit_byte_buf from deserialize_any.

However, currently this fails with an error when used against any Vec by default, because Serde expects visit_bytes to be called only in response to an explicit deserialize_bytes (from a wrapper like serde-bytes).

I don't see any reason not to allow this from a semantic point of view, and forward visit_bytes to visit_seq by default, which would automatically allow both visit_bytes and visit_byte_buf from deserialize_any.

Would you be open to such change?

RReverser avatar Mar 19 '19 15:03 RReverser

I would want to know more about the use case before doing this. Would you be able to share a concrete use case that is blocked by this issue or some specific Deserializers that would benefit from having this?


I don't see any reason not to allow this from a semantic point of view,

Some reasons other than semantics:

  • Deserializing a Vec would normally result in a call to deserialize_seq rather than deserialize_any. The reported issue would usually only be observable if a Deserializer forwards deserialize_seq to deserialize_any but I would like to move in the direction of discouraging that. It worsens compile times and bloat in the binary when a data format's deserialize_any ends up instantiating lots more of the methods of the visitor generic type V than necessary.

  • Formats in which deserializing bytes as a Vec comes up frequently can already choose to support this in their deserialize_seq implementation so that the caller no longer needs to depend on serde_bytes.

  • There are endless of these semantically justifiable implicit conversions that we could make happen automatically -- string to seq of char, string to seq of u8, seq of char to string, string to integer, boolean to integer, integer to string, empty seq to unit, unit to empty seq, unit to empty map, unit to empty string, etc.. We need to draw a line somewhere and there will always be choices near the boundary that people think are obviously justifiable based on the conversions already implemented.

  • Implicit conversions performed by Visitor are not free; they tend to add code size in programs that do not ever benefit from the conversion. This one would involve instantiating visit_seq with a SeqAccess type parameter that it wouldn't otherwise be instantiated with, different from the data format's SeqAccess.

  • Less precise error messages. We removed a similar implicit conversion from string to seq in part because wrongly deserializing a Vec when the input contained a string would end up saying something like "invalid type: integer `65`, expected struct T" instead of "invalid type: string \"A\", expected a vector".

dtolnay avatar Mar 19 '19 20:03 dtolnay

The reported issue would usually only be observable if a Deserializer forwards deserialize_seq to deserialize_any

Yeah, this is exactly the case I have in mind (forwarding any types that can be inferred to deserialize_any).

but I would like to move in the direction of discouraging that

Huh, I thought this is explicitly encouraged by forward_to_deserialize_any! macro and the docs which currently say:

Self-describing formats can save a lot of code by using the forward_to_deserialize_any! macro to ignore hints and forward some or all of the methods of the Deserializer trait to the deserialize_any method.

Particularly, it helps that with deserialize_any it's nice that you don't need a separate helper method just to infer proper de::Unexpected::<type> from your value - instead this is "taken care of" by the fact that incompatible visitor method gets called on the target type and results in a self-describing error automatically.

There are endless of these semantically justifiable implicit conversions that we could make happen automatically -- string to seq of char, string to seq of u8, seq of char to string, string to integer, boolean to integer, integer to string, empty seq to unit, unit to empty seq, unit to empty map, unit to empty string, etc.. We need to draw a line somewhere and there will always be choices near the boundary that people think are obviously justifiable based on the conversions already implemented.

Yeah, I agree with this... In fact, I was thinking of raising a separate docs issue to clarify relations between deserialize_* and visit_* methods because currently I often have to look up implementation of particular Deserialize on standard types to understand what's supported and what's not.

For now the way I tried to think of this line is "if it's a helper method that just allows a more precise serializer/deserializer, then it should be backwards compatible with regular one too", and in this regard deserialize_byte_buf / deserialize_bytes seems to be (and is documented as) an optimised version of deserialize_seq, so it seems to make sense for it to fall back to visit_seq when an optimised target doesn't exist.

That said, as mentioned above, this line seems a bit unclear even now and I have to look up which number conversions are allowed and which deserialize_* I need to implement manually vs forward to e.g. deserialize_u64. A guideline in docs would be highly appreciated!

Implicit conversions performed by Visitor are not free; they tend to add code size in programs that do not ever benefit from the conversion.

Agreed. Although in this case it's just a default, so the situation seems different for the following reasons:

  1. If an override for visit_bytes is provided, it will be used instead, and this conversion won't be in the final code.
  2. If an input format doesn't have self-described bytes and so doesn't call visit_bytes, it won't be in the final code either.
  3. Finally, if an override is not provided, but format does have a "byte blob" type, then it's most likely that concrete deserializer will have to implement such conversion itself to be able to respond to deserialize_seq, and for this case it's better to have a single centralised implementation of such conversion that would appear just once in the code, than have each Deserializer implement their own.

RReverser avatar Mar 19 '19 22:03 RReverser

rmp-serde had this issue, and disallowing visit_bytes for sequences has fixed compatibility with [u8; N] and other standard types.

I've also implemented an ugly hack to fake specialization and use bytes encoding for collections with 1-byte items.

I'm not sure if it would be a good idea for serde to try to fake specialization itself (based on sizeof of elements, or maybe literal u8 type name match in derive macros). I guess it needs Rust to stabilize proper specialization.

kornelski avatar Apr 30 '24 13:04 kornelski