serde icon indicating copy to clipboard operation
serde copied to clipboard

Untagged enum deserializing expects human-readable representation even with non human-readable data formats

Open bdbai opened this issue 3 years ago • 6 comments

Some data types like std::net::IpAddr has two different representations for human-readable and non-human-readable Deserializers. However, the built-in deserializers that work with enums do not recognize human-readability of the data format, and expect a representation different from what serializers produce. The following example shows a round trip serialization with msgpack that fails during deserialization.

// [dependencies]
// rmp-serde = "1"
// rmpv = "1"
// serde = { version = "1", features = ["derive"] }

use serde::{Deserialize, Serialize};
use std::{io::Cursor, net::IpAddr};
#[derive(Debug, Serialize, Deserialize)]
#[serde(untagged)]
enum HostName {
    Ip(IpAddr),
    DomainName(String),
}
fn main() {
    // Serialize into MessagePack bytes.
    let mp_data = rmp_serde::to_vec(&HostName::Ip([8, 8, 8, 8].into())).unwrap();

    // Inspect the data structure
    let value = rmpv::decode::value::read_value(&mut Cursor::new(mp_data.clone())).unwrap();
    println!("{}", value); // {"V4": [8, 8, 8, 8]}

    // Deserialize back
    rmp_serde::from_slice::<HostName>(&mp_data).unwrap(); // Panic!

    // A MessagePack encoded string "1.1.1.1"
    let mp_data = vec![0xA7u8, 0x31, 0x2E, 0x31, 0x2E, 0x31, 0x2E, 0x31];
    println!("{:?}", rmp_serde::from_slice::<HostName>(&mp_data).unwrap()); // Ip(1.1.1.1)
}

In this example, a ContentRefDeserializer is used behind the scene to pick a suitable variant of HostName. The implementation does not override is_human_readable method on Deserializer trait, thus uses the default value true. Therefore, the non-human-readable representation of IpAddr {"V4": [8, 8, 8, 8]} is rejected. Any suggestions?

PS: #1919 seems related.

bdbai avatar Feb 06 '22 16:02 bdbai

replying to @dtolnay's related comment on https://github.com/serde-rs/serde/pull/1919#pullrequestreview-600321475:

I'll go ahead and close this in the absence of a use case.

@dtolnay: a concrete use case is the zerovec crate, which uses a different serialization format depending on whether the (de)serializer is human readable: https://github.com/unicode-org/icu4x/blob/0e395d9c03f8d98bd3c37a0bba7bced4d5511529/utils/zerovec/src/zerovec/serde.rs#L72-L76

Unfortunately, it's really easy to embed ZeroVec and similar types within, e.g., an untagged enum that then fails to deserialize due to is_human_readable not being propagated through ContentRefDeserializer and related internal deserializers.

Any type that has a Deserialize implementation that depends on the deserializer's is_human_readable field is going to run into this issue. None of these types can be safely embedded within untagged enums, and probablly can't use #[serde(flatten)] or other features that also depend in intermediate deserializers. It makes sense that is_human_readable was intended to be best-effort, but unfortunately it's being relied upon within the ecosystem. Worse, the bugs that result from this issue aren't discoverable until runtime, when round-trip deserialization fails.

While it's not ideal, I think the most straightforward solution here is to propagate is_human_readable through all of the intermediate deserializers whenever possible.

cc: @Manishearth

ramosbugs avatar Nov 09 '22 01:11 ramosbugs

Wait, why is this a problem with flatten? Flatten would be straightforward, no?

Manishearth avatar Nov 09 '22 02:11 Manishearth

#[serde(flatten)] uses FlatMapDeserializer, which is one of the internal deserializers that doesn't propagate is_human_readable from the underlying deserializer: https://github.com/serde-rs/serde/blob/fb2fe409c8f7ad6c95e3096e5e9ede865c8cfb49/serde_derive/src/de.rs#L2574

ramosbugs avatar Nov 09 '22 02:11 ramosbugs

Oh, wait, flatten's different from transparent. Never mind.

Manishearth avatar Nov 09 '22 02:11 Manishearth

We've also been hit by this. We're using rmp_serde in combination with enum_map crate. Simple code like below, fails.

use enum_map::*;
use serde::*;

fn main() {
    let val = Foo {
        bar: Bar {
            baz: enum_map! {_ => ()},
        },
    };
    let buf = rmp_serde::encode::to_vec(&val).unwrap();
    // panic here
    let _: Foo = rmp_serde::decode::from_slice(&buf).unwrap();
    
}

#[derive(Serialize, Deserialize)]
struct Foo {
    #[serde(flatten)]
    bar: Bar,
}

#[derive(Serialize, Deserialize)]
struct Bar {
    baz: EnumMap<bool, ()>,
}

We have spent some time trying to figure out why. In enum_map there's code like this:

/// Requires crate feature `"serde"`
impl<K: EnumArray<V> + Serialize, V: Serialize> Serialize for EnumMap<K, V> {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        if serializer.is_human_readable() {
            serializer.collect_map(self)
        } else {
            let mut tup = serializer.serialize_tuple(self.len())?;
            for value in self.values() {
                tup.serialize_element(value)?;
            }
            tup.end()
        }
    }
}

/// Requires crate feature `"serde"`
impl<'de, K, V> Deserialize<'de> for EnumMap<K, V>
where
    K: EnumArray<V> + EnumArray<Option<V>> + Deserialize<'de>,
    V: Deserialize<'de>,
{
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        if deserializer.is_human_readable() {
            deserializer.deserialize_map(HumanReadableVisitor(PhantomData))
        } else {
            deserializer.deserialize_tuple(K::LENGTH, CompactVisitor(PhantomData))
        }
    }
}

After adding some debug, it turns out that it does NOT go into serializer.is_human_readable() branch but when decoding, it does go into deserializer.is_human_readable() branch. So in our case, it writes a sequence and then tries to decode a map and it fails.

Removing #[serde(flatten)] attribute makes the round-trip serialisation work.

#1919 seems very related but it seems to have been closed: I don't understand why. Clearly, this is a bug right? It was explained to be so in https://github.com/serde-rs/serde/pull/1919#issuecomment-787594062 and flatten is specifically mentioned there.

Fuuzetsu avatar Nov 30 '22 05:11 Fuuzetsu

Reason it doesn't work is that if #[serde(flatten)] or #[serde(untagged)] or other such are used then the real deserializer gets replaced by ContentDeserializer or ContentRefDeserializer to allow attempting to deserialize the data multiple times. These other deserializers do not carry over the is_human_readable() flag from the original deserializer. It is a bug in serde, quite clearly, if the is_human_readable() flag is supposed to work.

nakedible avatar Jan 22 '24 22:01 nakedible