serde
serde copied to clipboard
Untagged enum deserializing expects human-readable representation even with non human-readable data formats
Some data types like std::net::IpAddr
has two different representations for human-readable and non-human-readable Deserializers. However, the built-in deserializers that work with enums do not recognize human-readability of the data format, and expect a representation different from what serializers produce. The following example shows a round trip serialization with msgpack
that fails during deserialization.
// [dependencies]
// rmp-serde = "1"
// rmpv = "1"
// serde = { version = "1", features = ["derive"] }
use serde::{Deserialize, Serialize};
use std::{io::Cursor, net::IpAddr};
#[derive(Debug, Serialize, Deserialize)]
#[serde(untagged)]
enum HostName {
Ip(IpAddr),
DomainName(String),
}
fn main() {
// Serialize into MessagePack bytes.
let mp_data = rmp_serde::to_vec(&HostName::Ip([8, 8, 8, 8].into())).unwrap();
// Inspect the data structure
let value = rmpv::decode::value::read_value(&mut Cursor::new(mp_data.clone())).unwrap();
println!("{}", value); // {"V4": [8, 8, 8, 8]}
// Deserialize back
rmp_serde::from_slice::<HostName>(&mp_data).unwrap(); // Panic!
// A MessagePack encoded string "1.1.1.1"
let mp_data = vec![0xA7u8, 0x31, 0x2E, 0x31, 0x2E, 0x31, 0x2E, 0x31];
println!("{:?}", rmp_serde::from_slice::<HostName>(&mp_data).unwrap()); // Ip(1.1.1.1)
}
In this example, a ContentRefDeserializer
is used behind the scene to pick a suitable variant of HostName
. The implementation does not override is_human_readable
method on Deserializer
trait, thus uses the default value true
. Therefore, the non-human-readable representation of IpAddr {"V4": [8, 8, 8, 8]}
is rejected. Any suggestions?
PS: #1919 seems related.
replying to @dtolnay's related comment on https://github.com/serde-rs/serde/pull/1919#pullrequestreview-600321475:
I'll go ahead and close this in the absence of a use case.
@dtolnay: a concrete use case is the zerovec
crate, which uses a different serialization format depending on whether the (de)serializer is human readable: https://github.com/unicode-org/icu4x/blob/0e395d9c03f8d98bd3c37a0bba7bced4d5511529/utils/zerovec/src/zerovec/serde.rs#L72-L76
Unfortunately, it's really easy to embed ZeroVec
and similar types within, e.g., an untagged enum that then fails to deserialize due to is_human_readable
not being propagated through ContentRefDeserializer
and related internal deserializers.
Any type that has a Deserialize
implementation that depends on the deserializer's is_human_readable
field is going to run into this issue. None of these types can be safely embedded within untagged enums, and probablly can't use #[serde(flatten)]
or other features that also depend in intermediate deserializers. It makes sense that is_human_readable
was intended to be best-effort, but unfortunately it's being relied upon within the ecosystem. Worse, the bugs that result from this issue aren't discoverable until runtime, when round-trip deserialization fails.
While it's not ideal, I think the most straightforward solution here is to propagate is_human_readable
through all of the intermediate deserializers whenever possible.
cc: @Manishearth
Wait, why is this a problem with flatten? Flatten would be straightforward, no?
#[serde(flatten)]
uses FlatMapDeserializer
, which is one of the internal deserializers that doesn't propagate is_human_readable
from the underlying deserializer: https://github.com/serde-rs/serde/blob/fb2fe409c8f7ad6c95e3096e5e9ede865c8cfb49/serde_derive/src/de.rs#L2574
Oh, wait, flatten's different from transparent. Never mind.
We've also been hit by this. We're using rmp_serde
in combination with enum_map
crate. Simple code like below, fails.
use enum_map::*;
use serde::*;
fn main() {
let val = Foo {
bar: Bar {
baz: enum_map! {_ => ()},
},
};
let buf = rmp_serde::encode::to_vec(&val).unwrap();
// panic here
let _: Foo = rmp_serde::decode::from_slice(&buf).unwrap();
}
#[derive(Serialize, Deserialize)]
struct Foo {
#[serde(flatten)]
bar: Bar,
}
#[derive(Serialize, Deserialize)]
struct Bar {
baz: EnumMap<bool, ()>,
}
We have spent some time trying to figure out why. In enum_map
there's code like this:
/// Requires crate feature `"serde"`
impl<K: EnumArray<V> + Serialize, V: Serialize> Serialize for EnumMap<K, V> {
fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
if serializer.is_human_readable() {
serializer.collect_map(self)
} else {
let mut tup = serializer.serialize_tuple(self.len())?;
for value in self.values() {
tup.serialize_element(value)?;
}
tup.end()
}
}
}
/// Requires crate feature `"serde"`
impl<'de, K, V> Deserialize<'de> for EnumMap<K, V>
where
K: EnumArray<V> + EnumArray<Option<V>> + Deserialize<'de>,
V: Deserialize<'de>,
{
fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
if deserializer.is_human_readable() {
deserializer.deserialize_map(HumanReadableVisitor(PhantomData))
} else {
deserializer.deserialize_tuple(K::LENGTH, CompactVisitor(PhantomData))
}
}
}
After adding some debug, it turns out that it does NOT go into serializer.is_human_readable()
branch but when decoding, it does go into deserializer.is_human_readable()
branch. So in our case, it writes a sequence and then tries to decode a map and it fails.
Removing #[serde(flatten)]
attribute makes the round-trip serialisation work.
#1919 seems very related but it seems to have been closed: I don't understand why. Clearly, this is a bug right? It was explained to be so in https://github.com/serde-rs/serde/pull/1919#issuecomment-787594062 and flatten is specifically mentioned there.
Reason it doesn't work is that if #[serde(flatten)]
or #[serde(untagged)]
or other such are used then the real deserializer gets replaced by ContentDeserializer
or ContentRefDeserializer
to allow attempting to deserialize the data multiple times. These other deserializers do not carry over the is_human_readable()
flag from the original deserializer. It is a bug in serde, quite clearly, if the is_human_readable()
flag is supposed to work.