serde icon indicating copy to clipboard operation
serde copied to clipboard

Avoid lossy buffering in #[serde(flatten)]

Open sfackler opened this issue 2 years ago • 12 comments

The Deserialize implementation of a type with a field annotated #[serde(flatten)] currently buffers up all of the unknown map entries in memory, and then deserializes the flattened value from them at the end. This approach has a couple of downsides:

  1. It causes the deserializer to allocate when that would normally not be the case.
  2. It confuses "smart" deserializers - things like serde_ignored_value or serde_json's error position tracking don't work properly.
  3. It can lose out on special casing in deserializers based on hinting.

Fortunately, there's alternate approach that avoids all of these issues! Instead of buffering, we can create a custom MapAccess which intercepts fields that we want to handle in the top-level type, and then returning the remainder to the nested deserialize implementation: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=96546d7ba882b70042268a0b48de6286

sfackler avatar Mar 09 '22 02:03 sfackler

This is a very interesting idea. I have a couple of remarks / incompatibility observations. Maybe some of them can be resolved.

Currently, the Foo type ultimately decides if a field name is known or unknown. This leads to an incompatibility currently, if deny_unknown_fields is added to Foo. With the normal flatten extra fields will be ignored, since ExtendedFoo allows them. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=4e249bb1961ff44945606ed6bdcdab7d

I think that can be fixed by adding an Ignore variant to ExtendedFooFields, similar to the current derive. An unknown field error from the seed of ExtendedFooFieldsVisitor must be converted into the Ignore variant, but all other error likely should remain. Inspecting the errors might be complicated, since they are very opaque.

It is unclear to me how this could be extended to more than one flattened field. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=7451b4980894d2ddda33b7ce3fc0aa9c An additional complication here is the fact, that the same value can be deserialized multiple times. A map visitor (i.e., Value) will not consume the fields from the internal buffer, but a struct visitor (i.e., Sub) will. This leads the second Value to share all fields with the first except those consumed by the Sub visitor.

Output
[src/main.rs:27] d = Data {
    f0: Object({
        "x": String(
			"foo",
        ),
        "y": String(
            "bar",
        ),
        "z": String(
            "baz",
        ),
    }),
    a: 123,
    f1: Sub {
        y: "bar",
    },
    b: 456,
    f2: Object({
        "x": String(
            "foo",
        ),
        "z": String(
            "baz",
        ),
    }),
}

jonasbb avatar Mar 09 '22 17:03 jonasbb

An unknown field error from the seed of ExtendedFooFieldsVisitor must be converted into the Ignore variant, but all other error likely should remain. Inspecting the errors might be complicated, since they are very opaque.

We should be able to handle this by making a custom error type used by the MapAccess implementation that intercepts unknown field errors and delegates the rest to the inner MapAccess's error type.

It is unclear to me how this could be extended to more than one flattened field.

Wow, I was not aware that was a thing you could do! I think you are stuck buffering there to allow the key value to be deserialized multiple times unfortunately. We could continue using the old logic in that case.

sfackler avatar Mar 09 '22 17:03 sfackler

Hmm - we can't actually filter the errors since the error will terminate the deserialization of the inner value entirely...

EDIT: Ah - we may be able to take an approach something like what FlatStructAccess does of remembering the delegate's field list and checking that in our ExtendedFooFieldsSeed.

sfackler avatar Mar 09 '22 21:03 sfackler

On the other hand, the docs on serde.rs do explicitly state that flatten isn't compatible with deny_unknown_fields: https://serde.rs/field-attrs.html#flatten.

sfackler avatar Mar 09 '22 21:03 sfackler

Here's a working implementation of that approach: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=b9b5b874c6f4c93d6d82c1bed4e6451e. If you did want to allow unknown fields in ExtendedFoo, we'd codegen the extra Ignored variant for ExtendedFooFields and return that instead of an error in ExtendedFooFieldsSeed.

sfackler avatar Mar 10 '22 01:03 sfackler

I think, that using a json is a bad example to demonstrate absence of "lossy deserialization", because JSON contains some information about value type. Try some untyped format, like XML (but there are no good XML serde adapter, so using some toy format would be better)

Mingun avatar Mar 10 '22 04:03 Mingun

I coincidentally made a fully working implementation of this this evening before even seeing this thread! You don't need to depend at all on looking for particular deserialize errors / missing_field notifications from the flattened inner type, because you already know ahead of time what the names of the fields are that you're interested in. You can minimize codegen by creating a new, private trait (I called mine KeyCapture):

pub enum KeyOutcome<T> {
    Accepted(T),
    Rejected,
}

pub trait KeyCapture<'de> {
    // Returned from `try_send_key` to communicate state to `send_value`.
    type Token;

    fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result;

    fn try_send_key<E>(&mut self, key: &[u8]) -> Result<KeyOutcome<Self::Token>, E>
    where
        E: de::Error;

    fn send_value<D>(&mut self, token: Self::Token, value: D) -> Result<(), D::Error>
    where
        D: de::Deserializer<'de>;
}

In the struct macro codegen, you implement this trait such that it "rejects" keys that are not part of the struct; these keys can then be transparently forwarded to the type being flattened. Once you have this trait, you can use a series of private, reusable adapter types to deserialize the inner flattened field while simultaneously sending data into the KeyCapture. Some snippets:

// From MapAccess:
    fn next_key_seed<K>(&mut self, mut seed: K) -> Result<Option<K::Value>, Self::Error>
    where
        K: de::DeserializeSeed<'de>,
    {
        // We're extracting a key for the type. Try to send keys to the `KeyCapture`,
        // and the first key that is rejected by the `KeyCapture` are is instead to the
        // `seed`
        loop {
            seed = match self.map.next_key_seed(ExtractKeySeed {
                seed,
                capture: &mut self.capture,
            })? {
                None => {
                    self.drained = true;
                    return Ok(None);
                }
                Some(ExtractKeySeedResult::Deserialized(value)) => return Ok(Some(value)),
                Some(ExtractKeySeedResult::Captured { seed, token }) => {
                    self.map.next_value_seed(ExtractValueSeed {
                        token,
                        capture: &mut self.capture,
                    })?;

                    seed
                }
            };
        }
    }

// From Visitor:
    fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
    where
        E: de::Error,
    {
        match self.capture.try_send_key(v.as_bytes())? {
            KeyOutcome::Accepted(token) => Ok(ExtractKeySeedResult::Captured {
                seed: self.seed,
                token,
            }),
            KeyOutcome::Rejected => self
                .seed
                .deserialize(v.into_deserializer())
                .map(ExtractKeySeedResult::Deserialized),
        }
    }

Here's what an example codegen might look like for a type called Data2:

struct Data2 {
    key3: String,

    // #[serde(flatten)]
    inner: Data1,
}

impl<'de> de::Deserialize<'de> for Data2 {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        enum Field {
            key3
        }

        #[derive(Default)]
        struct Data2Capture {
            key3: Option<String>,
        }

        impl<'de, 'a> KeyCapture<'de> for &'a mut Data2Capture {
            type Token = Field;

            fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
                write!(formatter, "a Data2 struct")
            }

            #[inline]
            fn try_send_key<E>(&mut self, key: &[u8]) -> Result<KeyOutcome<Field>, E>
            where
                E: de::Error,
            {
                Ok(match key {
                    b"key3" => KeyOutcome::Accepted(Field::key3),
                    _ => KeyOutcome::Rejected,
                })
            }

            #[inline]
            fn send_value<D>(&mut self, token: Self::Token, value: D) -> Result<(), D::Error>
            where
                D: de::Deserializer<'de>,
            {
                match token {
                    Field::key3 => de::Deserialize::deserialize(value).map(|value| {
                        self.key3 = Some(value);
                    }),
                }
            }
        }

        let mut capture = Data2Capture::default();

        let inner = Data1::deserialize(ExtractDeserializer {
            deserializer,
            capture: &mut capture,
        })?;

        let key3 = match capture.key3 {
            Some(key3) => key3,
            None => return Err(de::Error::missing_field("key3")),
        };

        Ok(Self { inner, key3 })
    }

I was additionally able to determine that it is outright impossible (without buffering) to have more than one flattened field. In order to pull that off, you'd need to implement Deserialize for this primitive in a way that spreads the keys out to both fields:

struct MapPair<T1, T2> { pub map1: T1, pub map2: T2 }

You can put together a series of wrapper types and get pretty far, but eventually you end up at this point:

    // We have a pair of visitors, visitor1 and visitor2. We build our own sequence type and 
    // visit it to sequence 1:
    fn next_key_seed<K>(&mut self, mut seed: K) -> Result<Option<K::Value>, Self::Error>
    where
        K: de::DeserializeSeed<'de>
    {
        // `seed` is the receiver for `map1`'s value, and we have a `visitor2` lying around.
        // At this point, we need to loop over the internal sequence type to try to send keys
        // to `visitor2`, and detect an `unrecognized_field` error to feed to the seed.
        // Unfortunately the only interface we have to `visitor2` is `visit_seq`, which must
        // build a completed `map2`. We're at an impasse.
    }

It's basically the same reason that you can't unpack an Iterator of (T1, T2) into a pair of types that are FromIterator<T1>, FromIterator<T2>.

Lucretiel avatar Mar 10 '22 07:03 Lucretiel

Completed implementation for consideration: https://github.com/Lucretiel/serde-bufferless

Lucretiel avatar Mar 12 '22 04:03 Lucretiel

I think it is better to fix this generically, not just for the case with one flattened field, because in that case it creates a friction when users realising that small change in definitions of their types will broke deserialization.

To achieve that, changes only on deserializer side is not enough. I work on this problem, and you can see my unfinished work at https://github.com/serde-rs/serde/compare/master...Mingun:flatten. The actual changes begins from 1071b39e63077789beb9e55af7384462db72a3bd, the other is a prepare work.

The key idea in that #[derive(Deserialize)] implements additional trait DeserializeEmbed on types that could be flattened. This approach solves many problems:

  • you no longer can add #[serde(flatten)] on fields of non-flattanable types, such as String, because DeserializeEmbed is not implemented for strings (see eac15662a5e045918e93e64b5a7215fe12699307)
  • you now deserialize each field only once (not consuming flattened fields from deserializer is a mistake)
  • you can pass flattened field to any level (when you flatten in flatten in flatten...)
  • you no longer need a hack for support Option in untagged enums: https://github.com/serde-rs/serde/blob/c3ce2c934a9123b5a309256d65cca6b6d19466ec/serde/src/de/mod.rs#L1672-L1679
  • I expect that almost all flatten problems will be solved by this

I do not remember the exact status of readiness of this, but I remember that only one of the enum representation options remained to be implemented (seems untagged representation).

Chosen approach has downsides, but their are not so big:

  • the main downside is that flattenable/non-flattenable is a trait of a type, not a deserialize operation, but DeserializeEmbed derivered on operation basis. This is the fundamental restriction, because I trying to make this invasive as minimal as I possible and I almost approached that: except two scenarios, described below the end-user don't need to change in their code nothing
  • strictly speaking, this is a breaking changes, because your code could stop to compile after this change. But serde anyway doesn't follow semver, so the users should be prepare for that (of course, I supposed to include this only in serde 2 and start to follow semver).

The two scenarios is following:

  1. User implement Deserialize manually and that type is flattened into another type. I think this is very rare case
  2. User try to flatten field of type that cannot be flattened, such of String example above. This is in any case an error in the user code that anyway should be fixed, because currently such types will always deserialized with an error which, obviously, not what the user want. Unfortunately, because DeserializeEmbed implemented only in #[derive(Deserialize)], having only #[derive(Serialize)] on the type still allow you to create an erroneous type. I don't know, is it possible to automatically implement some trait twice from different derives without ambiguity, which is by good required here

Mingun avatar Mar 12 '22 12:03 Mingun

Ha, reinvented the same wheel (deserializing with just one flat field without buffers) recently and only now discovered this thread. I suppose I should try to switch to serde-bufferless linked above instead.

RReverser avatar Apr 02 '23 17:04 RReverser

@Lucretiel Any chance you'd want to add a proc-macro to your repo and publish it to crates.io?

RReverser avatar May 11 '23 22:05 RReverser