quick-xml

Add ability to deserialize serde types from `Reader`

Open ndtoan96 opened this issue 1 year ago • 6 comments

When working with deeply nested XML, we are usually interested only in a small part of the tree close to a leaf node. My idea is to extract the string of the target node and deserialize it with serde, but I can't find a convenient way to do that.

Currently I use read_text to get the inner content of the node and add the start and end tags manually, but the resulting code looks really awkward, especially when the node has many attributes. It would be great to have a method (read_node or something similar) that does this.
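For illustration, the manual workaround described above could be sketched roughly as follows. This uses only the standard library; `wrap_fragment` is a hypothetical helper and not part of quick-xml, and it assumes that the attribute values and the inner text are already XML-escaped (as they would be if copied verbatim from the source document).

```rust
/// Hypothetical helper (not a quick-xml API): rebuilds
/// `<name key="value" ...>inner</name>` from pieces collected while reading,
/// so that the resulting string can be handed to a serde deserializer.
/// Assumption: attribute values and `inner` are already XML-escaped.
fn wrap_fragment(name: &str, attrs: &[(&str, &str)], inner: &str) -> String {
    let mut out = String::with_capacity(inner.len() + 2 * name.len() + 5);
    // Start tag with attributes.
    out.push('<');
    out.push_str(name);
    for (key, value) in attrs {
        out.push(' ');
        out.push_str(key);
        out.push_str("=\"");
        out.push_str(value);
        out.push('"');
    }
    out.push('>');
    // Inner content as returned by the reader, followed by the end tag.
    out.push_str(inner);
    out.push_str("</");
    out.push_str(name);
    out.push('>');
    out
}
```

Calling `wrap_fragment("item", &[("id", "1")], "<name>x</name>")` yields a string that starts with the restored start tag and ends with `</item>`. The extra allocation and the need to carry every attribute through by hand are what make this approach awkward.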

By the way, is there any reason why read_text is not implemented for Reader<File>?

ndtoan96, Jun 05 '23 13:06

Having a deserialize method on Reader that can deserialize a piece of XML into a type using serde, starting from the current position, is definitely a feature I also want -- as a counterpart to #610. The implementation, however, is not so simple: a serde deserializer requires some (potentially unbounded) lookahead, so we need to buffer events somewhere.

The possible API could look something like this:

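impl<'a> Reader<&'a [u8]> {
  /// Borrowing variant: the deserialized data may borrow from the input slice.
  fn deserialize<T>(&mut self, seed: Event<'a>) -> Result<T, DeError>
  where
    T: Deserialize<'a>,
  { todo!() }
}

impl<R: Read> Reader<R> {
  /// Buffered variant: the deserialized data may borrow from `buffer`.
  fn deserialize_into<'de, T>(&mut self, seed: Event<'de>, buffer: &'de mut Vec<u8>) -> Result<T, DeError>
  where
    T: Deserialize<'de>,
  { todo!() }
}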

The seed here is an event that we got from the Reader in a typical read cycle and that will likely be part of the type we want to deserialize.

Another possible API (very schematic):

impl<R> Reader<R> {
  fn deserializer<'a>(&mut self, seed: Event<'a>) -> FragmentDeserializer<'a> { ... }
}

struct FragmentDeserializer<'a> { ... }
impl<'a> FragmentDeserializer<'a> {
  fn deserialize<T>(self) -> Result<T, DeError>
  where
    T: Deserialize<'a>,
  { ... }
  fn deserialize_into<'de, T>(self, buffer: &'de mut Vec<u8>) -> Result<T, DeError>
  where
    T: Deserialize<'de>,
  { ... }
}

Another question: in what state should we leave the Reader if deserialization fails? And how should we provide access to events that were consumed during lookahead but not used to deserialize the final type? If we want to call deserialize twice, the second call has to take into account the events looked ahead by the first one. We probably need a more generic API:

impl<R> Reader<R> {
  /// Convert to a reader that can store up to `count` events in the internal buffer
  fn lookahead(self, count: usize) -> LookaheadReader<R> { ... }
}

impl<'de, 'a, R> Deserializer<'de> for &'a mut LookaheadReader<R> { ... }
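To make the buffering idea more concrete, here is a minimal standard-library sketch of a bounded lookahead buffer. It is not quick-xml code: `Lookahead`, `peek_nth` and `next_event` are hypothetical names, and a plain iterator stands in for the event source that `Reader` would provide.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a reader with a bounded lookahead buffer.
/// `I` is any event source; a real implementation would wrap the `Reader`.
struct Lookahead<I: Iterator> {
    source: I,
    buffer: VecDeque<I::Item>,
    limit: usize,
}

impl<I: Iterator> Lookahead<I> {
    fn new(source: I, limit: usize) -> Self {
        Self { source, buffer: VecDeque::new(), limit }
    }

    /// Returns the `n`-th upcoming event without consuming it, or `None` if
    /// the source ends first or `n` does not fit into the buffer limit.
    fn peek_nth(&mut self, n: usize) -> Option<&I::Item> {
        if n >= self.limit {
            return None;
        }
        // Pull events from the source until position `n` is buffered.
        while self.buffer.len() <= n {
            let event = self.source.next()?;
            self.buffer.push_back(event);
        }
        self.buffer.get(n)
    }

    /// Consumes the next event, draining the buffer before the source.
    fn next_event(&mut self) -> Option<I::Item> {
        self.buffer.pop_front().or_else(|| self.source.next())
    }
}
```

Because looked-ahead events stay in the buffer until `next_event` takes them, nothing is lost when a deserialization attempt peeks further than it ends up consuming, and a second attempt sees the same events again. How to handle a lookahead that exceeds `limit` is left open here, as it is in the discussion above.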

Mingun, Jun 05 '23 18:06

> By the way, is there any reason why read_text is not implemented for Reader<File>?

It is not trivial to do that, because we cannot just reuse the read_to_end_into method -- it stores only the content of the tags in the buffer and skips markup characters (<, > and so on). Attempts to implement it are tracked in #483.

Mingun, Jun 05 '23 19:06

I would also like this. Go makes it easy to mix pull-based parsing with a state machine and deserializing structs:

	decoder := xml.NewDecoder(r.Body)
	decoder.Strict = true
	for {
		// Pull the next token; Token returns io.EOF at the end of input.
		t, err := decoder.Token()
		if err != nil {
			break
		}
		switch se := t.(type) {
		case xml.StartElement:
			level++
			switch se.Name.Local {
			case "fooTag":
				var req schema.FooRequest
				// DecodeElement also consumes the matching end element.
				err = decoder.DecodeElement(&req, &se)
				// do stuff
			case "barRequest":
				var req schema.BarRequest
				err = decoder.DecodeElement(&req, &se)
				// do stuff
			}
		case xml.EndElement:
			level--
		}
	}

I could live with an implementation that ties the lifetime of the Reader and the deserialized object to the source lifetime, i.e. only applies to readers backed by a &str.

tstenner, Aug 06 '24 12:08