quick-xml

Add documentation for mapping from XML to Rust used by deserializer

Open Mingun opened this issue 3 years ago • 2 comments

This is my vision for the further evolution of the serde integration in this crate. Some parts of it are debatable or may even be impossible to implement; this is the first iteration of what I would like to see. For now I'm opening this draft PR to:

  • share my vision
  • invite discussion
  • create a roadmap of necessary fixes (marked with FIXME in doctests)

You can build the rustdoc documentation by running

cargo doc --features serialize --open

in the crate root and then navigating to the quick_xml::de module.

Below is the (approximately) rendered version of this proposal:

Mapping XML to Rust types

Type names are never considered when deserializing, so you can name your types as you wish. Other general rules:

  • a struct field name can be represented in XML only as an attribute name or an element name.
  • an enum variant name can be represented in XML only as an attribute name or an element name.
  • the unit struct, the unit type () and a unit enum variant can be deserialized from any valid XML content:
    • attribute and element names
    • attribute and element values
    • text or CDATA content
  • when deserializing, attribute names take precedence over element names. So if your XML has both an attribute and an element with the same name, the Rust field/variant will be deserialized from the attribute.

NOTE: examples marked with FIXME: do not work yet; any PRs that fix them are welcome! The message after the marker is the test failure message. Also, all those tests are marked with the ignore option, although they compile. This is intentional, because rustdoc marks such blocks with an exclamation mark, unlike no_run blocks.

To parse all these XMLs... ...use that Rust type
The root tag name does not matter
<any-tag one="..." two="..."/>
<any-tag>
  <one>...</one>
  <two>...</two>
</any-tag>
<any-tag one="...">
  <two>...</two>
</any-tag>

NOTE: such XMLs are NOT supported, because the deserializer will always report a duplicated field error:

<any-tag field="...">
  <field>...</field>
</any-tag>

All these structs can be used to deserialize from the specified XML, depending on the amount of information that you want to get:

// Get both elements/attributes
struct AnyName {
  one: T,
  two: U,
}
// Get only one element/attribute, ignore other
struct AnyName {
  one: T,
}
// Ignore all attributes/elements
// You can also use the `()` type (unit type)
struct AnyName;

A structure where each XML attribute or child element is mapped to a field. Each attribute or element name becomes the name of a field. The name of the struct itself does not matter.

NOTE: XML allows you to have an attribute and an element with the same name inside one element. Such XMLs can't be deserialized, because serde does not allow passing custom properties to fields, so we cannot tell whether a field on the Rust side should be deserialized from the attribute or from the element.

Optional XML attributes/elements that you want to capture. The root tag name does not matter.
<any-tag optional="..."/>
<any-tag>
  <optional>...</optional>
</any-tag>
<any-tag/>

A structure with an optional field.

struct AnyName {
  optional: Option<T>,
}

When the XML attribute or element is present, type T will be deserialized from the attribute value (which is a string) or from the element (which is a string or a multi-mapping, i.e. a mapping that can have duplicated keys).

Text content, CDATA content

Text content and CDATA are mapped to any Rust type that can be deserialized from a string, for example String, &str and so on.

NOTE: deserialization to non-owned types (i.e. types that borrow from the input), such as &str, is possible only if you parse the document in the UTF-8 encoding and the text content does not contain escape sequences.
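
The borrowing restriction can be illustrated with a small standard-library-only sketch. It is not quick-xml's implementation; `unescape_text` is a hypothetical helper that handles only the five predefined XML entities. It shows why text without escape sequences can be handed out as a borrowed &str, while text containing them has to be copied into a new String:

```rust
use std::borrow::Cow;

/// Hypothetical helper (an assumption for illustration, not quick-xml's API):
/// resolves the five predefined XML entities and borrows from the input
/// whenever nothing has to be rewritten.
fn unescape_text(raw: &str) -> Cow<'_, str> {
    if !raw.contains('&') {
        // No escape sequences: the result can point straight into the input.
        return Cow::Borrowed(raw);
    }
    let known = [
        ("&lt;", '<'),
        ("&gt;", '>'),
        ("&amp;", '&'),
        ("&quot;", '"'),
        ("&apos;", '\''),
    ];
    let mut out = String::with_capacity(raw.len());
    let mut rest = raw;
    while let Some(pos) = rest.find('&') {
        // Copy everything before the `&` unchanged.
        out.push_str(&rest[..pos]);
        rest = &rest[pos..];
        match known.iter().find(|(name, _)| rest.starts_with(*name)) {
            Some((name, ch)) => {
                out.push(*ch);
                rest = &rest[name.len()..];
            }
            None => {
                // Unknown or malformed reference: keep the `&` literally.
                out.push('&');
                rest = &rest[1..];
            }
        }
    }
    out.push_str(rest);
    // The text changed, so it must live in a freshly allocated String.
    Cow::Owned(out)
}
```

A field of type &str can only be filled in the `Cow::Borrowed` case; in the `Cow::Owned` case the unescaped text no longer exists anywhere in the input, so an owned type such as String is needed.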

An XML with different root tag names.
<one field1="...">...</one>
<two field2="...">...</two>
<one>
  <field1>...</field1>
</one>
<two>
  <field2>...</field2>
</two>

An enum where each variant has the name of the root tag. The name of the enum itself does not matter.

All these types can be used to deserialize from the specified XML, depending on the amount of information that you want to get:

#[serde(rename_all = "snake_case")]
enum AnyName {
  One { field1: T },
  Two { field2: U },
}
type OtherType = ...;
#[serde(rename_all = "snake_case")]
enum AnyName {
  // `field1` content discarded
  One,
  // OtherType deserialized from the `field2` content
  Two(OtherType),
}
#[serde(rename_all = "snake_case")]
enum AnyName {
  One,
  // the <two> will be mapped to this
  #[serde(other)]
  Other,
}

You should have variants for all possible tag names in your enum or have a #[serde(other)] variant.
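
As a rough model of this rule, here is a hand-written, standard-library-only sketch of choosing a variant from the root tag name. The derived code does not look like this; `Strict`, `WithFallback` and both functions are hypothetical stand-ins, and the tag names `one` and `two` are the snake_case forms of the variant names used above:

```rust
/// Stand-in for an enum that lists every expected tag name.
#[derive(Debug, PartialEq)]
enum Strict {
    One,
    Two,
}

/// Stand-in for an enum with a `#[serde(other)]` catch-all variant.
#[derive(Debug, PartialEq)]
enum WithFallback {
    One,
    Other,
}

/// Hypothetical helper: without a catch-all, an unknown tag name is an error.
fn strict_from_tag(tag: &str) -> Result<Strict, String> {
    match tag {
        "one" => Ok(Strict::One),
        "two" => Ok(Strict::Two),
        other => Err(format!(
            "unknown variant `{}`, expected `one` or `two`",
            other
        )),
    }
}

/// Hypothetical helper: with a catch-all, every unknown tag name maps to `Other`.
fn fallback_from_tag(tag: &str) -> WithFallback {
    match tag {
        "one" => WithFallback::One,
        _ => WithFallback::Other,
    }
}
```

With these definitions, `fallback_from_tag("two")` yields `WithFallback::Other`, while `strict_from_tag("three")` returns an error.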

<xs:choice> inside another element.

<any-tag field="...">
  <one>...</one>
</any-tag>
<any-tag field="...">
  <two>...</two>
</any-tag>
<any-tag>
  <field>...</field>
  <one>...</one>
</any-tag>
<any-tag>
  <two>...</two>
  <field>...</field>
</any-tag>
The names of the enum, the struct, and the struct field do not matter.
// FIXME: Custom("missing field `$flatten`")
#[serde(rename_all = "snake_case")]
enum Choice {
  One,
  Two,
}
struct AnyName {
  field: ...,

  // Creates problems while deserializing inner
  // types in many cases due to
  // https://github.com/serde-rs/serde/issues/1183
  // #[serde(flatten)]
  /// Field name is ignored if it is renamed to
  /// `$flatten`
  #[serde(rename = "$flatten")]
  any_name: Choice,
}

Due to the selected workaround you can have only one flatten field in your structure. That will be checked at compile time by the serde derive macro.

A sequence with a strict order, possibly with mixed content (text and tags).
<one>...</one>
text
<![CDATA[cdata]]>
<two>...</two>
<one>...</one>

All elements are mapped to a heterogeneous sequential type: a tuple or a named tuple. Each element of the tuple should be deserializable from the nested element content (...), except for enum types, which are deserialized from the full element (<one>...</one>) so that they can use the element name to choose the right variant:

// FIXME: Custom("invalid length 3, expected tuple
//                struct AnyName with 5 elements")
type One = ...;
type Two = ...;
#[derive(Debug, PartialEq, serde::Deserialize)]
struct AnyName(One, String, String, Two, One);
// FIXME: Custom("invalid length 3, expected
//                a tuple of size 5")
#[serde(rename_all = "snake_case")]
enum Choice {
  One,
}
type Two = ...;
type AnyName = (Choice, String, String, Two, Choice);
A sequence with a non-strict order, possibly with mixed content (text and tags).
<one>...</one>
text
<![CDATA[cdata]]>
<two>...</two>
<one>...</one>
A homogeneous sequence of elements with a fixed or dynamic size.
// FIXME: Unsupported("Invalid event for Enum,
//                     expecting `Text` or `Start`")
#[serde(rename_all = "snake_case")]
enum Choice {
  One,
  Two,
  #[serde(other)]
  Other,
}
type AnyName = [Choice; 5];
// FIXME: Custom("unknown variant `text`, expected
//                one of `one`, `two`, `$value`")
#[serde(rename_all = "snake_case")]
enum Choice {
  One,
  Two,
  #[serde(rename = "$value")]
  Other(String),
}
type AnyName = Vec<Choice>;
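
To make the intended result of the Vec<Choice> mapping concrete, here is a standard-library-only sketch that works on pre-tokenized input. It is only a model of the proposal, not the deserializer: `Item` and `map_items` are hypothetical names, and the assumption is that text and CDATA both end up in the variant renamed to `$value`:

```rust
/// Pre-tokenized mixed content; a stand-in for what an XML reader would yield.
enum Item<'a> {
    /// An element, identified by its name (its content is ignored here).
    Element(&'a str),
    /// Text or CDATA content.
    Text(&'a str),
}

/// Same shape as the `Choice` enum above, without the serde attributes.
#[derive(Debug, PartialEq)]
enum Choice {
    One,
    Two,
    /// Plays the role of the variant renamed to `$value`.
    Other(String),
}

/// Hypothetical helper: maps every item to one `Choice`, keeping document order.
fn map_items(items: &[Item<'_>]) -> Result<Vec<Choice>, String> {
    items
        .iter()
        .map(|item| match item {
            Item::Element(name) => match *name {
                "one" => Ok(Choice::One),
                "two" => Ok(Choice::Two),
                other => Err(format!("unknown variant `{}`", other)),
            },
            Item::Text(text) => Ok(Choice::Other((*text).to_owned())),
        })
        .collect()
}
```

For the sample above (`<one>`, text, CDATA, `<two>`, `<one>`), this model produces `One`, `Other("text")`, `Other("cdata")`, `Two`, `One`, in that order.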
A sequence with a strict order, possibly with mixed content (text and tags), inside another element.
<any-tag>
  <one>...</one>
  text
  <![CDATA[cdata]]>
  <two>...</two>
  <one>...</one>
</any-tag>

A structure where all child elements are mapped to one field that has a heterogeneous sequential type: a tuple or a named tuple. Each element of the tuple should be deserializable from the nested element content (...), except for enum types, which are deserialized from the full element (<one>...</one>):

// FIXME: Custom("missing field `$flatten`")
type One = ...;
type Two = ...;
struct AnyName {
  // Not (yet?) supported by serde
  // https://github.com/serde-rs/serde/issues/1905
  // #[serde(flatten)]
  /// Field name is ignored if it is renamed to
  /// `$flatten`
  #[serde(rename = "$flatten")]
  any_name: (One, String, String, Two, One),
}
// FIXME: Custom("missing field `$flatten`")
type One = ...;
type Two = ...;
struct NamedTuple(One, String, String, Two, One);
struct AnyName {
  // Not (yet?) supported by serde
  // https://github.com/serde-rs/serde/issues/1905
  // #[serde(flatten)]
  /// Field name is ignored if it is renamed to
  /// `$flatten`
  #[serde(rename = "$flatten")]
  any_name: NamedTuple,
}
A sequence with a non-strict order, possibly with mixed content (text and tags), inside another element.
<any-tag>
  <one>...</one>
  text
  <![CDATA[cdata]]>
  <two>...</two>
  <one>...</one>
</any-tag>

A structure where all child elements are mapped to one field that has a homogeneous sequential type: an array-like container. The container's element type T should be deserializable from the nested element content (...), unless it is an enum type, which is deserialized from the full element (<one>...</one>):

// FIXME: Custom("missing field `$flatten`")
#[serde(rename_all = "snake_case")]
enum Choice {
  One,
  Two,
  #[serde(rename = "$value")]
  Other(String),
}
struct AnyName {
  // Not (yet?) supported by serde
  // https://github.com/serde-rs/serde/issues/1905
  // #[serde(flatten)]
  /// Field name is ignored if it is renamed to
  /// `$flatten`
  #[serde(rename = "$flatten")]
  any_name: [Choice; 5],
}
// FIXME: Custom("missing field `$flatten`")
#[serde(rename_all = "snake_case")]
enum Choice {
  One,
  Two,
  #[serde(rename = "$value")]
  Other(String),
}
struct AnyName {
  // Not (yet?) supported by serde
  // https://github.com/serde-rs/serde/issues/1905
  // #[serde(flatten)]
  /// Field name is ignored if it is renamed to
  /// `$flatten`
  #[serde(rename = "$flatten")]
  any_name: Vec<Choice>,
}

Mingun avatar Mar 12 '22 20:03 Mingun

I'm not sure whether my approach is wrong or whether the following scenario should be added to the considerations. The comments in the code block outline what works versus what I would prefer in my case. It would be nice if the "InnerNested" struct could be removed entirely, since in my case it adds an extra step to access the Vec containing the "InnerNestedDetail" structs, which diverges from the standard in the XML.

use serde::{Deserialize, Serialize};
use quick_xml::de::{from_str, DeError};


#[derive(Debug, Deserialize, Serialize)]
struct InnerNestedDetail {
    #[serde(rename="modification_date",default)]
    modification_date: String, // String just for example
    #[serde(rename="version",default)]
    version: f32,
    #[serde(rename="description",default)]
    description: String,
}


// Prefer not to need this struct
#[derive(Debug, Deserialize, Serialize)]
struct InnerNested {
    // Not sure how this attribute's resulting functionality can be
    // implemented in the inner nested field of the OuterTag struct
    #[serde(rename = "Inner_Nested_Detail")]
    details: Vec<InnerNestedDetail>,
}


#[derive(Debug, Deserialize, Serialize)]
pub struct OuterTag {
    identifier: String,
    version: f32,
    // Prefer to have this field as Vec<InnerNestedDetail> and remove the
    // InnerNested struct altogether. The Inner_Nested_Detail cannot be
    // accessed without knowing its outer tag, InnerNested, first, but this
    // adds an extra step for accessing the data when the structs are populated
    #[serde(rename = "Inner_Nested")]
    inner_nested: InnerNested,
}

fn parse_xml(xml_string:&str) -> Result<OuterTag,DeError> {
    let test: OuterTag =  from_str(xml_string)?;
    Ok(test)
}

fn main(){
    let xml_string = "<Outer_Tag>
            <identifier>Some Identifier</identifier>
            <version>2.0</version>
            <Inner_Nested>
                <Inner_Nested_Detail>
                    <modification_date>2022-04-20</modification_date>
                    <version>1.0</version>
                    <description>Initial version.</description>
                </Inner_Nested_Detail>
                <Inner_Nested_Detail>
                    <modification_date>2022-05-20</modification_date>
                    <version>2.0</version>
                    <description>Modified version.</description>
                </Inner_Nested_Detail>
            </Inner_Nested>
        </Outer_Tag>";
    let result = parse_xml(xml_string).unwrap();

    println!("{:#?}",result);

    // In order to access the inner nested details vec, an extra step is necessary 
    println!("\nInner Nested Details Vec Current Access:\n{:#?}",result.inner_nested.details);

    // Preferred vec access
    //println!("\nInner Nested Details Vec Preferred Access:\n{:#?}",result.inner_nested);


}

RodogInfinite avatar May 28 '22 16:05 RodogInfinite

Because the inner_nested field represents the Inner_Nested tag of your XML, it is not possible to simply skip it in a trivial mapping. But you can always write a simple wrapper to use with #[serde(with)] that unpacks the sequence from the container: see https://github.com/tafia/quick-xml/issues/365#issuecomment-1120253466

You can also look at https://lib.rs/crates/serde-query if you are looking for a more generic solution. I think I'll mention both alternatives in the final version of the doc.
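
For illustration only, the unpacking idea can be sketched without serde at all. This is not the #[serde(with)] wrapper from the linked comment but a simpler substitute: keep a raw type that mirrors the XML, then convert it once into the shape you prefer. `OuterTagRaw` and the trimmed-down field lists are assumptions made for this sketch, and the serde derives are omitted:

```rust
/// One repeated <Inner_Nested_Detail> element (fields trimmed for brevity).
struct InnerNestedDetail {
    version: f32,
    description: String,
}

/// Mirrors the <Inner_Nested> wrapper element.
struct InnerNested {
    details: Vec<InnerNestedDetail>,
}

/// Hypothetical raw type with the shape dictated by the XML.
struct OuterTagRaw {
    identifier: String,
    inner_nested: InnerNested,
}

/// The shape the caller prefers: the wrapper level is gone.
struct OuterTag {
    identifier: String,
    inner_nested: Vec<InnerNestedDetail>,
}

impl From<OuterTagRaw> for OuterTag {
    fn from(raw: OuterTagRaw) -> Self {
        OuterTag {
            identifier: raw.identifier,
            // Unpack the sequence from its container element.
            inner_nested: raw.inner_nested.details,
        }
    }
}
```

After `OuterTag::from(raw)`, the details are reachable as `outer.inner_nested` with no intermediate `.details` step; the raw type would be the one that is actually deserialized.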

Mingun avatar May 28 '22 17:05 Mingun

@Mingun Is this PR still relevant?

dralley avatar Nov 19 '22 05:11 dralley

Technically, I'll open a new PR, because I cannot change this one: it comes from another repository and GitHub doesn't allow me to change it.

Mingun avatar Nov 19 '22 08:11 Mingun