serde-xml-rs

Deserialization for elements and attributes with ":" in name

Open spease opened this issue 7 years ago • 16 comments

This is a considerable problem when trying to parse DTDs. Replacing the colons with underscores allows for parsing otherwise.

spease avatar Feb 20 '18 05:02 spease

Could you please provide more details?

RReverser avatar Feb 20 '18 22:02 RReverser

https://raw.githubusercontent.com/HUPO-PSI/mzML/master/schema/schema_1.1/mzML1.1.0.xsd

The xml: and xs: in tag and attribute names would not parse (using Serde rename to set the exact name for a Rust struct member). After renaming with _ instead of :, I was able to parse it.
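The renaming workaround described above could be automated as a pre-processing pass before the document is handed to the deserializer. The following is only a minimal sketch using the standard library: the function name `replace_colons_in_names` is hypothetical, and it assumes the input contains no comments, CDATA sections, or DOCTYPE declarations, which a real tokenizer would need to handle.

```rust
/// Replace ':' with '_' in element and attribute names, leaving text
/// content and quoted attribute values untouched.
///
/// Assumption: the input has no comments, CDATA sections, or DOCTYPE
/// declarations; this is a character scan, not a real XML tokenizer.
fn replace_colons_in_names(xml: &str) -> String {
    let mut out = String::with_capacity(xml.len());
    let mut in_tag = false; // true between '<' and '>'
    let mut quote: Option<char> = None; // Some(q) inside an attribute value

    for c in xml.chars() {
        match (in_tag, quote, c) {
            // Closing quote of an attribute value.
            (true, Some(q), _) if c == q => quote = None,
            // Inside an attribute value: copy the character unchanged.
            (true, Some(_), _) => {}
            // Opening quote of an attribute value.
            (true, None, '"') | (true, None, '\'') => quote = Some(c),
            // End of the tag.
            (true, None, '>') => in_tag = false,
            // A colon in an element or attribute name: rewrite it.
            (true, None, ':') => {
                out.push('_');
                continue;
            }
            // Start of a tag.
            (false, _, '<') => in_tag = true,
            _ => {}
        }
        out.push(c);
    }
    out
}
```

With this pass applied, a struct field would be renamed to, for example, `xs_element` instead of `xs:element`. Note that namespace declarations such as `xmlns:xs` are rewritten too, so the result is no longer namespace-aware XML.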

spease avatar Feb 21 '18 02:02 spease

I mean, details on how you see this working on the Rust side. Obviously : is not valid in a Rust identifier, so you still need to either ignore the part before :, as the library currently does, or emit the full name and force every consumer of such a name to use serde(rename). Not sure which one is better. Or do you have other suggestions?

RReverser avatar Feb 23 '18 10:02 RReverser

In the old xml parser I simply stripped such prefixes. I have no idea what they even do, and if everyone just ignores them anyway, I don't see a reason to keep them.

oli-obk avatar Feb 23 '18 10:02 oli-obk

Yeah, that's what I'm doing too.

RReverser avatar Feb 23 '18 10:02 RReverser

@oli-obk While you're here - could you please reply to https://github.com/serde-rs/xml/issues/35#issuecomment-343310659? I left it a while ago but still don't know if it's desirable :)

RReverser avatar Feb 23 '18 10:02 RReverser

@RReverser I'm accustomed to serde libraries using #[serde(rename)] for such cases rather than throwing out part of the identifier. That's what I've generally done with csv files for sure, but I think I've had the problem with json as well.

A bit of googling shows that part of the identifier is used for namespaces, so beyond being counterintuitive (at least to me) it seems like this would lead to potential name collisions and prevent validation (items with a wrong or nonexistent namespace could not be distinguished from those in the expected namespace).

@dtolnay may have a better overview of what libraries tend to do however
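To illustrate the namespace point, here is a rough sketch of how a namespace-aware consumer might tell names apart by resolving each prefix to its namespace URI. The `expand` function and the `ExpandedName` alias are hypothetical names for illustration, not part of any library, and the sketch ignores default (unprefixed) namespace declarations.

```rust
use std::collections::HashMap;

/// An expanded name: (namespace URI if the name had a prefix, local name).
type ExpandedName<'a> = (Option<&'a str>, &'a str);

/// Resolve a qualified name such as "itunes:title" against a map from
/// prefixes to namespace URIs. Returns `None` for an undeclared prefix.
///
/// Assumption: unprefixed names are treated as having no namespace;
/// default namespace declarations are not modelled here.
fn expand<'a>(qname: &'a str, prefixes: &HashMap<&str, &'a str>) -> Option<ExpandedName<'a>> {
    match qname.split_once(':') {
        Some((prefix, local)) => prefixes.get(prefix).map(|uri| (Some(*uri), local)),
        None => Some((None, qname)),
    }
}
```

Under this scheme two names with the same local part but different namespaces compare as different, and a name with an undeclared prefix can be rejected instead of being silently accepted.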

spease avatar Feb 23 '18 18:02 spease

I've encountered a problem with wordpress xml which has content:encoded and excerpt:encoded tags. I'm getting:

Error(Custom("duplicate field `encoded`")

See https://gist.github.com/iwek/3977831

TatriX avatar Mar 29 '19 13:03 TatriX

I did a workaround for now:

   pub encoded: Vec<String>, // encoded[0] is `content:encoded`

Though it's not very reliable.

TatriX avatar Mar 29 '19 13:03 TatriX

Hello,

I stumbled on this issue. Another way to reproduce it is with XML such as:

<title>This is the title</title>
<itunes:title>This is the repeated title because why not</itunes:title>

so no suggested workaround works (using serde rename or parsing the field as an array).

Can you advise on the suggested way to cope with this without touching the source XML? Sadly, I'm already thinking of pre-processing the XML as suggested by @spease, or dropping this crate and parsing the XML directly (e.g. with quick-xml).

Thank you in advance for any hint (I just tried this library, so I may have missed something).

EDIT: on second thought, this seems to work

    #[serde(rename = "itunes:title", default)]
    title: String,

and ignore the other title field.

apiraino avatar Oct 09 '19 16:10 apiraino

The problem is that this library has very limited support for namespaces. The deserializer will ignore the namespace. The serializer is currently incapable of generating a document using namespaces.

@apiraino, are you sure that the title field is really filled using your workaround? I think that it will always contain the default value which is an empty string.

punkstarman avatar Oct 09 '19 19:10 punkstarman

Hey @punkstarman, thanks for the reply. Damn, you're right, the field is set to the default empty string (I was confused by too many fields).

Besides the lack of support for namespaces (which is a feature), the real issue I see is that the parser panics when a tag with a namespace is found. Is there a way to avoid this? I'd rather avoid pre-processing the XML.

I hope this library's development will move forward; it's actually the only good option for working on XML files the way we're used to with serde. Thanks for working on this library!

apiraino avatar Oct 09 '19 21:10 apiraino

The parser doesn't panic when it encounters a tag with namespace. It just lops off the namespace part and produces a field with the remainder. The parser panics when it tries to fit two XML elements with the same name into a single Rust struct field that is of collection type.
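The behaviour described above can be mimicked with a small standalone sketch. This is not the library's actual code: `local_name` and `collect_fields` are hypothetical helpers that only model "drop the prefix, then treat a repeated name as a duplicate field", using the element names reported earlier in the thread.

```rust
use std::collections::HashMap;

/// Return the part of a qualified name after the prefix,
/// e.g. "encoded" for "content:encoded".
fn local_name(qname: &str) -> &str {
    match qname.split_once(':') {
        Some((_prefix, local)) => local,
        None => qname,
    }
}

/// Hypothetical model of filling single-valued fields: each element is
/// keyed by its local name, and a repeated key is reported as an error.
fn collect_fields<'a>(
    elements: &[(&'a str, &'a str)],
) -> Result<HashMap<&'a str, &'a str>, String> {
    let mut fields = HashMap::new();
    for &(qname, value) in elements {
        let key = local_name(qname);
        // `insert` returns the previous value if the key was already present.
        if fields.insert(key, value).is_some() {
            return Err(format!("duplicate field `{}`", key));
        }
    }
    Ok(fields)
}
```

Feeding it `content:encoded` followed by `excerpt:encoded` yields an error for `encoded`, because both names collapse to the same key once the prefix is dropped.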

punkstarman avatar Oct 10 '19 07:10 punkstarman

This seems like it should be an error rather than a panic. An application could recover from it.

spease avatar Oct 10 '19 16:10 spease

@spease, panic was a poor choice of words. It is in fact an error (for example see https://github.com/RReverser/serde-xml-rs/issues/64#issuecomment-477991946).

punkstarman avatar Oct 11 '19 08:10 punkstarman

I am trying to parse an RSS feed, and the relevant part looks like this:

<link>...</link>
<atom:link href="..." rel="self" type="application/rss+xml"/>

I want to get the value of link.

The thread doesn't seem to have a concrete solution for this, but I'm posting anyway in case someone came up with one and just didn't reply.

I tried setting link to a String vector but that doesn't work for me.

Is there perhaps a different workaround for this since the atom:link element does not even have a value unlike link?

Any news on this?

AntoniosBarotsis avatar Aug 30 '22 16:08 AntoniosBarotsis