serde-xml-rs
Deserialization for elements and attributes with ":" in name
This is a considerable problem when trying to parse DTDs. Replacing the colons with underscores allows for parsing otherwise.
Could you please provide more details?
https://raw.githubusercontent.com/HUPO-PSI/mzML/master/schema/schema_1.1/mzML1.1.0.xsd
The xml: and xs: in tag and attribute names would not parse (using Serde rename to set the exact name for a Rust struct member). After renaming with _ instead of :, I was able to parse it.
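For anyone who wants to try the same renaming without editing the file by hand, here is a minimal sketch of that pre-processing step using only the standard library. The function name `underscore_element_prefixes` is made up for illustration and is not part of serde-xml-rs. It assumes well-formed input, rewrites element names only (attribute names such as xml:lang are left alone), and does not treat a literal `<` inside comments or CDATA sections specially.

```rust
/// Replace every ':' in element names with '_' (hypothetical helper).
/// Attribute names, attribute values and text are copied through unchanged.
/// Assumption: no literal '<' appears inside comments or CDATA sections.
fn underscore_element_prefixes(xml: &str) -> String {
    let mut out = String::with_capacity(xml.len());
    let mut chars = xml.chars().peekable();
    while let Some(c) = chars.next() {
        out.push(c);
        if c != '<' {
            continue;
        }
        // Leave `<!...` (comments, DOCTYPE, CDATA) and `<?...` (processing
        // instructions) alone; the main loop copies them as-is.
        if matches!(chars.peek().copied(), Some('!' | '?')) {
            continue;
        }
        // Keep the '/' of a closing tag.
        if chars.peek() == Some(&'/') {
            out.push('/');
            chars.next();
        }
        // Copy the element name, swapping ':' for '_'.
        while let Some(&n) = chars.peek() {
            if n.is_whitespace() || n == '>' || n == '/' {
                break;
            }
            out.push(if n == ':' { '_' } else { n });
            chars.next();
        }
    }
    out
}
```

The rewritten string can then be handed to the deserializer, with struct fields renamed to the underscore form (for example `xs_element`). The cost is that the data no longer round-trips to the original names.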
I mean, details on how do you see this working on the Rust side? Obviously : is not a valid part of a Rust identifier, so you still need to either ignore part before : as library currently does or emit full name but then force every consumer of such name to use serde(rename). Not sure which one is better. Or do you have other suggestions?
In the old xml parser I simply stripped such prefixes. I have no idea what they even do, and if everyone just ignores them anyway, I don't see a reason to keep them.
Yeah, that's what I'm doing too.
@oli-obk While you're here - could you please reply to https://github.com/serde-rs/xml/issues/35#issuecomment-343310659? I left it a while ago but still don't know if it's desirable :)
@RReverser I'm accustomed to serde libraries using #[serde(rename)] for such cases rather than throwing out part of the identifier. That's what I've generally done with csv files for sure, but I think I've had the problem with json as well.
A bit of googling shows that part of the identifier is used for namespaces, so beyond being counterintuitive (at least to me) it seems like this would lead to potential name collisions and prevent validation (items with a wrong or nonexistent namespace could not be distinguished from those in the expected namespace).
@dtolnay may have a better overview of what libraries tend to do however
I've encountered a problem with wordpress xml which has content:encoded and excerpt:encoded tags. I'm getting:
Error(Custom("duplicate field `encoded`")
See https://gist.github.com/iwek/3977831
I did a workaround for now:
pub encoded: Vec<String>, // encoded[0] is `content:encoded`
Though it's not very reliable.
Hello,
stumbled on this issue. Another way to reproduce is also an XML such as:
<title>This is the title</title>
<itunes:title>This is the repeated title because why not</itunes:title>
so no suggested workaround works (using serde rename or parsing the field as an array).
Can you advise on the suggested way to cope with this without touching the source XML? Sadly, I'm already thinking of pre-processing the XML as suggested by @spease, or dropping this crate and parsing the XML directly (e.g. with quick-xml).
Thank you in advance for any hint (I just tried this library, so I may have missed something).
EDIT: on a second thought, this seems to work
#[serde(rename = "itunes:title", default)]
title: String,
and ignore the other title field.
The problem is that this library has very limited support for namespaces. The deserializer will ignore the namespace. The serializer is currently incapable of generating a document using namespaces.
@apiraino, are you sure that the title field is really filled using your workaround? I think that it will always contain the default value which is an empty string.
hey @punkstarman thanks for the reply. Damn, you're right, the field is set to a default empty string (I was confused by too many fields).
Besides the lack of support for namespaces (which is a feature), the real issue I see is that the parser panics when a tag with namespace is found. Is there a way to avoid this? I'd rather not pre-process the XML.
I hope that this library lifecycle will move forward, it's actually the only good option to work on XML files the way we're used with serde. Thanks for working on this library!
The parser doesn't panic when it encounters a tag with namespace. It just lops off the namespace part and produces a field with the remainder. The parser panics when it tries to fit two XML elements with the same name into a single Rust struct field that is of collection type.
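To make the mechanism described above concrete, here is a small standard-library sketch. It is not the code serde-xml-rs actually runs. `local_name` and `collect_fields` are hypothetical names, and the map with one value per key stands in for a struct whose fields each hold a single value. It only illustrates how dropping the prefix makes title and itunes:title land on the same field.

```rust
use std::collections::HashMap;

/// Drop an optional `prefix:` from a qualified XML name (hypothetical helper).
fn local_name(qname: &str) -> &str {
    match qname.split_once(':') {
        Some((_prefix, local)) => local,
        None => qname,
    }
}

/// Store one value per local name, standing in for a struct whose fields
/// each hold a single value. A second element with the same local name
/// is reported as an error rather than overwriting the first.
fn collect_fields<'a>(
    elements: &[(&'a str, &'a str)],
) -> Result<HashMap<&'a str, &'a str>, String> {
    let mut fields = HashMap::new();
    for &(qname, value) in elements {
        let key = local_name(qname);
        if fields.insert(key, value).is_some() {
            return Err(format!("duplicate field `{}`", key));
        }
    }
    Ok(fields)
}
```

Calling `collect_fields(&[("title", "a"), ("itunes:title", "b")])` returns an `Err` naming `title`, while keying the map on the full qualified name instead of `local_name(qname)` would keep the two entries apart.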
This seems like it should be an error rather than a panic. An application could recover from it.
@spease, panic was a poor choice of words. It is in fact an error (for example see https://github.com/RReverser/serde-xml-rs/issues/64#issuecomment-477991946).
I am trying to parse an RSS feed and the part that is relevant looks like this
<link>...</link>
<atom:link href="..." rel="self" type="application/rss+xml"/>
I want to get the value of link.
The thread doesn't seem to have a concrete solution for this but posting anyway in case someone came up with one and just didn't reply.
I tried setting link to a String vector but that doesn't work for me.
Is there perhaps a different workaround for this since the atom:link element does not even have a value unlike link?
Any news on this?