xml_builder
Generating XML values for ASCII characters like newline or cr
Amazon S3 key names containing ASCII control characters such as \r or \n are expected to be encoded in XML data as character references like "& # 13;" or "& # 10;" (spaces added so the references are visible here). Can I get XmlBuilder to generate that? When I call XmlBuilder.generate on character data containing those characters, that is not what happens:
iex(17)> {:person, %{id: 12345}, "Josh\n"} |> XmlBuilder.generate
"<person id=\"12345\">Josh\n</person>"
...where I would like it to generate:
"<person id=\"12345\">Josh&#10;</person>"
I looked in the tests to see if there are any examples of this and I did not see any...
Thanks in advance for any help on this.
Hi @cmarkle,
According to the XML spec:
Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.
Can you explain the use case, what you are trying to accomplish?
@joshnuss Sorry, I should have been more specific about why I want to do this. We are pulling Amazon S3 bucket data, specifically a list of the objects in a bucket, which is returned as XML. If our customer managed to create an object with a '\r' (or other funky ASCII character) in the object key name (which is doable, although not recommended, in S3), then we'd like the accurate name (by "accurate" I mean '\r' not altered/mapped to '\n') so that we can turn around and use that object name in something like a delete request. As it is right now, we can't delete these funkily-named objects.
Here's a specific example of what I am talking about:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>bucket-name-obfuscated</Name>
<Prefix></Prefix>
<NextContinuationToken>1ILUW_obfuscated_YlD9k0Yuf7RxD4ArX1yMUM</NextContinuationToken>
<KeyCount>1000</KeyCount>
<MaxKeys>1000</MaxKeys>
<Delimiter></Delimiter>
<IsTruncated>true</IsTruncated>
<Contents>
<Key>workspaces/51109/packages/ND-7H46rSQ.asp-package/contents/WHALE -> MH/Whale_TTB_DOCUMENTS/Aspera Inboxes/Icon&#13;</Key>
<LastModified>2022-03-25T23:52:22.122Z</LastModified>
<ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag>
<Size>0</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
</ListBucketResult>
Note the "&#13;" ('\r') at the end of the Key.
Thanks for the clarification,
When your parser parses &#13;, does it convert it to \r?
Because a \r passed through this library should be preserved (I believe).
@joshnuss Our parser (currently SweetXml) does convert &#13; to \r, but then maps the single '\r' to a newline '\n', which is its own problem in my case.
Running \r through XmlBuilder does preserve the \r:
iex(13)> {:person, %{id: 12345}, "Josh\n"} |> XmlBuilder.generate
"<person id=\"12345\">Josh\n</person>"
iex(14)> {:person, %{id: 12345}, "Josh\r"} |> XmlBuilder.generate
"<person id=\"12345\">Josh\r</person>"
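To illustrate why a literal \r in the generated output is not enough, here is a minimal sketch of the end-of-line normalization that the XML 1.0 spec (section 2.11) asks parsers to apply to their input. The `LineEnds` module is a hypothetical helper written for this example; it is not part of XmlBuilder or SweetXml, and real parsers may implement the rule differently.

```elixir
defmodule LineEnds do
  # Hypothetical helper for illustration only.
  # Mimics XML 1.0 end-of-line handling: "\r\n" and any lone "\r"
  # are turned into a single "\n" before the document is parsed.
  def normalize(text) when is_binary(text) do
    text
    |> String.replace("\r\n", "\n")
    |> String.replace("\r", "\n")
  end
end

# A literal carriage return is rewritten to a line feed...
literal = LineEnds.normalize("<person id=\"12345\">Josh\r</person>")

# ...while a character reference contains no literal "\r",
# so it passes through this step untouched.
reference = LineEnds.normalize("<person id=\"12345\">Josh&#13;</person>")

IO.inspect({literal, reference})
```

This is why the character reference form matters for key names: it is only resolved to \r after the line-end step has already run.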
AWS guidance on this is documented in "Creating object key names" (https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html); see in particular the section "XML related object key constraints". Duplicating that here:
XML related object key constraints
As specified by the [XML standard on end-of-line handling](https://www.w3.org/TR/REC-xml/#sec-line-ends),
all XML text is normalized such that single carriage returns (ASCII code 13) and carriage returns
immediately followed by a line feed (ASCII code 10) are replaced by a single line feed character. To
ensure the correct parsing of object keys in XML requests, carriage returns and [other special characters
must be replaced with their equivalent XML entity code](https://www.w3.org/TR/xml/#syntax) when they
are inserted within XML tags. The following is a list of such special characters and their equivalent entity
codes:
' as &apos;
” as &quot;
& as &amp;
< as &lt;
> as &gt;
\r as &#13; or &#x0D;
\n as &#10; or &#x0A;

The following example illustrates the use of an XML entity code as a substitution for a carriage return.
This DeleteObjects request deletes an object with the key parameter:
/some/prefix/objectwith\rcarriagereturn (where the \r is the carriage return).
<Delete xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Object>
<Key>/some/prefix/objectwith&#13;carriagereturn</Key>
</Object>
</Delete>
So net-net, if there was a way that XmlBuilder could be motivated to do this for \r and \n like it does for other special characters, that would be great:
iex(15)> {:person, %{id: 12345}, "Josh<>\r"} |> XmlBuilder.generate
"<person id=\"12345\">Josh&lt;&gt;\r</person>"
We could add it to the list of escaped string patterns, see escape_string/1:
https://github.com/joshnuss/xml_builder/blob/5b5ae47116259b426c7602692e75b320ba579486/lib/xml_builder.ex#L403-L409
Do you wanna open a PR? Also, I'm thinking of waiting to merge it until there are more requests for this fix.
Keep in mind, this is a package that many apps depend on, so I hesitate to change the output and cause extra work for anyone.
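For anyone weighing such a change, here is a minimal standalone sketch of escaping that also covers \r and \n, following the substitution list in the AWS guidance quoted above. The `KeyEscaper` module and its `escape/1` function are hypothetical names made up for this example; this is not XmlBuilder's escape_string/1, and the library's actual implementation may be structured quite differently.

```elixir
defmodule KeyEscaper do
  # Hypothetical module for illustration; not XmlBuilder's code.
  # Maps each special character to the entity listed in the AWS
  # "XML related object key constraints" section.
  @entities %{
    "&" => "&amp;",
    "<" => "&lt;",
    ">" => "&gt;",
    "\"" => "&quot;",
    "'" => "&apos;",
    "\r" => "&#13;",
    "\n" => "&#10;"
  }

  # Replaces every special character in a single pass, so the "&"
  # introduced by one replacement is never escaped a second time.
  def escape(text) when is_binary(text) do
    String.replace(text, Map.keys(@entities), &Map.fetch!(@entities, &1))
  end
end

IO.puts(KeyEscaper.escape("Josh<>\r"))
IO.puts(KeyEscaper.escape("/some/prefix/objectwith\rcarriagereturn"))
```

Until something like this is available in the library, a function along these lines could be applied to key names by the caller. Whether XmlBuilder would then escape the resulting "&" again is something to verify against the installed version before relying on it.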