xml_builder icon indicating copy to clipboard operation
xml_builder copied to clipboard

Generating XML values for ASCII characters like newline or cr

Open cmarkle opened this issue 2 years ago • 5 comments

Amazon s3 key names with ASCII chars like \n or \r are expected to be mapped in XML data to like "& # 13; " or "& # 10 ;" (added spaces to ensure all would be seen here) or the like. Can I motivate XmlBuilder to generate that? When I use XmlBuilder.generate with character data with those special characters in it, it is not happening like this:

iex(17)> {:person, %{id: 12345}, "Josh\n"} |> XmlBuilder.generate
"<person id=\"12345\">Josh\n</person>"

...where I would like it to generate:

"<person id=\"12345\">Josh&#10;</person>"

I looked in the tests to see if there are any examples of this and I did not see any...

Thanks in advance for any help on this.

cmarkle avatar May 04 '23 18:05 cmarkle

Hi @cmarkle,

According to the XML spec:

Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.

Can you explain the use case, what you are trying to accomplish?

joshnuss avatar May 06 '23 09:05 joshnuss

@joshnuss Sorry should have been more specific about why I want to do this. We are pulling Amazon S3 bucket data, specifically a list of objects in the bucket, which is returned as XML. If our customer managed to make an object in with a '\r' (or other funky ASCII character) in the object key name (which is do-able, although not recommended, in S3), then we'd like the accurate name (by "accurate" I mean '\r' not altered/mapped to '\n') so that we can turn around and use that object name is something like a delete request. As it is right now we can't delete these funkily-named objects.

Here's a specific example of what I am talking about:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>bucket-name-obfuscated</Name>
  <Prefix></Prefix>
  <NextContinuationToken>1ILUW_obfuscated_YlD9k0Yuf7RxD4ArX1yMUM</NextContinuationToken>
  <KeyCount>1000</KeyCount>
  <MaxKeys>1000</MaxKeys>
  <Delimiter></Delimiter>
  <IsTruncated>true</IsTruncated>
  <Contents>
    <Key>workspaces/51109/packages/ND-7H46rSQ.asp-package/contents/WHALE -&gt; MH/Whale_TTB_DOCUMENTS/Aspera Inboxes/Icon&#13;</Key>
    <LastModified>2022-03-25T23:52:22.122Z</LastModified>
    <ETag>&quot;d41d8cd98f00b204e9800998ecf8427e&quot;</ETag>
    <Size>0</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>

Note the " " ('\r') at the end of the Key.

cmarkle avatar May 06 '23 16:05 cmarkle

Thanks for the clarification,

When your parser parses &#13;, does it convert it to \r?

Because running \r through this library should be preserved (I believe)

joshnuss avatar May 06 '23 22:05 joshnuss

@joshnuss Our parser (which right now is SweetXml) does convert &#13; to \r, but then maps the single \r' to newline \n', which is its own problem in my case.

Running \r through XmlBuilder does preserve the \r:

iex(13)> {:person, %{id: 12345}, "Josh\n"} |> XmlBuilder.generate
"<person id=\"12345\">Josh\n</person>"
iex(14)> {:person, %{id: 12345}, "Josh\r"} |> XmlBuilder.generate 
"<person id=\"12345\">Josh\r</person>"

AWS guidance on this is documented in "Creating object key names (https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html) - see in particular the section "XML related object key constraints". Duplicating that here:

XML related object key constraints
As specified by the [XML standard on end-of-line handling](https://www.w3.org/TR/REC-xml/#sec-line-ends), 
all XML text is normalized such that single carriage returns (ASCII code 13) and carriage returns 
immediately followed by a line feed (ASCII code 10) are replaced by a single line feed character. To 
ensure the correct parsing of object keys in XML requests, carriage returns and [other special characters 
must be replaced with their equivalent XML entity code](https://www.w3.org/TR/xml/#syntax) when they 
are inserted within XML tags. The following is a list of such special characters and their equivalent entity 
codes:

' as &apos;
” as &quot;
& as &amp;
< as &lt;
> as &gt;
\r as &#13; or &#x0D;
\n as &#10; or &#x0A;

The following example illustrates the use of an XML entity code as a substitution for a carriage return. 
This DeleteObjects request deletes an object with the key parameter: 
/some/prefix/objectwith\rcarriagereturn (where the \r is the carriage return).

<Delete xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Object>
    <Key>/some/prefix/objectwith&#13;carriagereturn</Key>
  </Object>
</Delete>

So net-net, if there was a way that XmlBuilder could be motivated to do this for \r and \n like it does for other special characters, that would be great:

iex(15)> {:person, %{id: 12345}, "Josh<>\r"} |> XmlBuilder.generate
"<person id=\"12345\">Josh&lt;&gt;\r</person>"

cmarkle avatar May 07 '23 16:05 cmarkle

We could add it to the list of escaped string patterns, see escape_string/1: https://github.com/joshnuss/xml_builder/blob/5b5ae47116259b426c7602692e75b320ba579486/lib/xml_builder.ex#L403-L409

Do you wanna open a PR? Also, I'm thinking to wait on merging it until there are more requests about this fix.

Keep in mind, this is a package that many apps are dependent on, so I hesitate to change the output and cause extra working for anyone.

joshnuss avatar May 07 '23 20:05 joshnuss