otp (feat) Support CDATA in xml export

trafficstars

#xmlText is spec'd with type = text :: text | cdata yet the type is ignored when the XML doc is exported

Where #xmlText.type = cdata the value for the element will be wrapped as ["<![CDATA[", TextValue, "]]>"]

This allows generating XML which does not convert characters <, > and & to <, > and &

Jul 07 '23 17:07 leonardb

CT Test Results

      2 files     24 suites 8m 33s :stopwatch: 2 162 tests 2 154 :heavy_check_mark:   8 :zzz: 0 :x: 3 980 runs 3 946 :heavy_check_mark: 34 :zzz: 0 :x:

Results for commit a14fd7d2.

:recycle: This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

Jul 07 '23 17:07 github-actions[bot]

Hmm, would you suggest just using lists:flatten/1 and 'walking' the resulting list rather than binary conversion and replace using Unicode functions? It should be safer.

On Wed, Jul 19, 2023, 07:53 Maria Scott @.***> wrote:

@.**** commented on this pull request.

In lib/xmerl/src/xmerl_lib.erl https://github.com/erlang/otp/pull/7484#discussion_r1267964242:

@@ -45,6 +45,9 @@ -include("xmerl_xsd.hrl").

+export_cdata(T) ->

[""].

Hm... I'm pretty sure this doesn't do the right thing.

OTOH, if the argument list T contains unicode data, eg "ħ" (same as [295]), iolist_to_binary will fail.

OTOH, if T contains something like "Ä§" (same as [196, 167]), iolist_to_binary will accept and convert it to <<196, 167>>, but unicode:characters_to_list will turn that into "ħ" ([295]).

— Reply to this email directly, view it on GitHub https://github.com/erlang/otp/pull/7484#discussion_r1267964242, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFKWMEBAY55XN6V54SEF73XQ7DDTANCNFSM6AAAAAA2CDB4ZQ . You are receiving this because you authored the thread.Message ID: @.***>

Jul 19 '23 13:07 leonardb

Hmm, would you suggest just using lists:flatten/1 and 'walking' the resulting list rather than binary conversion and replace using Unicode functions? It should be safer.

Hm, probably, I'm not sure about performance, though :sweat_smile: But it would be a little complicated if what you could get as input was indeed an iolist... even after flattening, you could have a mixture of bytes and binaries, and the ]]> could be split up over them, like [$a, $b, $c, $], <<$]>>, $>, $d, $e, $f]...

Jul 19 '23 16:07 Maria-12648430

What do you think of this?

-define(cdata_end_delim_escape, $], $], $], $], $>, $<, $!, $[, $C, $D, $A, $T, $A, $[, $>).

export_cdata(T) ->
    ["<![CDATA[", escape_cdata(T), "]]>"].

escape_cdata(T) when is_binary(T) ->
	{Escaped, Rest} = escape_cdata_binary(T),
	<< Escaped/binary, Rest/binary >>;
escape_cdata(T) ->
    escape_cdata_iolist(lists:flatten(T)).

escape_cdata_iolist([]) ->
    [];
escape_cdata_iolist([<<>> | More]) ->
    escape_cdata_iolist(More);
escape_cdata_iolist([A, <<>> | More]) ->
    escape_cdata_iolist([A | More]);
escape_cdata_iolist([A, B, <<>> | More]) ->
    escape_cdata_iolist([A, B | More]);
escape_cdata_iolist([$], $], $> | More]) ->
    [?cdata_end_delim_escape | escape_cdata_iolist(More)];
escape_cdata_iolist([$], $], << $>, MoreB/binary >> | More]) ->
    [?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([$], << $] >>, $> | More]) ->
    [?cdata_end_delim_escape | escape_cdata_iolist(More)];
escape_cdata_iolist([$], << $] >>, << $>, MoreB/binary >> | More]) ->
    [?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([$], << $], $>, MoreB/binary >> | More]) ->
    [?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([<< $] >>, $], $> | More]) ->
    [?cdata_end_delim_escape | escape_cdata_iolist(More)];
escape_cdata_iolist([<< $] >>, $], << $>, MoreB/binary >> | More]) ->
    [?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([<< $] >>, << $] >>, $> | More]) ->
    [?cdata_end_delim_escape | escape_cdata_iolist(More)];
escape_cdata_iolist([<< $], $] >>, $> | More]) ->
    [?cdata_end_delim_escape | escape_cdata_iolist(More)];
escape_cdata_iolist([<< $] >>, << $], $>, MoreB/binary>> | More]) ->
    [?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([<< $] >>, << $] >>, << $>, MoreB/binary>> | More]) ->
    [?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([<< $], $] >>, << $>, MoreB/binary>> | More]) ->
    [?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([<< $], $], $>, MoreB/binary>> | More]) ->
    [?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([<< $] >> = Bin | More]) ->
    [Bin | escape_cdata_iolist(More)];
escape_cdata_iolist([<< $], $] >> = Bin | More]) ->
    [Bin | escape_cdata_iolist(More)];
escape_cdata_iolist([Bin0 | More]) when is_binary(Bin0) ->
    {Bin1, Rest} = escape_cdata_binary(Bin0),
    [Bin1 | escape_cdata_iolist([Rest | More])];
escape_cdata_iolist([E | More]) ->
    [E | escape_cdata_iolist(More)].

escape_cdata_binary(Bin) ->
    escape_cdata_binary(Bin, <<>>).

escape_cdata_binary(<<>> = Rest, Acc) ->
    {Acc, Rest};
escape_cdata_binary(<< $] >> = Rest, Acc) ->
    {Acc, Rest};
escape_cdata_binary(<< $], $] >> = Rest, Acc) ->
    {Acc, Rest};
escape_cdata_binary(<< $], $], $>, More/binary >>, Acc) ->
    escape_cdata_binary(More, << Acc/binary, ?cdata_end_delim_escape >>);
escape_cdata_binary(<< C, More/binary >>, Acc) ->
    escape_cdata_binary(More, << Acc/binary, C >>).

It should catch all occurrences of ]]> in the given iolist, no matter if they are in binary elements, list bytes, or distributed over a mix like [ $], << $] >>, $> ]or [ << $x, $] >>, $], << $>, $y >> ], and replace it with ]]]]><![CDATA[>.

Jul 20 '23 07:07 Maria-12648430

~Ah, wait, it has a bug XD~ ~Fixed.~ ~Fixed again.~ Fixed yet again.

Jul 20 '23 08:07 Maria-12648430

~Ah, wait, it has a bug XD~ ~Fixed.~ Fixed again.

It's not unicode safe though.

Compared output for your code export_cadata/1 vs the code in the PR export_cdata2/1

Input:

S = ["some ", [<<"Ä§"/utf8>>, " ", "Ä§", " ħaa", <<" ❤️ "/utf8>>], " ]", [ "]", ["> "]], " other "].

Your code:

io:format("~ts~n",[cdata:export_cdata(S)]).
<![CDATA[some §ÂÃ Ä§ ħaa ¸ï¤â  ]]]]><![CDATA[>  other ]]>

rp(cdata:export_cdata(S)).
["<![CDATA[",
 [115,111,109,101,32,
  <<167,194,132,195>>,
  32,196,167,32,295,97,97,
  <<32,143,184,239,164,157,226,32>>,
  32,93,93,93,93,62,60,33,91,67,68,65,84,65,91,62,32,32,111,
  116,104,101,114,32],
 "]]>"]

PR code:

io:format("~ts~n",[cdata:export_cdata2(S)]).
<![CDATA[some Ä§ Ä§ ħaa ❤️  ]]]]><![CDATA[>  other ]]>

rp(cdata:export_cdata2(S)).
["<![CDATA[",
 [115,111,109,101,32,196,167,32,196,167,32,295,97,97,32,
  10084,65039,32,32,93,93,93,93,62,60,33,91,67,68,65,84,65,91,
  62,32,32,111,116,104,101,114,32],
 "]]>"]

I think we're entering bikeshedding territory.

The existing export_text is not unicode safe
I don't think it's adding sufficient overhead to even worry about it when using string:graphemes to get a unicode safe flattened list of codepoints to parse

Jul 20 '23 12:07 leonardb

It's not unicode safe though.

It would be if there hadn't been yet another bug in my code in escape_cdata_binary :roll_eyes: Should be fixed now, updated the comment yet again, try again if you like.

I think we're entering bikeshedding territory.

What does that mean? (The phrase is unfamiliar to me :sweat_smile:)

Jul 20 '23 13:07 Maria-12648430

Bike-shedding origins: https://en.wikipedia.org/wiki/Law_of_triviality

I'll also add that it seems that most of xmerl is not unicode safe. We cannot even round-trip the string we're testing with :worried:

xmerl_scan:string("<test><![CDATA[some Ä§ Ä§ ħaa ❤️   other ]]></test>").                
=ERROR REPORT==== 20-Jul-2023::09:51:29.621037 ===
2750- fatal: {unexpected_char,{error,{bad_character,295}}}

** exception exit: {fatal,{{unexpected_char,{error,{bad_character,295}}},
                           {file,file_name_unknown},
                           {line,1},
                           {col,26}}}
     in function  xmerl_scan:fatal/2 (xmerl_scan.erl, line 4127)
     in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2608)
     in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2136)
     in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 578)
     in call from xmerl_scan:string/2 (xmerl_scan.erl, line 294)

Jul 20 '23 13:07 leonardb

Bike-shedding origins: https://en.wikipedia.org/wiki/Law_of_triviality

Ah, so there is a name for that :smile: But let's just say that I love to exercise my brain :wink:

I'll also add that it seems that most of xmerl is not unicode safe. We cannot even round-trip the string we're testing with worried

I'm not debating that, xmerl is apparently in a state of neglect and disrepair :cry:

Jul 20 '23 15:07 Maria-12648430

Alas, this is one of the not so prioritized applications that would benefit from some more love. We will at least try to give this PR some attention after the vacation period.

Aug 01 '23 08:08 IngelaAndin

Thanks Ingela,

Was debating whether to just close it or working on adding Unicode support to the whole xmerl app.

If the latter I'd probably lean heavily on the string and Unicode apps. Would that be preferable?

On Tue, Aug 1, 2023, 04:34 Ingela Andin @.***> wrote:

Alas, this is one of the not so prioritized applications that would benefit from some more love. We will at least try to give this PR some attention after the vacation period.

— Reply to this email directly, view it on GitHub https://github.com/erlang/otp/pull/7484#issuecomment-1659833518, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFKWMAKSS6J2JQBLIV432LXTC5RVANCNFSM6AAAAAA2CDB4ZQ . You are receiving this because you authored the thread.Message ID: @.***>

Aug 01 '23 10:08 leonardb

adding Unicode support to the whole xmerl app. > If the latter I'd probably lean heavily on the string and Unicode apps.

IMO, if you are going to tackle unicode for all of xmerl, then I would say yes, by all means, use them 😄 That's what they are there for, and well tested, no need to reinvent the wheel (with bugs 😅).

I was against using them just like that in the context of this PR, ie without the surrounding app being unicode-safe, not because of performance reasons but because it would rely on assumptions about the data flying around that may hold for all actual use cases, but are not guaranteed to.

With xmerl as a whole unicode-safe, I would say unicode and string galore 😉

Aug 02 '23 15:08 Maria-12648430

otp otp copied to clipboard

(feat) Support CDATA in xml export

CT Test Results

Artifacts

@.**** commented on this pull request.

otp
otp copied to clipboard