otp
otp copied to clipboard
(feat) Support CDATA in xml export
#xmlText is spec'd with type = text :: text | cdata yet the type is ignored when the XML doc is exported
Where #xmlText.type = cdata the value for the element will be wrapped as ["<![CDATA[", TextValue, "]]>"]
This allows generating XML which does not convert characters <, > and & to <, > and &
CT Test Results
2 files 24 suites 8m 33s :stopwatch: 2 162 tests 2 154 :heavy_check_mark: 8 :zzz: 0 :x: 3 980 runs 3 946 :heavy_check_mark: 34 :zzz: 0 :x:
Results for commit a14fd7d2.
:recycle: This comment has been updated with latest results.
To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.
See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.
Artifacts
- Complete CT logs (Download Logs)
- HTML Documentation (Download HTML Docs)
- No Windows Installer found
// Erlang/OTP Github Action Bot
Hmm, would you suggest just using lists:flatten/1 and 'walking' the resulting list rather than binary conversion and replace using Unicode functions? It should be safer.
On Wed, Jul 19, 2023, 07:53 Maria Scott @.***> wrote:
@.**** commented on this pull request.
In lib/xmerl/src/xmerl_lib.erl https://github.com/erlang/otp/pull/7484#discussion_r1267964242:
@@ -45,6 +45,9 @@ -include("xmerl_xsd.hrl").
+export_cdata(T) ->
- [""].
Hm... I'm pretty sure this doesn't do the right thing.
OTOH, if the argument list T contains unicode data, eg "ħ" (same as [295]), iolist_to_binary will fail.
OTOH, if T contains something like "ħ" (same as [196, 167]), iolist_to_binary will accept and convert it to <<196, 167>>, but unicode:characters_to_list will turn that into "ħ" ([295]).
— Reply to this email directly, view it on GitHub https://github.com/erlang/otp/pull/7484#discussion_r1267964242, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFKWMEBAY55XN6V54SEF73XQ7DDTANCNFSM6AAAAAA2CDB4ZQ . You are receiving this because you authored the thread.Message ID: @.***>
Hmm, would you suggest just using lists:flatten/1 and 'walking' the resulting list rather than binary conversion and replace using Unicode functions? It should be safer.
Hm, probably, I'm not sure about performance, though :sweat_smile: But it would be a little complicated if what you could get as input was indeed an iolist... even after flattening, you could have a mixture of bytes and binaries, and the ]]> could be split up over them, like [$a, $b, $c, $], <<$]>>, $>, $d, $e, $f]...
What do you think of this?
-define(cdata_end_delim_escape, $], $], $], $], $>, $<, $!, $[, $C, $D, $A, $T, $A, $[, $>).
export_cdata(T) ->
["<![CDATA[", escape_cdata(T), "]]>"].
escape_cdata(T) when is_binary(T) ->
{Escaped, Rest} = escape_cdata_binary(T),
<< Escaped/binary, Rest/binary >>;
escape_cdata(T) ->
escape_cdata_iolist(lists:flatten(T)).
escape_cdata_iolist([]) ->
[];
escape_cdata_iolist([<<>> | More]) ->
escape_cdata_iolist(More);
escape_cdata_iolist([A, <<>> | More]) ->
escape_cdata_iolist([A | More]);
escape_cdata_iolist([A, B, <<>> | More]) ->
escape_cdata_iolist([A, B | More]);
escape_cdata_iolist([$], $], $> | More]) ->
[?cdata_end_delim_escape | escape_cdata_iolist(More)];
escape_cdata_iolist([$], $], << $>, MoreB/binary >> | More]) ->
[?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([$], << $] >>, $> | More]) ->
[?cdata_end_delim_escape | escape_cdata_iolist(More)];
escape_cdata_iolist([$], << $] >>, << $>, MoreB/binary >> | More]) ->
[?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([$], << $], $>, MoreB/binary >> | More]) ->
[?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([<< $] >>, $], $> | More]) ->
[?cdata_end_delim_escape | escape_cdata_iolist(More)];
escape_cdata_iolist([<< $] >>, $], << $>, MoreB/binary >> | More]) ->
[?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([<< $] >>, << $] >>, $> | More]) ->
[?cdata_end_delim_escape | escape_cdata_iolist(More)];
escape_cdata_iolist([<< $], $] >>, $> | More]) ->
[?cdata_end_delim_escape | escape_cdata_iolist(More)];
escape_cdata_iolist([<< $] >>, << $], $>, MoreB/binary>> | More]) ->
[?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([<< $] >>, << $] >>, << $>, MoreB/binary>> | More]) ->
[?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([<< $], $] >>, << $>, MoreB/binary>> | More]) ->
[?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([<< $], $], $>, MoreB/binary>> | More]) ->
[?cdata_end_delim_escape | escape_cdata_iolist([MoreB | More])];
escape_cdata_iolist([<< $] >> = Bin | More]) ->
[Bin | escape_cdata_iolist(More)];
escape_cdata_iolist([<< $], $] >> = Bin | More]) ->
[Bin | escape_cdata_iolist(More)];
escape_cdata_iolist([Bin0 | More]) when is_binary(Bin0) ->
{Bin1, Rest} = escape_cdata_binary(Bin0),
[Bin1 | escape_cdata_iolist([Rest | More])];
escape_cdata_iolist([E | More]) ->
[E | escape_cdata_iolist(More)].
escape_cdata_binary(Bin) ->
escape_cdata_binary(Bin, <<>>).
escape_cdata_binary(<<>> = Rest, Acc) ->
{Acc, Rest};
escape_cdata_binary(<< $] >> = Rest, Acc) ->
{Acc, Rest};
escape_cdata_binary(<< $], $] >> = Rest, Acc) ->
{Acc, Rest};
escape_cdata_binary(<< $], $], $>, More/binary >>, Acc) ->
escape_cdata_binary(More, << Acc/binary, ?cdata_end_delim_escape >>);
escape_cdata_binary(<< C, More/binary >>, Acc) ->
escape_cdata_binary(More, << Acc/binary, C >>).
It should catch all occurrences of ]]> in the given iolist, no matter if they are in binary elements, list bytes, or distributed over a mix like [ $], << $] >>, $> ]or [ << $x, $] >>, $], << $>, $y >> ], and replace it with ]]]]><![CDATA[>.
~Ah, wait, it has a bug XD~ ~Fixed.~ ~Fixed again.~ Fixed yet again.
~Ah, wait, it has a bug XD~ ~Fixed.~ Fixed again.
It's not unicode safe though.
Compared output for your code export_cadata/1 vs the code in the PR export_cdata2/1
Input:
S = ["some ", [<<"ħ"/utf8>>, " ", "ħ", " ħaa", <<" ❤️ "/utf8>>], " ]", [ "]", ["> "]], " other "].
Your code:
io:format("~ts~n",[cdata:export_cdata(S)]).
<![CDATA[some §Âà ħ ħaa ¸ï¤â ]]]]><![CDATA[> other ]]>
rp(cdata:export_cdata(S)).
["<![CDATA[",
[115,111,109,101,32,
<<167,194,132,195>>,
32,196,167,32,295,97,97,
<<32,143,184,239,164,157,226,32>>,
32,93,93,93,93,62,60,33,91,67,68,65,84,65,91,62,32,32,111,
116,104,101,114,32],
"]]>"]
PR code:
io:format("~ts~n",[cdata:export_cdata2(S)]).
<![CDATA[some ħ ħ ħaa ❤️ ]]]]><![CDATA[> other ]]>
rp(cdata:export_cdata2(S)).
["<![CDATA[",
[115,111,109,101,32,196,167,32,196,167,32,295,97,97,32,
10084,65039,32,32,93,93,93,93,62,60,33,91,67,68,65,84,65,91,
62,32,32,111,116,104,101,114,32],
"]]>"]
I think we're entering bikeshedding territory.
- The existing export_text is not unicode safe
- I don't think it's adding sufficient overhead to even worry about it when using
string:graphemesto get a unicode safe flattened list of codepoints to parse
It's not unicode safe though.
It would be if there hadn't been yet another bug in my code in escape_cdata_binary :roll_eyes: Should be fixed now, updated the comment yet again, try again if you like.
I think we're entering bikeshedding territory.
What does that mean? (The phrase is unfamiliar to me :sweat_smile:)
Bike-shedding origins: https://en.wikipedia.org/wiki/Law_of_triviality
I'll also add that it seems that most of xmerl is not unicode safe. We cannot even round-trip the string we're testing with :worried:
xmerl_scan:string("<test><![CDATA[some ħ ħ ħaa ❤️ other ]]></test>").
=ERROR REPORT==== 20-Jul-2023::09:51:29.621037 ===
2750- fatal: {unexpected_char,{error,{bad_character,295}}}
** exception exit: {fatal,{{unexpected_char,{error,{bad_character,295}}},
{file,file_name_unknown},
{line,1},
{col,26}}}
in function xmerl_scan:fatal/2 (xmerl_scan.erl, line 4127)
in call from xmerl_scan:scan_content/11 (xmerl_scan.erl, line 2608)
in call from xmerl_scan:scan_element/12 (xmerl_scan.erl, line 2136)
in call from xmerl_scan:scan_document/2 (xmerl_scan.erl, line 578)
in call from xmerl_scan:string/2 (xmerl_scan.erl, line 294)
Bike-shedding origins: https://en.wikipedia.org/wiki/Law_of_triviality
Ah, so there is a name for that :smile: But let's just say that I love to exercise my brain :wink:
I'll also add that it seems that most of xmerl is not unicode safe. We cannot even round-trip the string we're testing with worried
I'm not debating that, xmerl is apparently in a state of neglect and disrepair :cry:
Alas, this is one of the not so prioritized applications that would benefit from some more love. We will at least try to give this PR some attention after the vacation period.
Thanks Ingela,
Was debating whether to just close it or working on adding Unicode support to the whole xmerl app.
If the latter I'd probably lean heavily on the string and Unicode apps. Would that be preferable?
On Tue, Aug 1, 2023, 04:34 Ingela Andin @.***> wrote:
Alas, this is one of the not so prioritized applications that would benefit from some more love. We will at least try to give this PR some attention after the vacation period.
— Reply to this email directly, view it on GitHub https://github.com/erlang/otp/pull/7484#issuecomment-1659833518, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFKWMAKSS6J2JQBLIV432LXTC5RVANCNFSM6AAAAAA2CDB4ZQ . You are receiving this because you authored the thread.Message ID: @.***>
adding Unicode support to the whole xmerl app. > If the latter I'd probably lean heavily on the string and Unicode apps.
IMO, if you are going to tackle unicode for all of xmerl, then I would say yes, by all means, use them 😄 That's what they are there for, and well tested, no need to reinvent the wheel (with bugs 😅).
I was against using them just like that in the context of this PR, ie without the surrounding app being unicode-safe, not because of performance reasons but because it would rely on assumptions about the data flying around that may hold for all actual use cases, but are not guaranteed to.
With xmerl as a whole unicode-safe, I would say unicode and string galore 😉