
How to replace one value with another?

Open tomekowal opened this issue 5 years ago • 7 comments

Hey! There is no documentation and we would like to try it. Our use case is that we want to modify elements based on their contents. For example, reverse the contents of tag/subtag:

<tag>
  <subtag>asdf</subtag>
  <subtag>qwer</subtag>
  <subtag>asdf</subtag>
</tag>

should become

<tag>
  <subtag>fdsa</subtag>
  <subtag>rewq</subtag>
  <subtag>fdsa</subtag>
</tag>

tomekowal, Oct 22 '19 14:10

@tomekowal, thanks for the issue and the interest. I've been a bit busy, but I hope to get to this next week.

I'll make the project "rebar-able" and add some documentation/examples for handling individual nodes and transforming document streams.

zadean, Oct 27 '19 12:10

@tomekowal, before writing a bunch of docs that will need to be changed, I'd like to see if this makes sense to you:

It's a more "procedural" example for ease of following the flow. Mind you, the API isn't near final, but the parser will always be similar to an iterator, and there will be a writer as well as a reader. So that won't change. Maybe just the names. :-)

run() ->
    Input = <<"<tag>\n  <subtag>asdf</subtag>\n  <subtag>qwer</subtag>\n  "
              "<subtag>asdf</subtag>\n</tag>">>,
    State = stax:stream(Input, [{whitespace, false}]),
    % fake it for now until there is a serialization API
    OutState = {<<>>, #{}},

    % read and assert the startDocument event, write it out
    {#{type := startDocument} = E1, State1} = stax:next_event(State),
    OutState1 = stax:write_event(E1, OutState),

    % read and assert the startElement event for the "tag" tag, write it out
    {#{type  := startElement,
       qname := {<<>>, <<>>, <<"tag">>}} = E2, State2} = stax:next_event(State1),
    OutState2 = stax:write_event(E2, OutState1),

    {State3, OutState3} = reverse_subtag(State2, OutState2),
    {State4, OutState4} = reverse_subtag(State3, OutState3),
    {State5, OutState5} = reverse_subtag(State4, OutState4),

    % read and assert the endElement event for the "tag" tag, write it out
    {#{type  := endElement,
       qname := {<<>>, <<>>, <<"tag">>}} = E3, State6} = stax:next_event(State5),
    OutState6 = stax:write_event(E3, OutState5),

    % read and assert the endDocument event, write it out
    {#{type := endDocument} = E4, _State7} = stax:next_event(State6),
    {Output, _} = stax:write_event(E4, OutState6),

    Output.

reverse_subtag(State, OutState) ->
    case stax:next_event(State) of
        % the 'subtag' opening tag
        {#{type := startElement} = E1, State1} ->
            OutState1 = stax:write_event(E1, OutState),
            reverse_subtag(State1, OutState1);
        % the text to change
        {#{type := characters,
           data := Sub} = E1, State1} ->
            OutState1 = stax:write_event(E1#{data := do_flip(Sub)}, OutState),
            reverse_subtag(State1, OutState1);
        % the 'subtag' closing tag, so return
        {#{type := endElement} = E1, State1} ->
            OutState1 = stax:write_event(E1, OutState),
            {State1, OutState1}
    end.

do_flip(Text) ->
    Chs = [T || <<T/utf8>> <= Text],
    Rev = lists:reverse(Chs),
    << <<C/utf8>> || C <- Rev >>.

zadean, Nov 03 '19 14:11

Seems clear. I just realised that there is no Enum.reduce in Erlang, only foldl and foldr on lists, so the recursive bits need to be written by hand. Also, I think you can use string:reverse because it correctly groups things into grapheme clusters, but still returns io data (but that is outside the scope of this discussion :))
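
A minimal sketch of the string:reverse suggestion, assuming the text arrives as a UTF-8 binary as it does in do_flip/1 above. The module and function names are made up for illustration. string:reverse/1 returns a list of grapheme clusters rather than a binary, so the result is converted back with unicode:characters_to_binary/1:

```erlang
-module(grapheme_flip).
-export([flip/1]).

%% Reverse a UTF-8 binary by grapheme cluster instead of by code point,
%% so a base character and its combining marks stay together.
flip(Text) when is_binary(Text) ->
    %% string:reverse/1 accepts chardata and returns a (possibly nested)
    %% list of grapheme clusters.
    Reversed = string:reverse(Text),
    %% Flatten the chardata back into a UTF-8 binary.
    unicode:characters_to_binary(Reversed).
```

For plain ASCII input this behaves like do_flip/1, so grapheme_flip:flip(<<"asdf">>) gives <<"fdsa">>. The difference only shows up with input such as a letter followed by a combining accent, where reversing code point by code point would detach the accent from its letter.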

tomekowal, Nov 04 '19 08:11

Yeah... string:reverse doh! :-) Since I have no experience with Elixir, it would be interesting to see what the same example would look like in it. Also, is the return type of the stax:next_event call, {Event, State}, easy enough to work with, or should it be changed to something else?

zadean, Nov 04 '19 12:11

Hey, I made an example Elixir application that uses yaccety_sax: https://github.com/tomekowal/yaccety_sax_test/blob/master/test/yaccety_sax_test_test.exs

All the exciting stuff is in the test file. The first test is what you pasted above, rewritten in Elixir. The second one is an example of using Elixir streams and Enum.reduce to work with it. The third one is the reversing example again, but using streams.

As you can see, the {Event, State} return value is perfect because stream generators expect exactly that format: {CurrentElement, StateToBuildNextElement}.
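
On the Erlang side, the same {Event, State} shape makes it easy to write the missing reduce by hand once and reuse it. The following is only a sketch: fold/4 and the list-backed next/1 are hypothetical helpers, not part of yaccety_sax, and next/1 stands in for stax:next_event/1 so that the example runs on its own. The event maps use only the type and data keys seen in the example above.

```erlang
-module(event_fold).
-export([fold/4, demo/0]).

%% Fold Fun over every event produced by Next, threading the reader state.
%% Next(State) must return {Event, NewState}. The endDocument event is
%% passed to Fun as well, and then the fold stops.
fold(Fun, Acc, Next, State) ->
    case Next(State) of
        {#{type := endDocument} = Event, _FinalState} ->
            Fun(Event, Acc);
        {Event, State1} ->
            fold(Fun, Fun(Event, Acc), Next, State1)
    end.

%% Stand-in reader (assumption): the "state" is the list of remaining events.
next([Event | Rest]) ->
    {Event, Rest}.

demo() ->
    Events = [#{type => startDocument},
              #{type => startElement},
              #{type => characters, data => <<"asdf">>},
              #{type => endElement},
              #{type => startElement},
              #{type => characters, data => <<"qwer">>},
              #{type => endElement},
              #{type => endDocument}],
    %% Collect the text of every characters event.
    Collect = fun(#{type := characters, data := Data}, Acc) -> [Data | Acc];
                 (_Other, Acc) -> Acc
              end,
    %% The accumulator is built in reverse, so restore document order.
    lists:reverse(fold(Collect, [], fun next/1, Events)).
```

With a real parser, the fourth argument would be the state returned by stax:stream/2 and the third would be fun stax:next_event/1, assuming those names survive in the final API.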

tomekowal, Nov 10 '19 21:11

Cool! And great that the output format fits so well!

Time permitting, I'll try to finish the rest of the implementation (DTD, default attributes stuff, external references and entities, etc.).

Also documentation. :-)

How was the performance? Low memory footprint? Fast enough?

zadean, Nov 15 '19 17:11

Unfortunately, I didn't test it on anything more significant than that toy example. We don't have that many big XML files, anyway. For now, we settled on using :xmerl in our project. We will watch closely how this repo evolves :)

tomekowal, Nov 15 '19 21:11