
Implement `Saxy.stream_state/5`

Open • tanguykurylo opened this issue 1 year ago • 1 comment

Add a new function Saxy.stream_state/5 to have more flexibility over parsing.

The existing function `Saxy.stream_events/2` neither discards events nor empties the state while parsing. This led to memory and performance issues when we handled files larger than 50 MB. Our current solution combines `Stream.transform` with `Saxy.Partial` and controls emission and cleanup manually along the stream.

We propose `Saxy.stream_state/5` as a generic solution. As with `Saxy.parse_stream/4`, it accepts a `Saxy.Handler` and an initial state. It also requires an emit function that controls state emission and cleanup while streaming: it receives the current state and returns a tuple of the items to emit and the state to carry forward.
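To make the emit function's role concrete, here is a minimal sketch that does not use Saxy at all. `EmitSketch.stream_state/4` and the `step` function are hypothetical stand-ins (for the proposed function and for a handler, respectively), and the sketch assumes the emit function is applied once after each chunk, which this issue does not specify.

```elixir
defmodule EmitSketch do
  # Hypothetical stand-in for the proposed function, NOT the Saxy API:
  # fold each chunk into the state with `step`, then let `emit` decide
  # what to emit downstream and which state to keep.
  def stream_state(chunks, initial_state, step, emit) do
    Stream.transform(chunks, initial_state, fn chunk, state ->
      state = step.(chunk, state)
      # `emit` returns {items_to_emit, state_to_carry_forward}
      emit.(state)
    end)
  end
end

# Stand-in for a handler: pretend every chunk yields one parsed item.
step = fn chunk, state ->
  %{state | parsed: state.parsed ++ [String.upcase(chunk)]}
end

# Same shape as the emit function proposed in this issue:
# emit everything parsed so far, then reset the buffer.
emit = fn %{parsed: parsed} = state -> {parsed, Map.put(state, :parsed, [])} end

result =
  ["a", "b", "c"]
  |> EmitSketch.stream_state(%{parsed: [], current: nil}, step, emit)
  |> Enum.to_list()
```

Because the buffer is reset on every call, the state never holds more than the items produced by the latest chunk, which is the memory behaviour the manual solution is trying to achieve.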

Example of our current manual solution for comparison

```elixir
defmodule CustomXMLParser do
  def parse_to_stream(xml_stream) do
    Stream.transform(xml_stream, &new_partial/0, &emit_elements/2, &close_stream/1)
  end

  defp new_partial() do
    new_state = %{parsed: [], current: nil}
    {:ok, partial} = Saxy.Partial.new(CustomXMLParser.Handler, new_state)
    partial
  end

  defp emit_elements(_, {:stop, partial}), do: {:halt, partial}

  defp emit_elements(xml, partial) do
    with state <- cleanup_previously_emitted(partial),
         {:cont, partial} <- Saxy.Partial.parse(partial, xml, state),
         emitted <- get_parsed(partial) do
      {emitted, partial}
    else
      {:error, exception} ->
        emitted = [
          {:error, {:parse_error, Saxy.ParseError.message(exception)}}
        ]

        {emitted, {:stop, partial}}
    end
  end

  defp close_stream({:stop, partial}), do: Saxy.Partial.terminate(partial)
  defp close_stream(partial), do: Saxy.Partial.terminate(partial)

  defp cleanup_previously_emitted(partial), do: %{Saxy.Partial.get_state(partial) | parsed: []}
  defp get_parsed(partial), do: Saxy.Partial.get_state(partial)[:parsed]
end
```

Example of proposed solution

```elixir
defmodule CustomXMLParser do
  def parse_to_stream(xml_stream) do
    Saxy.stream_state(
      xml_stream,
      CustomXMLParser.Handler,
      %{parsed: [], current: nil},
      fn %{parsed: parsed} = state -> {parsed, Map.put(state, :parsed, [])} end
    )
  end
end
```
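Both examples rely on `CustomXMLParser.Handler`, which is not shown in this issue. As a rough illustration of how a handler could fill the `parsed` and `current` fields, here is a sketch of a hypothetical `ItemHandler` that collects the text of `<item>` elements. The module name and the element name are assumptions; only the `handle_event/3` callback shape comes from the `Saxy.Handler` behaviour. The callbacks are driven by hand so the snippet runs without Saxy installed.

```elixir
defmodule ItemHandler do
  # Hypothetical handler, not the one used in this issue.
  # In a project with Saxy installed you would add `@behaviour Saxy.Handler`.
  # state.current holds the text of the <item> being read (nil outside one);
  # state.parsed accumulates finished items until the emit function clears it.

  def handle_event(:start_element, {"item", _attributes}, state) do
    {:ok, %{state | current: ""}}
  end

  def handle_event(:characters, chars, %{current: current} = state)
      when is_binary(current) do
    {:ok, %{state | current: current <> chars}}
  end

  def handle_event(:end_element, "item", %{current: current} = state)
      when is_binary(current) do
    {:ok, %{state | parsed: state.parsed ++ [current], current: nil}}
  end

  # Everything else (document start/end, other elements, text outside
  # an <item>) leaves the state unchanged.
  def handle_event(_event, _data, state), do: {:ok, state}
end

# Drive the callbacks by hand, in the order a SAX parser would deliver
# them for <items><item>first</item></items>.
events = [
  {:start_document, []},
  {:start_element, {"items", []}},
  {:start_element, {"item", []}},
  {:characters, "first"},
  {:end_element, "item"},
  {:end_element, "items"},
  {:end_document, {}}
]

final_state =
  Enum.reduce(events, %{parsed: [], current: nil}, fn {event, data}, state ->
    {:ok, state} = ItemHandler.handle_event(event, data, state)
    state
  end)
```

With a handler of this shape, the emit function from the proposed solution would drain `parsed` after each chunk while leaving `current` intact, so an element split across two chunks is still assembled correctly.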

tanguykurylo, Jul 26 '24 13:07

@tanguykurylo thanks for the PR 🙏. I love the idea and will take a deeper look into the PR later this week.

qcam, Oct 22 '24 13:10