Implement `Saxy.stream_state/5`
Add a new function, `Saxy.stream_state/5`, that gives callers more control over the handler state while parsing a stream.
The existing function `Saxy.stream_events/2` does not discard events and does not empty the state while parsing. This led to memory and performance issues when we handled files larger than 50 MB. Our current solution combines `Stream.transform/4` with `Saxy.Partial` and controls emission and cleanup manually along the stream.
We propose `Saxy.stream_state/5` as a generic solution. Like `Saxy.parse_stream/4`, it accepts a `Saxy.Handler` module and an initial state. It also requires an emit function that takes the current state and returns the items to emit together with the cleaned-up state, so state emission and cleanup happen while streaming.
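To illustrate the emit contract in isolation, here is a minimal sketch that does not use Saxy at all. `EmitSketch`, its `stream_state/4`, and the `step` function are hypothetical stand-ins introduced only for this illustration: `step` plays the role a handler plays (folding one chunk into the state), and `emit` returns `{items_to_emit, cleaned_state}` after each chunk, in the same shape as the emit function in the proposed example. It is not how the PR implements the function.

```elixir
defmodule EmitSketch do
  # Hypothetical stand-in for the proposed function, NOT Saxy's implementation.
  # `step` folds one chunk into the state (the role a Saxy.Handler plays).
  # `emit` returns {items_to_emit, cleaned_state} after every chunk.
  def stream_state(chunks, initial_state, step, emit) do
    Stream.transform(chunks, initial_state, fn chunk, state ->
      chunk
      |> step.(state)
      |> emit.()
    end)
  end
end

# Each chunk "parses" into one item that is accumulated under :parsed.
step = fn chunk, state ->
  %{state | parsed: state.parsed ++ [String.upcase(chunk)]}
end

# Emit everything accumulated so far and reset the accumulator,
# so the state never grows beyond one chunk's worth of items.
emit = fn %{parsed: parsed} = state -> {parsed, %{state | parsed: []}} end

result =
  ["a", "b", "c"]
  |> EmitSketch.stream_state(%{parsed: [], current: nil}, step, emit)
  |> Enum.to_list()
```

Because the accumulator is reset after every chunk, downstream consumers receive items as soon as they are available and the state held between chunks stays small, which is the behavior we need for large files.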
Example of our current manual solution for comparison
defmodule CustomXMLParser do
  def parse_to_stream(xml_stream) do
    Stream.transform(xml_stream, &new_partial/0, &emit_elements/2, &close_stream/1)
  end

  defp new_partial() do
    new_state = %{parsed: [], current: nil}
    {:ok, partial} = Saxy.Partial.new(CustomXMLParser.Handler, new_state)
    partial
  end

  defp emit_elements(_, {:stop, partial}), do: {:halt, partial}

  defp emit_elements(xml, partial) do
    with state <- cleanup_previously_emitted(partial),
         {:cont, partial} <- Saxy.Partial.parse(partial, xml, state),
         emitted <- get_parsed(partial) do
      {emitted, partial}
    else
      {:error, exception} ->
        emitted = [
          {:error, {:parse_error, Saxy.ParseError.message(exception)}}
        ]

        {emitted, {:stop, partial}}
    end
  end

  defp close_stream({:stop, partial}), do: Saxy.Partial.terminate(partial)
  defp close_stream(partial), do: Saxy.Partial.terminate(partial)

  defp cleanup_previously_emitted(partial), do: %{Saxy.Partial.get_state(partial) | parsed: []}
  defp get_parsed(partial), do: Saxy.Partial.get_state(partial)[:parsed]
end
Example of proposed solution
defmodule CustomXMLParser do
  def parse_to_stream(xml_stream) do
    Saxy.stream_state(
      xml_stream,
      CustomXMLParser.Handler,
      %{parsed: [], current: nil},
      fn %{parsed: parsed} = state -> {parsed, Map.put(state, :parsed, [])} end
    )
  end
end
@tanguykurylo thanks for the PR 🙏. I love the idea and will take a deeper look at it later this week.