
Excessive reparsing of `ambr` files in tests with large snapshots

Open StephanMeijer opened this issue 2 years ago • 5 comments

Describe the bug

We are encountering a problem in our testing process where we handle very large DOCX/XML files. Our test setup involves using the syrupy library to create snapshots. Here's a snippet of our test code:

import lxml.etree
import pytest
# `snapshot` is provided as a pytest fixture by the syrupy plugin; no import needed

from nldoc_p4_docx.docx_processing.docx.document import Document
from tests.conftest import all_fixture_paths, matcher


@pytest.mark.parametrize('xml', all_fixture_paths('docx/documents'), indirect=True)
def test_loading_documents(snapshot, xml: lxml.etree.Element):
    assert Document(xml) == snapshot(matcher=matcher)

While generating large snapshot files is not a problem in itself, we are facing an issue where the entire `.ambr` file is reparsed on every test, even though each test only needs the snapshot for its own fixture.
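The scaling concern can be sketched with a toy cost model (illustrative numbers only; it assumes parse time is linear in file size, and is not a claim about syrupy's internals):

```python
# Toy cost model: one test has a huge snapshot, the rest are small.
fixture_sizes = [1_000_000, 1_000, 1_000, 1_000]  # hypothetical snapshot sizes in bytes
n_tests = len(fixture_sizes)

# Shared .ambr file: every test reparses the whole file.
shared_work = sum(fixture_sizes) * n_tests

# One file per snapshot: each test parses only its own file.
split_work = sum(fixture_sizes)

print(shared_work // split_work)  # prints 4: the slowdown grows with the test count
```

With one shared file, the large fixture taxes every test in the module; with one file per snapshot, the cost stays confined to the test that owns it.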

To reproduce

Run a test that uses large fixtures, which results in the creation of large snapshot files.

Expected behavior

We expect that testing a large fixture would slow down only that specific test, rather than every test that shares the same snapshot file. A potential solution might be a setting that writes each named snapshot to its own file.

Environment (please complete the following information):

  • OS: macOS, Darwin 23.1.0
  • Syrupy Version: 4.6.0
  • Python Version: 3.11.6

Additional context

Our repository is available at: https://gitlab.com/toegang-voor-iedereen/conversion/preprocessing/docx-preprocessor

StephanMeijer avatar Nov 22 '23 16:11 StephanMeijer

I saw a huge performance increase after splitting my tests into separate files, but it still seems to take syrupy a long time to parse.

StephanMeijer avatar Nov 22 '23 16:11 StephanMeijer

As also mentioned in #119, with the following extension I reduced the time my tests took from around 200 seconds to around 5:

from syrupy.extensions.single_file import SingleFileSnapshotExtension
from syrupy.extensions.amber.serializer import AmberDataSerializer


class SingleFileAmberSnapshotExtension(SingleFileSnapshotExtension):
    def serialize(self, data, **kwargs):
        # SingleFileSnapshotExtension writes in binary mode by default,
        # so the serialized text must be encoded to bytes.
        return AmberDataSerializer.serialize(data, **kwargs).encode()

StephanMeijer avatar Nov 22 '23 16:11 StephanMeijer

Got things working perfectly with the following snippet:

from syrupy.extensions.single_file import SingleFileSnapshotExtension, WriteMode
from syrupy.extensions.amber.serializer import AmberDataSerializer


class SingleFileAmberSnapshotExtension(SingleFileSnapshotExtension):
    # Keep the .ambr extension, and write as text so no .encode() is needed.
    _file_extension = 'ambr'
    _write_mode = WriteMode.TEXT

    def serialize(self, data, **kwargs):
        return AmberDataSerializer.serialize(data, **kwargs)
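For reference, one way to apply such an extension suite-wide is to override the `snapshot` fixture in `conftest.py` via syrupy's `use_extension`; this is a sketch assuming the extension class above, not the library's documented recipe:

```python
# conftest.py (sketch): route every test's snapshots through the
# single-file amber extension.
import pytest
from syrupy.extensions.single_file import SingleFileSnapshotExtension, WriteMode
from syrupy.extensions.amber.serializer import AmberDataSerializer


class SingleFileAmberSnapshotExtension(SingleFileSnapshotExtension):
    _file_extension = 'ambr'
    _write_mode = WriteMode.TEXT

    def serialize(self, data, **kwargs):
        return AmberDataSerializer.serialize(data, **kwargs)


@pytest.fixture
def snapshot(snapshot):
    # Shadow syrupy's default fixture; use_extension returns a new
    # assertion object bound to the given extension class.
    return snapshot.use_extension(SingleFileAmberSnapshotExtension)
```

Alternatively, opt in per assertion with `snapshot(extension_class=SingleFileAmberSnapshotExtension)`.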

This means I am not going to invest any more time in this ticket. If there are no objections, this ticket can be closed.

StephanMeijer avatar Nov 22 '23 16:11 StephanMeijer

Let's keep the issue open. We shouldn't be parsing the snapshot file more than once. There's a cache. It must be getting invalidated. Not sure when I'll have time to dig into this but will loop back to this when I have a chance.

noahnu avatar Nov 25 '23 03:11 noahnu

> Let's keep the issue open. We shouldn't be parsing the snapshot file more than once. There's a cache. It must be getting invalidated. Not sure when I'll have time to dig into this but will loop back to this when I have a chance.

That is interesting, as the performance difference is about a factor of a hundred.

StephanMeijer avatar Nov 27 '23 08:11 StephanMeijer