Performance of referencing library vs deprecated jsonschema.RefResolver is very bad when there are a lot of references in schema

Open nathan-stender opened this issue 1 year ago • 4 comments

Hello!

I have a library, allotropy, that converts scientific data into JSON conforming to a schema called the Allotrope Simple Model (ASM).

The validation schemas are fairly large and complicated compared to other schemas I've seen in discussion boards, and are very modular, meaning there are a lot of references. In allotropy we store the ASM schemas directly, and remove all remote references, replacing them with local references under $defs.
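To put a number on "a lot of references", a small sketch like the one below can tally the `$ref` targets in a decoded schema. The helper name `count_refs` is hypothetical, and the walk is deliberately naive: it counts any string value stored under a `$ref` key, so it is an estimate rather than a spec-aware count.

```python
from collections import Counter
from typing import Any, Optional


def count_refs(node: Any, counts: Optional[Counter] = None) -> Counter:
    """Walk a decoded JSON schema and tally every string `$ref` target."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str):
            counts[ref] += 1
        # Recurse into every value, including the contents of $defs.
        for value in node.values():
            count_refs(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_refs(item, counts)
    return counts


# Tiny illustrative schema (not taken from the ASM schemas).
example = {
    "$defs": {"name": {"type": "string"}},
    "properties": {
        "first": {"$ref": "#/$defs/name"},
        "last": {"$ref": "#/$defs/name"},
    },
    "items": [{"$ref": "#/$defs/other"}],
}
ref_counts = count_refs(example)
```

Calling `sum(ref_counts.values())` gives the total number of references, and `ref_counts.most_common(5)` shows which definitions are referenced most often.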

We are finding that validating against the schemas using jsonschema version 4.18.0 takes ~20x longer than 4.17.0.

As a concrete example:

Validating this data: https://raw.githubusercontent.com/Benchling-Open-Source/allotropy/refs/heads/main/tests/parsers/moldev_softmax_pro/testdata/MD_SMP_luminescence_endpoint_example08.json

Against this schema: https://github.com/Benchling-Open-Source/allotropy/blob/main/src/allotropy/allotrope/schemas/adm/plate-reader/REC/2024/06/plate-reader.schema.json

takes ~3.5s on 4.17.0 and ~55s on 4.18.0

Across all 26 tests in tests/parsers/moldev_softmax_pro, the total runtime goes from ~30s on 4.17.0 to ~6m on 4.18.0.
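For anyone wanting to reproduce the comparison, a minimal timing sketch might look like the following. The helpers `best_of` and `validate_once` are hypothetical names, not part of allotropy or jsonschema; only `jsonschema.validators.validator_for` is a library call. The timing includes constructing the validator, which is an assumption about what the numbers above measured.

```python
import time
from typing import Any, Callable


def best_of(fn: Callable[[], Any], repeats: int = 3) -> float:
    """Return the fastest wall-clock time, in seconds, over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def validate_once(schema: dict, instance: Any) -> None:
    """Validate `instance` against `schema` with the installed jsonschema."""
    # Imported lazily so the timing helper above works without jsonschema.
    from jsonschema.validators import validator_for

    validator_cls = validator_for(schema)  # chosen from the "$schema" keyword
    validator_cls(schema).validate(instance)  # raises ValidationError on failure
```

With the schema and data files linked above saved locally and loaded via `json.load`, running `best_of(lambda: validate_once(schema, instance))` under each installed jsonschema version should give directly comparable numbers.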

nathan-stender avatar Sep 25 '24 14:09 nathan-stender

Hey there, I'm happy to have a look at this at some point, but is there a reason you're benchmarking against such an old version? Lots has changed since 4.18, so it'd be good if you shared numbers which were on 4.23.

Julian avatar Sep 25 '24 14:09 Julian

Sorry, I didn't mention that I tested on every version between 4.18 and 4.23 to see if any had better performance. None of the versions past 4.18 improve the performance noticeably.

On 4.23, the results are actually a bit worse:

For the single test: 55s

For the 26 tests: 6m33s

nathan-stender avatar Sep 25 '24 16:09 nathan-stender

We have also experienced a similar performance issue in one of our tools after switching from RefResolver to this library. This is the pull request in our library: https://github.com/PolusAI/workflow-inference-compiler/pull/287

sameeul avatar Nov 14 '24 16:11 sameeul

We may be encountering the same issue, according to the profiler. We are still investigating.

Version: referencing==0.35.1

From the profiler:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     8257    0.027    0.000    6.939    0.001 <path>/.venv/lib/python3.12/site-packages/referencing/_core.py:643(lookup)
     8257    0.027    0.000    6.495    0.001 <path>/.venv/lib/python3.12/site-packages/referencing/_core.py:405(get_or_retrieve)
     5187    0.923    0.000    6.087    0.001 <path>/.venv/lib/python3.12/site-packages/referencing/_core.py:485(crawl)
   907475    0.340    0.000    3.505    0.000 <path>/.venv/lib/python3.12/site-packages/referencing/_core.py:501(<genexpr>)
   907475    0.486    0.000    3.165    0.000 <path>/.venv/lib/python3.12/site-packages/referencing/_core.py:235(<genexpr>)

ilovelinux avatar Mar 05 '25 11:03 ilovelinux
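A profile like the one above can be collected with the standard library alone. The sketch below is one way to do it, assuming the slow validation is wrapped in a zero-argument callable; `profile_report` is a hypothetical helper name, and the `pattern` argument is a regular expression that `pstats` matches against the `filename:lineno(function)` column.

```python
import cProfile
import io
import pstats
from typing import Any, Callable


def profile_report(fn: Callable[[], Any], pattern: str = "", limit: int = 10) -> str:
    """Run `fn` under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(fn)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
    if pattern:
        # Restrictions apply in order: filter by regex, then keep `limit` rows.
        stats.print_stats(pattern, limit)
    else:
        stats.print_stats(limit)
    return buffer.getvalue()


def sample_work() -> int:
    """Stand-in workload; replace with the real validation call."""
    return sum(i * i for i in range(1000))


report = profile_report(sample_work)
```

Passing `pattern="referencing"` restricts the report to frames whose path contains that string, which is how rows such as `_core.py:485(crawl)` can be isolated from the rest of the call tree.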