Cache functionality
Also, why have separate _fileGrp_cache and _file_cache - doesn't the latter also contain the fileGrp info in a fast-to-access way? What would be more beneficial would be caching the pageId of a file because that requires a bit of a convoluted XPath to look up on-the-fly.
I think this should be explanatory enough:
```python
# Cache for the fileGrps (mets:fileGrp) - a dictionary with Key and Value pairs:
# Key: 'fileGrp.USE'
# Value: a 'fileGrp' object at some memory location
self._fileGrp_cache = {}

# Cache for the files (mets:file) - two nested dictionaries
# The outer dictionary's Key: 'fileGrp.USE'
# The outer dictionary's Value: Inner dictionary
# The inner dictionary's Key: 'file.ID'
# The inner dictionary's Value: a 'file' object at some memory location
self._file_cache = {}
```
The _fileGrp_cache holds a reference to the fileGrp element itself, i.e. its memory location.
The _file_cache holds just a string for the fileGrp (its USE attribute, as the outer key).
On second thought, maybe we do not benefit that much from holding fileGrp elements inside a separate cache. We can just search the tree when we need a specific element.
I get that, but the main/only use for querying about fileGrp is: Does this fileGrp exist (a) and what files are in it (b)? These can be answered with the _file_cache, by checking needle in _file_cache (a) and for el_file in self._file_cache[needle]: (b). Since the fileGrp element itself contains no other information in the METS that I'm aware of, it seems redundant to keep it around.
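For illustration, a minimal sketch of that point, assuming a file cache shaped like the nested dictionary shown above (the helper names are my own, not from the PR):

```python
# Answering both fileGrp questions from the nested file cache alone,
# i.e. {'fileGrp.USE': {'file.ID': <mets:file element>}}:

def file_group_exists(file_cache, use):
    # (a) does this fileGrp exist?
    return use in file_cache

def files_in_group(file_cache, use):
    # (b) which files are in it?
    return list(file_cache.get(use, {}).values())
```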

Building the METS file with 50 files inside improved from 8.7 s down to 508 ms! Here we go, less electricity consumption for everyone :)
Related issue: #723
Related discussion: OCR-D/zenhub#39
Still missing IIUC:
I would suggest diversifying test scenarios BTW:
- thousands of pages
- dozens of fileGrps
- fileGrps with hundreds of files per page (think ocrd-cis-ocropy-dewarp for line-level image files or ocrd-cis-ocropy-deskew/ocrd-tesserocr-deskew for region-level files)

And a switch to enable/disable caching – ideally somewhere central, but we don't have a general-purpose configuration mechanism yet (except for logging), cf. options.
Could you provide more specific combinations for the desired scenarios? For example: Scenario one: 5 mets:fileGrp, 10 mets:file per mets:fileGrp, 1000 physical pages with 10 mets:fptr each. etc.
I am getting confused by the different possible combinations, especially since I do not know which combinations the different OCR-D processors can actually produce.
The current benchmark scenarios are based on:
- 8 file groups
- 7 group regions
- 10 regions per group region (70 regions in total)
- 2 fileId per fileGrp
- 14 + 70 = 84 files per page
So, for 20/50/100 pages we have respectively 1680/4200/8400 fptr.
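For concreteness, a simplified sketch of how such a synthetic METS can be built (this is not the actual benchmark code; the fileGrp/ID naming and the use of OcrdMets.empty_mets/add_file here are my own illustration, and the region-level files of the real scenario are omitted for brevity):

```python
from ocrd_models import OcrdMets

NUM_PAGES = 20
FILE_GROUPS = ['OCR-D-GRP-%d' % i for i in range(8)]   # "8 file groups"
FILES_PER_GROUP_PER_PAGE = 2                           # "2 fileId per fileGrp"

mets = OcrdMets.empty_mets()
for page in range(1, NUM_PAGES + 1):
    page_id = 'PHYS_%04d' % page
    for grp in FILE_GROUPS:
        for n in range(FILES_PER_GROUP_PER_PAGE):
            # each mets:file is linked to its physical page via pageId (mets:fptr)
            mets.add_file(grp,
                          ID='%s_%04d_%d' % (grp, page, n),
                          mimetype='image/tiff',
                          pageId=page_id,
                          url='%s/%04d_%d.tif' % (grp, page, n))
```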
No, sorry, but I have no advice on specific numbers for such combinations. Your choices seem to be some "small to medium average" scenario. What I called for was adding more extreme scenarios – not necessarily under assets and not necessarily all possible combinations.

@bertsky, here are the results for 50-500-1000-2000-5000 pages. I forced iterations to 1 because it was already taking 3 days for 5000 pages (non-cached) to finish. I have also decreased the number of regions from 10 to 2 per page. With the current settings, I got all results in just a little over 24 hours. I included 50 pages as well for comparison with the previous results of 50 pages (10 regions vs 2 regions per page).
I have also created 4 groups of tests (build, build-cached, search, search-cached) instead of 2 (previously). Since the times in the table are given in ms and s, I will convert some of them to minutes and hours in my comment for easier comparison.
Building (non-cached) vs Building (cached):
- 50 pages: 1.140 s vs 95.00 ms
- 500 pages: 133.3 s (or 2.222 m) vs 5.800 s
- 1000 pages: 700.7 s (or 11.68 m) vs 26.87 s
- 2000 pages: 3 562 s (or 59.37 m) vs 145.7 s (or 2.428 m)
- 5000 pages: 27 059 s (or 7.516 h) vs 1 449 s (or 24.15 m)

Where: ms = milliseconds, s = seconds, m = minutes, h = hours
The searching methods run searches for fileGrp, fileID and physical page. For each, the best-case scenario (first hit) and the worst-case scenario (not present in the cache/element tree) search parameters are tried.
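To make that concrete, a small sketch of the three kinds of look-ups via OcrdMets.find_all_files (the IDs and values are my own, not the benchmark's):

```python
from ocrd_models import OcrdMets

mets = OcrdMets.empty_mets()
mets.add_file('OCR-D-IMG', ID='FILE_0001', mimetype='image/tiff',
              pageId='PHYS_0001', url='OCR-D-IMG/FILE_0001.tif')

# fileGrp search: best case (existing group) vs worst case (group not present)
assert len(mets.find_all_files(fileGrp='OCR-D-IMG')) == 1
assert len(mets.find_all_files(fileGrp='OCR-D-MISSING')) == 0
# file ID search and physical page search
assert len(mets.find_all_files(ID='FILE_0001')) == 1
assert len(mets.find_all_files(pageId='PHYS_0001')) == 1
```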
Searching (non-cached) vs Searching (cached):
- 50 pages: 4.833 ms vs 2.181 ms
- 500 pages: 71.88 ms vs 20.88 ms
- 1000 pages: 199.5 ms vs 50.47 ms
- 2000 pages: 509.0 ms vs 106.6 ms
- 5000 pages: 1528 ms (or 1.528 s) vs 284.7 ms
Issue: https://github.com/OCR-D/zenhub/issues/7
Thanks @MehmedGIT, very good!
Perhaps you could include your local results as a Markdown table into ocrd_models/README.md or a new file under tests/model/README.md?
Still missing IIUC:
- fileGrps with hundreds of files per page (think ocrd-cis-ocropy-dewarp for line-level image files or ocrd-cis-ocropy-deskew/ocrd-tesserocr-deskew for region-level files)
(in particular, hundreds of files per page combined with hundreds or even thousands of pages – this is realistic with dewarping in the workflow)
And a switch to enable/disable caching – ideally somewhere central, but we don't have a general-purpose configuration mechanism yet (except for logging), cf. options.
@kba your thoughts on this aspect?
Perhaps you could include your local results as a Markdown table into ocrd_models/README.md or a new file under tests/model/README.md?
@bertsky, yes, I will do that once I have the additional extreme results (below).
(in particular, hundreds of files per page combined with hundreds or even thousands of pages – this is realistic with dewarping in the workflow)
I have started now benchmarking with 750 files per page and 5000 pages - both for cached and non-cached. Please be patient about the results since the non-cached version would probably take a lot of processing time. Just for comparison - the results above for 5000 pages (30 files per page) took 7.5 hours.
@bertsky,
I have canceled the test execution because even the building of the METS file for the non-cached version had not finished after almost 5 days. Since I could not use my work laptop for anything else efficiently due to the test's 100% CPU usage, I had to cancel it. I have pushed the test case for 750 files per page and 5000 pages in case someone wants to experiment further with different numbers of files and pages.

@MehmedGIT understood. (Perhaps a server machine or SSH build on CircleCI would help with your resource limitation.)
It would help at least knowing how much that test took with caching enabled, though...
@bertsky, I will execute the tests again on a cloud VM provided by my company. I was also thinking about whether 750 files per page is more than what we need. Do you think 300-400 files per page is still good enough? What is the theoretical maximum for files per page, if there is one?
Another thing I could do is to increase the files per page but reduce the number of pages. Say 1500 files per page but "only" 2000 pages? 2000 pages is still roughly 5 times the average page count from the statistics I have.

Do you think 300-400 files per page is still good enough?
yes
What is the theoretical maximum for files per page, if there is one?
there's nothing I can think of. Practically, we might see newspaper pages with 1000 lines or so. Also, for the typegroups classification case, there's even a use-case for word-level image files. But these are rather rare cases. And we already get the picture by running a more conservative high-file count case. Especially if your statistics allow breaking down page count scaling behaviour.
Another thing I could do is to increase the files per page but reduce the number of pages. Say 1500 files per page but "only" 2000 pages? 2000 pages is still roughly 5 times the average page count from the statistics I have.
Yes. By the same logic, I would recommend reducing the page number to under 1000.
Since the caching_functionality branch was 150 commits behind master, I merged master into it. The test example with 1500 files per page and 500 pages is running inside a cloud VM. Once I have the results, I will publish them.
@bertsky, here is the important part of the output for results for 1500 files per page and 500 pages. As I mentioned in the tech call, searching tests have failed.
--------------------------------------------------- benchmark 'build': 1 tests --------------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
---------------------------------------------------------------------------------------------------------------------------------
test_b500 204,940.3885 204,940.3885 204,940.3885 0.0000 204,940.3885 0.0000 0;0 0.0000 1 1
---------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------- benchmark 'build_cached': 1 tests ---------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------
test_b500_c 10,047.6235 10,047.6235 10,047.6235 0.0000 10,047.6235 0.0000 0;0 0.0001 1 1
-----------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
=========================== short test summary info ============================
FAILED mets_bench_extreme_additional.py::test_s500 - assert 1500 == 716
FAILED mets_bench_extreme_additional.py::test_s500_c - assert 1500 == 716
============= 2 failed, 2 passed in 641994.41s (7 days, 10:19:54) ==============
And this is the entire content of the output file:
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/cloud/core/tests/model
plugins: benchmark-3.4.1
collected 4 items
mets_bench_extreme_additional.py .F.F [100%]
=================================== FAILURES ===================================
__________________________________ test_s500 ___________________________________
benchmark = <pytest_benchmark.fixture.BenchmarkFixture object at 0x7f34a3ee4b50>
@mark.benchmark(group="search", max_time=0.1, min_rounds=1, disable_gc=False, warmup=False)
def test_s500(benchmark):
@benchmark
> def ret():
mets_bench_extreme_additional.py:91:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:127: in __call__
return self._raw(function_to_benchmark, *args, **kwargs)
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:149: in _raw
duration, iterations, loops_range = self._calibrate_timer(runner)
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:277: in _calibrate_timer
duration = runner(loops_range)
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:92: in runner
function_to_benchmark(*args, **kwargs)
mets_bench_extreme_additional.py:93: in ret
benchmark_find_files(500, mets_500)
mets_bench_extreme_additional.py:51: in benchmark_find_files
benchmark_find_files_physical_page(number_of_pages, mets)
mets_bench_extreme_additional.py:71: in benchmark_find_files_physical_page
assert_len(FILES_PER_PAGE, mets, dict(pageId='PHYS_0001'))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
expected_len = 1500
mets = <ocrd_models.ocrd_mets.OcrdMets object at 0x7f34c4432f70>
kwargs = {'pageId': 'PHYS_0001'}
def assert_len(expected_len, mets, kwargs):
test_list = mets.find_all_files(**kwargs)
> assert expected_len == len(test_list)
E assert 1500 == 716
E + where 716 = len([<ocrd_models.ocrd_file.OcrdFile object at 0x7f34a3e8e850>, <ocrd_models.ocrd_file.OcrdFile object at 0x7f34a3e8fac0>,...rd_models.ocrd_file.OcrdFile object at 0x7f34a3e8fe80>, <ocrd_models.ocrd_file.OcrdFile object at 0x7f34a3e8fee0>, ...])
mets_bench_extreme_additional.py:46: AssertionError
_________________________________ test_s500_c __________________________________
benchmark = <pytest_benchmark.fixture.BenchmarkFixture object at 0x7f34a3e8fb80>
@mark.benchmark(group="search_cached", max_time=0.1, min_rounds=1, disable_gc=False, warmup=False)
def test_s500_c(benchmark):
@benchmark
> def ret():
mets_bench_extreme_additional.py:107:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:127: in __call__
return self._raw(function_to_benchmark, *args, **kwargs)
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:149: in _raw
duration, iterations, loops_range = self._calibrate_timer(runner)
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:277: in _calibrate_timer
duration = runner(loops_range)
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:92: in runner
function_to_benchmark(*args, **kwargs)
mets_bench_extreme_additional.py:109: in ret
benchmark_find_files(500, mets_c_500)
mets_bench_extreme_additional.py:51: in benchmark_find_files
benchmark_find_files_physical_page(number_of_pages, mets)
mets_bench_extreme_additional.py:71: in benchmark_find_files_physical_page
assert_len(FILES_PER_PAGE, mets, dict(pageId='PHYS_0001'))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
expected_len = 1500
mets = <ocrd_models.ocrd_mets.OcrdMets object at 0x7f34a3ee4a00>
kwargs = {'pageId': 'PHYS_0001'}
def assert_len(expected_len, mets, kwargs):
test_list = mets.find_all_files(**kwargs)
> assert expected_len == len(test_list)
E assert 1500 == 716
E + where 716 = len([<ocrd_models.ocrd_file.OcrdFile object at 0x7f34a3e87850>, <ocrd_models.ocrd_file.OcrdFile object at 0x7f34a3e87b50>,...rd_models.ocrd_file.OcrdFile object at 0x7f34a3e87970>, <ocrd_models.ocrd_file.OcrdFile object at 0x7f34a3e87cd0>, ...])
mets_bench_extreme_additional.py:46: AssertionError
--------------------------------------------------- benchmark 'build': 1 tests --------------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
---------------------------------------------------------------------------------------------------------------------------------
test_b500 204,940.3885 204,940.3885 204,940.3885 0.0000 204,940.3885 0.0000 0;0 0.0000 1 1
---------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------- benchmark 'build_cached': 1 tests ---------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------
test_b500_c 10,047.6235 10,047.6235 10,047.6235 0.0000 10,047.6235 0.0000 0;0 0.0001 1 1
-----------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
=========================== short test summary info ============================
FAILED mets_bench_extreme_additional.py::test_s500 - assert 1500 == 716
FAILED mets_bench_extreme_additional.py::test_s500_c - assert 1500 == 716
============= 2 failed, 2 passed in 641994.41s (7 days, 10:19:54) ==============
I have now cleaned up the test setup and merged #955 into this, I think this is ready for merge.
- The benchmarks are not run as part of make test, but with make benchmark or make benchmark-extreme. For the CI we run make test benchmarks.
- test_ocrd_mets.py will now test both uncached and cached OcrdMets (see the sketch below)
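A minimal sketch (not the actual test code) of how covering both variants can look, assuming OcrdMets.empty_mets accepts the cache_flag argument introduced here:

```python
import pytest
from ocrd_models import OcrdMets

# run every test once with an uncached and once with a cached OcrdMets
@pytest.fixture(params=[False, True], ids=['uncached', 'cached'])
def mets(request):
    return OcrdMets.empty_mets(cache_flag=request.param)

def test_add_and_find(mets):
    mets.add_file('OCR-D-IMG', ID='FILE_0001', mimetype='image/tiff',
                  pageId='PHYS_0001', url='OCR-D-IMG/FILE_0001.tif')
    assert len(mets.find_all_files(fileGrp='OCR-D-IMG')) == 1
```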
What is still missing for release?
- CLI flags
- anything else? @MehmedGIT
It's finished now but instead of CLI flags, we've opted for an environment variable OCRD_METS_CACHING which must be set to true to globally enable the METS caching.
We decided against CLI flags because that would require a lot of changes in all the places in the code that work with METS, especially passing the cache_flag value through function calls down to where the METS constructor is called.
We need to tackle a proper, unified configuration option soon and then align the caching behavior with it.
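For reference, a minimal usage sketch (my own illustration, not from the PR): the variable has to be set in the process environment before core reads it; whether that happens at import time or at METS construction time is not specified here, so setting it as early as possible is the safe choice.

```python
import os
os.environ['OCRD_METS_CACHING'] = 'true'   # set before any ocrd code reads the environment

from ocrd_models import OcrdMets            # imported only after the variable is set
mets = OcrdMets(filename='mets.xml')        # caching should now be active for this document
```

Equivalently, the variable can simply be exported in the shell that launches the processor.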
Great to hear this is finally ready. My understanding was that we merge https://github.com/OCR-D/core/pull/678 first and see how this affects RSS (or PSS) throughout your test scenarios. Do you have any measurements like that already?
(And since you did decide in favour of environment variables here, should that affect the discussion about profiling settings as well?)
Great to hear this is finally ready. My understanding was that we merge #678 first and see how this affects RSS (or PSS) throughout your test scenarios. Do you have any measurements like that already?
@MehmedGIT Do we?
(And since you did decide in favour of environment variables here, should that affect the discussion about profiling settings as well?)
For now, I would use environment variables like about:flags in browsers. We document the behavior but require an extra step (setting the env var). For the long-term, I would now prefer a configuration file with env var overrides. We should discuss that in the tech calls and/or our workshop in December.
Great to hear this is finally ready. My understanding was that we merge https://github.com/OCR-D/core/pull/678 first and see how this affects RSS (or PSS) throughout your test scenarios. Do you have any measurements like that already?
No, we don't. I was waiting for #678 to be merged to test the memory consumption. However, there are Nextflow reports for the cached and non-cached executions of the same workflow on the same workspace here. As you know from my previous presentation of these results, interpreting the memory usage values from the Nextflow reports was rather hard.
@bertsky, @kba I would also refer again to the summary in #946. From the table there, we can see the comparisons and that memory usage does not increase in all cases when caching is used.
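Not part of this PR, but a rough sketch (my own, assuming psutil and the cache_flag constructor argument) of how such a measurement could be obtained without Nextflow: run the script below once with and once without --cached and compare the reported RSS.

```python
import sys
import psutil  # third-party: pip install psutil
from ocrd_models import OcrdMets

cached = '--cached' in sys.argv
mets = OcrdMets.empty_mets(cache_flag=cached)
for page in range(1, 501):
    mets.add_file('OCR-D-IMG', ID='FILE_%04d' % page, mimetype='image/tiff',
                  pageId='PHYS_%04d' % page, url='OCR-D-IMG/FILE_%04d.tif' % page)
print('RSS after building 500 pages (%s): %d MiB'
      % ('cached' if cached else 'uncached',
         psutil.Process().memory_info().rss // (1024 * 1024)))
```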