
Cache functionality

Open MehmedGIT opened this issue 3 years ago • 15 comments

MehmedGIT avatar May 31 '22 07:05 MehmedGIT

Also, why have separate _fileGrp_cache and _file_cache - doesn't the latter also contain the fileGrp info in a fast-to-access way? What would be more beneficial would be caching the pageId of a file because that requires a bit of a convoluted XPath to look up on-the-fly.

kba avatar May 31 '22 10:05 kba

Also, why have separate _fileGrp_cache and _file_cache - doesn't the latter also contain the fileGrp info in a fast-to-access way? What would be more beneficial would be caching the pageId of a file because that requires a bit of a convoluted XPath to look up on-the-fly.

I think this should be self-explanatory:

# Cache for the fileGrps (mets:fileGrp) - a dictionary with key/value pairs:
# Key: 'fileGrp.USE'
# Value: the 'fileGrp' element at some memory location
self._fileGrp_cache = {}

# Cache for the files (mets:file) - two nested dictionaries:
# The outer dictionary's Key: 'fileGrp.USE'
# The outer dictionary's Value: the inner dictionary
# The inner dictionary's Key: 'file.ID'
# The inner dictionary's Value: the 'file' element at some memory location
self._file_cache = {}

The _fileGrp_cache holds a reference to the fileGrp element itself, i.e. its memory location. The _file_cache only stores the fileGrp's USE string as a key.

On second thought, maybe we do not benefit that much from holding fileGrp elements in a separate cache. We can just search the tree whenever we need a specific element, as in the sketch below.
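A minimal sketch of that on-the-fly tree search, assuming an lxml element tree and the standard METS namespace (the file name and USE value are illustrative only):

from lxml import etree

NS = {'mets': 'http://www.loc.gov/METS/'}
tree = etree.parse('mets.xml')  # illustrative path

# look up a single fileGrp by its USE attribute instead of caching it
el_fileGrp = tree.find('.//mets:fileGrp[@USE="OCR-D-IMG"]', namespaces=NS)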

MehmedGIT avatar May 31 '22 13:05 MehmedGIT

The _fileGrp_cache holds a reference to the fileGrp element itself, i.e. its memory location. The _file_cache only stores the fileGrp's USE string as a key.

I get that, but the main (perhaps only) uses for querying a fileGrp are: (a) does this fileGrp exist, and (b) which files are in it? Both can be answered with the _file_cache alone, by checking needle in _file_cache for (a) and for el_file in self._file_cache[needle]: for (b). Since the fileGrp element itself contains no other information in the METS that I'm aware of, it seems redundant to keep it around.
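For illustration, a standalone sketch of those two lookups against the nested-dict layout described earlier (placeholder objects stand in for the cached mets:file elements):

# _file_cache layout: {fileGrp USE: {file ID: mets:file element}}
file_cache = {'OCR-D-IMG': {'FILE_0001': object(), 'FILE_0002': object()}}
needle = 'OCR-D-IMG'

# (a) does this fileGrp exist?
exists = needle in file_cache

# (b) which files are in it? Iterate the inner dictionary's values to get
# the elements; iterating the dictionary directly would yield the file IDs.
for el_file in file_cache.get(needle, {}).values():
    print(el_file)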

kba avatar May 31 '22 15:05 kba

[Screenshot from 2022-06-07 17-01-51: benchmark results]

Building the METS file with 50 files inside improved from 8.7 s down to 508 ms! Here we go, less electricity consumption for everyone :)

MehmedGIT avatar Jun 07 '22 15:06 MehmedGIT

Related issue: https://github.com/OCR-D/core/issues/723

Related discussion: https://github.com/OCR-D/zenhub/issues/39

Still missing IIUC:

I would suggest diversifying test scenarios BTW:

  • thousands of pages
  • dozens of fileGrps
  • fileGrps with hundreds of files per page (think ocrd-cis-ocropy-dewarp for line level image files or ocrd-cis-ocropy-deskew / ocrd-tesserocr-deskew for region level files)

And a switch to enable/disable caching – ideally somewhere central, but we don't have a general-purpose configuration mechanism yet (except for logging), cf. options.

bertsky avatar Jun 22 '22 12:06 bertsky

Related issue: #723

Related discussion: OCR-D/zenhub#39

Still missing IIUC:

I would suggest diversifying test scenarios BTW:

  • thousands of pages
  • dozens of fileGrps
  • fileGrps with hundreds of files per page (think ocrd-cis-ocropy-dewarp for line level image files or ocrd-cis-ocropy-deskew / ocrd-tesserocr-deskew for region level files)

And a switch to enable/disable caching – ideally somewhere central, but we don't have a general-purpose configuration mechanism yet (except for logging), cf. options.

Could you provide more specific combinations for the desired scenarios? For example, scenario one: 5 mets:fileGrp, 10 mets:file per mets:fileGrp, 1000 physical pages with 10 mets:fptr each; etc.

I am getting confused by the different possible combinations, especially since I do not know which combinations the different OCR-D processors can actually produce.

The current benchmark scenarios are based on:

  • 8 file groups
  • 7 group regions
  • 10 regions per group region (70 regions in total)
  • 2 fileId per fileGrp
  • 14 + 70 = 84 files per page

So, for 20/50/100 pages we have respectively 1680/4200/8400 fptr.
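As a trivial cross-check of those totals:

files_per_page = 14 + 70  # 84, as derived above
for pages in (20, 50, 100):
    print(pages, pages * files_per_page)  # 1680, 4200, 8400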

MehmedGIT avatar Jun 22 '22 13:06 MehmedGIT

No, sorry, but I have no advice on specific numbers for such combinations. Your choices seem to be some "small to medium average" scenario. What I called for was adding more extreme scenarios – not necessarily under assets and not necessarily all possible combinations.

bertsky avatar Jun 23 '22 10:06 bertsky

[Screenshot from 2022-06-28 06-39-30: benchmark results for 50-5000 pages]

@bertsky, here are the results for 50-500-1000-2000-5000 pages. I forced iterations to 1 because it was already taking 3 days for 5000 pages (non-cached) to finish. I have also decreased the number of regions from 10 to 2 per page. With the current settings, I got all results in just a little over 24 hours. I included 50 pages as well for comparison with the previous results of 50 pages (10 regions vs 2 regions per page).

I have also created 4 groups of tests (build, build-cached, search, search-cached) instead of 2 (previously). Since the time in the table is given in ms and s I will convert some of these to hours in my comment for easier comparison.

Building (non-cached) vs Building (cached):

  • 50 pages: 1.140 s vs 95.00 ms
  • 500 pages: 133.3 s (or 2.222 m) vs 5.800 s
  • 1000 pages: 700.7 s (or 11.68 m) vs 26.87 s
  • 2000 pages: 3 562 s (or 59.37 m) vs 145.7 s (or 2.428 m)
  • 5000 pages: 27 059 s (or 7.516 h) vs 1 449 s (or 24.15 m)

Where: ms = milliseconds, s = seconds, m = minutes, h = hours

The search methods look up a fileGrp, a fileID, and a physical page. For each, both best-case (first hit) and worst-case (not present in the cache/element tree) search parameters are tried.

Searching (non-cached) vs Searching (cached):

  • 50 pages: 4.833 ms vs 2.181 ms
  • 500 pages: 71.88 ms vs 20.88 ms
  • 1000 pages: 199.5 ms vs 50.47 ms
  • 2000 pages: 509.0 ms vs 106.6 ms
  • 5000 pages: 1528 ms (or 1.528 s) vs 284.7 ms
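A condensed sketch of what such a search benchmark looks like with pytest-benchmark, following the pattern of the test file quoted later in this thread (the METS file name and search values here are hypothetical):

from pytest import mark
from ocrd_models import OcrdMets

mets_500 = OcrdMets(filename='mets_500.xml')  # hypothetical pre-built METS

@mark.benchmark(group="search", max_time=0.1, min_rounds=1)
def test_s500(benchmark):
    @benchmark
    def ret():
        # best case: a fileGrp that is hit immediately
        mets_500.find_all_files(fileGrp='OCR-D-IMG')
        # worst case: a fileGrp absent from both cache and element tree
        mets_500.find_all_files(fileGrp='NON-EXISTENT')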

Issue: https://github.com/OCR-D/zenhub/issues/7

MehmedGIT avatar Jun 28 '22 09:06 MehmedGIT

Thanks @MehmedGIT, very good!

Perhaps you could include your local results as a Markdown table into ocrd_models/README.md or a new file under tests/model/README.md?

Still missing IIUC:

  • fileGrps with hundreds of files per page (think ocrd-cis-ocropy-dewarp for line level image files or ocrd-cis-ocropy-deskew / ocrd-tesserocr-deskew for region level files)

(in particular, hundreds of files per page combined with hundreds or even thousands of pages – this is realistic with dewarping in the workflow)

And a switch to enable/disable caching – ideally somewhere central, but we don't have a general-purpose configuration mechanism yet (except for logging), cf. options.

@kba your thoughts on this aspect?

bertsky avatar Aug 22 '22 13:08 bertsky

Perhaps you could include your local results as a Markdown table into ocrd_models/README.md or a new file under tests/model/README.md?

@bertsky, yes, I will do that once I have the additional extreme results (below).

(in particular, hundreds of files per page combined with hundreds or even thousands of pages – this is realistic with dewarping in the workflow)

I have now started benchmarking with 750 files per page and 5000 pages - both cached and non-cached. Please be patient about the results, since the non-cached version will probably take a lot of processing time. Just for comparison: the results above for 5000 pages (30 files per page) took 7.5 hours.

MehmedGIT avatar Sep 29 '22 11:09 MehmedGIT

@bertsky,

I have canceled the test execution: even building the METS file for the non-cached version had not finished after almost 5 days, and since the test's 100% CPU usage meant I could not use my work laptop efficiently for anything else, I had to stop it. I have pushed the test case for 750 files per page and 5000 pages in case someone wants to experiment further with different numbers of files and pages.

[Screenshot from 2022-10-04 12-16-52: state of the canceled benchmark run]

MehmedGIT avatar Oct 04 '22 10:10 MehmedGIT

@MehmedGIT understood. (Perhaps a server machine or SSH build on CircleCI would help with your resource limitation.)

It would help at least knowing how much that test took with caching enabled, though...

bertsky avatar Oct 04 '22 12:10 bertsky

@MehmedGIT understood. (Perhaps a server machine or SSH build on CircleCI would help with your resource limitation.)

It would help at least knowing how much that test took with caching enabled, though...

@bertsky, I will execute the tests again on a cloud VM provided by my company. I was also wondering whether 750 files per page is more than what we need. Do you think 300-400 files per page is still good enough? What is the theoretical maximum for files per page, if there is one?

Another thing I could do is increase the files per page but reduce the number of pages. Say 1500 files per page but "only" 2000 pages? 2000 pages is still roughly 5 times the average page count from the statistics I have.

[Chart: page-count statistics ("Seiten" = pages)]

MehmedGIT avatar Oct 07 '22 12:10 MehmedGIT

Do you think 300-400 files per page is still good enough?

yes

What is the theoretical maximum for files per page, if there is one?

There's nothing I can think of. Practically, we might see newspaper pages with 1000 lines or so. Also, for the typegroups classification case, there's even a use case for word-level image files. But these are rather rare cases, and we already get the picture by running a more conservative high file-count case, especially if your statistics allow breaking down the page-count scaling behaviour.

Another thing I could do is increase the files per page but reduce the number of pages. Say 1500 files per page but "only" 2000 pages? 2000 pages is still roughly 5 times the average page count from the statistics I have.

Yes. By the same logic, I would recommend reducing the page number to under 1000.

bertsky avatar Oct 07 '22 13:10 bertsky

Since the caching_functionality branch was 150 commits behind master, I merged master into it. The test with 1500 files per page and 500 pages is running on a cloud VM. Once I have the results, I will publish them.

MehmedGIT avatar Oct 12 '22 18:10 MehmedGIT

@bertsky, here is the important part of the output for 1500 files per page and 500 pages. As I mentioned in the tech call, the search tests failed.

--------------------------------------------------- benchmark 'build': 1 tests --------------------------------------------------
Name (time in s)              Min           Max          Mean  StdDev        Median     IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------
test_b500            204,940.3885  204,940.3885  204,940.3885  0.0000  204,940.3885  0.0000       0;0  0.0000       1           1
---------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------- benchmark 'build_cached': 1 tests ---------------------------------------------
Name (time in s)             Min          Max         Mean  StdDev       Median     IQR  Outliers     OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------
test_b500_c          10,047.6235  10,047.6235  10,047.6235  0.0000  10,047.6235  0.0000       0;0  0.0001       1           1
-----------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
=========================== short test summary info ============================
FAILED mets_bench_extreme_additional.py::test_s500 - assert 1500 == 716
FAILED mets_bench_extreme_additional.py::test_s500_c - assert 1500 == 716
============= 2 failed, 2 passed in 641994.41s (7 days, 10:19:54) ==============

And this is the entire content of the output file:

============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/cloud/core/tests/model
plugins: benchmark-3.4.1
collected 4 items

mets_bench_extreme_additional.py .F.F                                    [100%]

=================================== FAILURES ===================================
__________________________________ test_s500 ___________________________________

benchmark = <pytest_benchmark.fixture.BenchmarkFixture object at 0x7f34a3ee4b50>

    @mark.benchmark(group="search", max_time=0.1, min_rounds=1, disable_gc=False, warmup=False)
    def test_s500(benchmark):
        @benchmark
>       def ret():

mets_bench_extreme_additional.py:91: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:127: in __call__
    return self._raw(function_to_benchmark, *args, **kwargs)
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:149: in _raw
    duration, iterations, loops_range = self._calibrate_timer(runner)
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:277: in _calibrate_timer
    duration = runner(loops_range)
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:92: in runner
    function_to_benchmark(*args, **kwargs)
mets_bench_extreme_additional.py:93: in ret
    benchmark_find_files(500, mets_500)
mets_bench_extreme_additional.py:51: in benchmark_find_files
    benchmark_find_files_physical_page(number_of_pages, mets)
mets_bench_extreme_additional.py:71: in benchmark_find_files_physical_page
    assert_len(FILES_PER_PAGE, mets, dict(pageId='PHYS_0001'))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

expected_len = 1500
mets = <ocrd_models.ocrd_mets.OcrdMets object at 0x7f34c4432f70>
kwargs = {'pageId': 'PHYS_0001'}

    def assert_len(expected_len, mets, kwargs):
        test_list = mets.find_all_files(**kwargs)
>       assert expected_len == len(test_list)
E       assert 1500 == 716
E        +  where 716 = len([<ocrd_models.ocrd_file.OcrdFile object at 0x7f34a3e8e850>, <ocrd_models.ocrd_file.OcrdFile object at 0x7f34a3e8fac0>,...rd_models.ocrd_file.OcrdFile object at 0x7f34a3e8fe80>, <ocrd_models.ocrd_file.OcrdFile object at 0x7f34a3e8fee0>, ...])

mets_bench_extreme_additional.py:46: AssertionError
_________________________________ test_s500_c __________________________________

benchmark = <pytest_benchmark.fixture.BenchmarkFixture object at 0x7f34a3e8fb80>

    @mark.benchmark(group="search_cached", max_time=0.1, min_rounds=1, disable_gc=False, warmup=False)
    def test_s500_c(benchmark):
        @benchmark
>       def ret():

mets_bench_extreme_additional.py:107: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:127: in __call__
    return self._raw(function_to_benchmark, *args, **kwargs)
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:149: in _raw
    duration, iterations, loops_range = self._calibrate_timer(runner)
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:277: in _calibrate_timer
    duration = runner(loops_range)
../../../venv-core-dev/lib/python3.8/site-packages/pytest_benchmark/fixture.py:92: in runner
    function_to_benchmark(*args, **kwargs)
mets_bench_extreme_additional.py:109: in ret
    benchmark_find_files(500, mets_c_500)
mets_bench_extreme_additional.py:51: in benchmark_find_files
    benchmark_find_files_physical_page(number_of_pages, mets)
mets_bench_extreme_additional.py:71: in benchmark_find_files_physical_page
    assert_len(FILES_PER_PAGE, mets, dict(pageId='PHYS_0001'))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

expected_len = 1500
mets = <ocrd_models.ocrd_mets.OcrdMets object at 0x7f34a3ee4a00>
kwargs = {'pageId': 'PHYS_0001'}

    def assert_len(expected_len, mets, kwargs):
        test_list = mets.find_all_files(**kwargs)
>       assert expected_len == len(test_list)
E       assert 1500 == 716
E        +  where 716 = len([<ocrd_models.ocrd_file.OcrdFile object at 0x7f34a3e87850>, <ocrd_models.ocrd_file.OcrdFile object at 0x7f34a3e87b50>,...rd_models.ocrd_file.OcrdFile object at 0x7f34a3e87970>, <ocrd_models.ocrd_file.OcrdFile object at 0x7f34a3e87cd0>, ...])

mets_bench_extreme_additional.py:46: AssertionError

--------------------------------------------------- benchmark 'build': 1 tests --------------------------------------------------
Name (time in s)              Min           Max          Mean  StdDev        Median     IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------
test_b500            204,940.3885  204,940.3885  204,940.3885  0.0000  204,940.3885  0.0000       0;0  0.0000       1           1
---------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------- benchmark 'build_cached': 1 tests ---------------------------------------------
Name (time in s)             Min          Max         Mean  StdDev       Median     IQR  Outliers     OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------
test_b500_c          10,047.6235  10,047.6235  10,047.6235  0.0000  10,047.6235  0.0000       0;0  0.0001       1           1
-----------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
=========================== short test summary info ============================
FAILED mets_bench_extreme_additional.py::test_s500 - assert 1500 == 716
FAILED mets_bench_extreme_additional.py::test_s500_c - assert 1500 == 716
============= 2 failed, 2 passed in 641994.41s (7 days, 10:19:54) ==============

MehmedGIT avatar Oct 26 '22 13:10 MehmedGIT

I have now cleaned up the test setup and merged #955 into this; I think this is ready for merge.

  • The benchmarks are not run as part of make test, but with make benchmark or make benchmark-extreme. For the CI we run make test benchmarks.
  • The test_ocrd_mets.py will now test both uncached and cached OcrdMets (see the sketch below)
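A sketch of how that dual coverage might look, assuming the constructor accepts a cache_flag argument as introduced in this PR (the METS path and test body are illustrative):

from pytest import fixture
from ocrd_models import OcrdMets

# run each test once without and once with the cache
@fixture(params=[False, True], ids=['uncached', 'cached'])
def sample_mets(request):
    return OcrdMets(filename='mets.xml', cache_flag=request.param)

def test_find_all_files(sample_mets):
    assert len(sample_mets.find_all_files(fileGrp='OCR-D-IMG')) > 0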

What is still missing for release?

  • CLI flags
  • anything else? @MehmedGIT

kba avatar Nov 20 '22 15:11 kba

It's finished now, but instead of CLI flags we've opted for an environment variable, OCRD_METS_CACHING, which must be set to true to globally enable METS caching.

We decided against CLI flags because they would require a lot of changes in all the places in the code that work with METS, especially passing the cache_flag value through function calls all the way down to the METS constructor.
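For reference, a minimal sketch of what reading that switch boils down to (the exact parsing in core may differ):

import os

# enable globally from the shell with e.g.:  export OCRD_METS_CACHING=true
caching_enabled = os.environ.get('OCRD_METS_CACHING', '').lower() == 'true'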

We need to tackle a proper, unified configuration option soon and then align the caching behavior with it.

kba avatar Nov 22 '22 16:11 kba

Great to hear this is finally ready. My understanding was that we merge https://github.com/OCR-D/core/pull/678 first and see how this affects RSS (or PSS) throughout your test scenarios. Do you have any measurements like that already?

(And since you did decide in favour of environment variables here, should that affect the discussion about profiling settings as well?)

bertsky avatar Nov 22 '22 16:11 bertsky

Great to hear this is finally ready. My understanding was that we merge #678 first and see how this affects RSS (or PSS) throughout your test scenarios. Do you have any measurements like that already?

@MehmedGIT Do we?

(And since you did decide in favour of environment variables here, should that affect the discussion about profiling settings as well?)

For now, I would use environment variables like about:flags in browsers: we document the behavior but require an extra step (setting the env var). For the long term, I would now prefer a configuration file with env-var overrides. We should discuss that in the tech calls and/or our workshop in December.

kba avatar Nov 23 '22 10:11 kba

Great to hear this is finally ready. My understanding was that we merge https://github.com/OCR-D/core/pull/678 first and see how this affects RSS (or PSS) throughout your test scenarios. Do you have any measurements like that already?

No, we don't. I was waiting for #678 to be merged before testing the memory consumption. However, there are Nextflow reports for the cached and non-cached executions of the same workflow on the same workspace here. As you know from my previous presentation of these results, interpreting the memory usage values from Nextflow reports was a rather hard task.

MehmedGIT avatar Nov 23 '22 10:11 MehmedGIT

@bertsky, @kba I would also refer again to the summary in #946. The table there shows the comparisons, and that memory usage does not increase in every case when caching is used.

MehmedGIT avatar Nov 23 '22 11:11 MehmedGIT