
performance of input_files on large workspaces

Open bertsky opened this issue 4 years ago • 9 comments

By implementing #635 to properly handle all cases of PAGE-XML file matching per pageId, we have lost sight of the severe performance penalty that this comes with. In effect, we are now nearly as slow as before #482 on workspaces with lots of pages and fileGrps.

Here's a typical scenario:

  1. ocrd-cis-ocropy-dewarp creates an image file for each text line and references it under the pageId and fileGrp it belongs to, with an image mimetype (this creates 29,000 files for me in a workspace with 500 pages)
  2. ocrd-tesserocr-recognize runs afterwards and queries its self.input_files: https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd/ocrd/processor/base.py#L294-L299
  3. this searches through all mets:file entries, matching them for fileGrp (which is reasonably fast; it only gets a little inefficient when additionally filtering by pageId): https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_mets.py#L176-L208
  4. Then in line 298 (and again further below) it queries OcrdFile.pageId: https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_file.py#L116-L122
  5. This in turn needs to repeatedly query the whole structMap via XPath (which on a workspace with 500 pages, 25 fileGrps and 200,000 files takes about 0.2 s per file, i.e. more than 1 h just for the computation of input_files): https://github.com/OCR-D/core/blob/9069a6581f37ec1c189e8cfaa62692fb66004964/ocrd_models/ocrd_models/ocrd_mets.py#L434-L441

A cosmetic fix like turning OcrdFile.pageId into a functools.cached_property won't help here; the problem is bigger. METS, with its mutually related fileGrp and pageId mappings, is inherently expensive to parse. I know we have in the past decided against in-memory representations like dicts because they looked like memory leaks or seemed too expensive on very large workspaces. But have we really weighed that memory/CPU-time tradeoff carefully yet (also considering the necessity of pageId/mimetype filtering)? Is there any existing code attempting to cache fileGrp and pageId mappings to avoid reparsing the METS again and again, which I could tamper with?
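For concreteness, the kind of caching I mean would look roughly like this (just a standalone sketch with plain lxml against the standard METS namespace, not the ocrd_models API; the helper name is made up):

```python
from lxml import etree

NS = {'mets': 'http://www.loc.gov/METS/'}

def build_fileid_to_pageid(mets_tree):
    # One pass over the physical structMap: map every FILEID to the @ID of
    # the mets:div[@TYPE="page"] that points to it.
    fileid_to_pageid = {}
    page_divs = mets_tree.xpath(
        '//mets:structMap[@TYPE="PHYSICAL"]//mets:div[@TYPE="page"]',
        namespaces=NS)
    for page_div in page_divs:
        for fptr in page_div.xpath('mets:fptr', namespaces=NS):
            fileid_to_pageid[fptr.get('FILEID')] = page_div.get('ID')
    return fileid_to_pageid

# one O(pages + files) pass instead of one structMap XPath per file;
# afterwards every OcrdFile.pageId access would be a dict lookup
tree = etree.parse('mets.xml')
fileid_to_pageid = build_fileid_to_pageid(tree)
```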

bertsky avatar Sep 28 '21 10:09 bertsky

5. This in turn needs to repeatedly query the whole structMap via XPath (which on a workspace with 500 pages, 25 fileGrps and 200,000 files takes about 0.2 s per file, i.e. more than 1 h just for the computation of input_files):

This is obviously bad and needs fixing :(

Is there any existing code attempting to cache fileGrp and pageId mappings to avoid reparsing the METS again and again, which I could tamper with?

I did work on an OcrdMetsFilter class for more fine-grained control for cloning workspaces but it was supposed to be used wherever METS is to be queried, cf. #582. One plan I had was to cache pageId in there to take the penalty of parsing everything just once. I had to abandon the PR for other business, and I'm not even sure how much effort it would be to merge master into it. But maybe it can serve as a starting point for a discussion on how to improve the query performance.

kba avatar Sep 28 '21 11:09 kba

I did work on an OcrdMetsFilter class for more fine-grained control for cloning workspaces but it was supposed to be used wherever METS is to be queried, cf. #582. One plan I had was to cache pageId in there to take the penalty of parsing everything just once.

Okay, so this is an elaborate filter for find_files, allowing for negative tests as well (though from before we had .. ranges for pageId). The function itself has the same structure, though. I don't see where exactly the pageId caching should happen. But it could use some cached pageId2fileId mapping (which would speed up any non-first call to OcrdMets.find_files). For the Processor.input_files loop above, however, it's the other direction we need, fileId2pageId (which would speed up any non-first call to OcrdFile.pageId and reduce the complexity of Processor.input_files).

I think we should try to parse mets:fileSec and mets:structMap only once, into fileId/pageId and fileId/fileGrp dicts (which we can reverse if needed), during Workspace's or OcrdMets's constructor (which is run by Resolver.workspace_from_url in processor tasks and in ocrd.cli.workspace).
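Roughly along these lines (again only a sketch of the idea, not a patch against OcrdMets; the class and attribute names are made up):

```python
from collections import defaultdict
from lxml import etree

NS = {'mets': 'http://www.loc.gov/METS/'}

class MetsIndex:
    # Hypothetical: the dicts OcrdMets (or Workspace) could build once in its
    # constructor instead of re-querying fileSec/structMap on every access.
    def __init__(self, mets_path):
        self._tree = etree.parse(mets_path)
        self.fileid_to_grp = {}        # mets:file/@ID -> mets:fileGrp/@USE
        self.fileid_to_el = {}         # mets:file/@ID -> mets:file element
        self.fileid_to_pageid = {}     # mets:file/@ID -> page mets:div/@ID
        self.pageid_to_fileids = defaultdict(list)  # the reverse direction
        # single pass over the fileSec
        for grp in self._tree.xpath('//mets:fileSec/mets:fileGrp', namespaces=NS):
            for el_file in grp.xpath('mets:file', namespaces=NS):
                self.fileid_to_grp[el_file.get('ID')] = grp.get('USE')
                self.fileid_to_el[el_file.get('ID')] = el_file
        # single pass over the physical structMap
        for page_div in self._tree.xpath(
                '//mets:structMap[@TYPE="PHYSICAL"]//mets:div[@TYPE="page"]',
                namespaces=NS):
            for fptr in page_div.xpath('mets:fptr', namespaces=NS):
                self.fileid_to_pageid[fptr.get('FILEID')] = page_div.get('ID')
                self.pageid_to_fileids[page_div.get('ID')].append(fptr.get('FILEID'))
```

The catch, of course, is that every mutation (add_file, removing files, re-assigning pageId) then has to update these dicts as well, otherwise they go stale; that is the real cost of the memory/CPU-time tradeoff.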

bertsky avatar Sep 29 '21 14:09 bertsky

Just to add another use-case to the scenario: even a simple OcrdMets.add_file can become inefficient on large workspaces (as slow as 1 op/s). The reason is that it first looks for an existing file with the same ID:

https://github.com/OCR-D/core/blob/ad32b00bf692c71a43dacac805e324625bbc447a/ocrd_models/ocrd_models/ocrd_mets.py#L304

So the proposed caching should also happen here.
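With an ID index as sketched above, that check becomes a dict lookup instead of a scan over the whole fileSec (sketch only; the real add_file of course also handles FLocat, mimetype, pageId etc., and the names here are made up):

```python
from lxml import etree

METS_FILE_TAG = '{http://www.loc.gov/METS/}file'

def add_file_indexed(index, el_filegrp, file_id):
    # O(1) duplicate check against the cached ID -> element dict,
    # instead of find_files(ID=...) walking all mets:file entries
    if file_id in index.fileid_to_el:
        raise FileExistsError("File with ID %s already exists" % file_id)
    el_file = etree.SubElement(el_filegrp, METS_FILE_TAG, ID=file_id)
    index.fileid_to_el[file_id] = el_file  # keep the index in sync
    return el_file
```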

bertsky avatar Apr 30 '22 21:04 bertsky

@MehmedGIT what's the status of your OcrdMets profiling experiment?

bertsky avatar Apr 30 '22 21:04 bertsky

@bertsky I have pushed my latest changes to the benchmarking branch, but I have not worked on that experiment since. @mweidling is investigating this topic in more depth, and I am available for discussions and support if needed. My personal opinion is that we should try to optimize the OcrdMets functionality as soon as possible.

MehmedGIT avatar May 01 '22 13:05 MehmedGIT

Just to add another use-case to the scenario: even a simple OcrdMets.add_file can become inefficient on large workspaces (as slow as 1 op/s). The reason is that it first looks for an existing file with the same ID:

https://github.com/OCR-D/core/blob/ad32b00bf692c71a43dacac805e324625bbc447a/ocrd_models/ocrd_models/ocrd_mets.py#L304

So the proposed caching should also happen here.

Thanks for pointing that out, @bertsky!

mweidling avatar May 02 '22 06:05 mweidling

@MehmedGIT I don't understand: the benchmark-mets branch does not seem to contain any actual changes to the modules, only additional tests under benchmarks (i.e. outside the normal test set). Also, under test_mets2.py it contains an ExtendedOcrdMets class with some caching.

Could it be that you actually need to incorporate these changes into ocrd_models.ocrd_mets.OcrdMets proper, so that one can see the diff and test in situ?

bertsky avatar May 02 '22 06:05 bertsky

@bertsky you are right. I have not changed the actual modules. In order to implement my own version of the functions, I extended the OcrdMets class. The reason I did that was that I did not know how to rebuild just the OcrdMets class on its own 😅 and found a fast solution that way. Moreover, the ExtendedOcrdMets class was more of a proof of concept, to see whether the functions can be optimized and to get some comparison results to trigger further discussion. The changes in the benchmarks branch are not a proper implementation ready to be merged.
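The pattern is simply to subclass and add cached lookups next to the inherited methods, roughly like this (not the actual branch code, just to illustrate the approach; it assumes the parsed tree is reachable as self._tree):

```python
from ocrd_models.ocrd_mets import OcrdMets

NS = {'mets': 'http://www.loc.gov/METS/'}

class CachedOcrdMets(OcrdMets):
    # proof-of-concept subclass: answer pageId queries from a lazily built dict
    @property
    def fileid_to_pageid(self):
        if not hasattr(self, '_fileid_to_pageid'):
            self._fileid_to_pageid = {}
            for page_div in self._tree.getroot().xpath(
                    '//mets:structMap[@TYPE="PHYSICAL"]//mets:div[@TYPE="page"]',
                    namespaces=NS):
                for fptr in page_div.xpath('mets:fptr', namespaces=NS):
                    self._fileid_to_pageid[fptr.get('FILEID')] = page_div.get('ID')
        return self._fileid_to_pageid

    def pageid_for(self, file_id):
        # dict lookup instead of one structMap XPath per file
        return self.fileid_to_pageid.get(file_id)
```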

MehmedGIT avatar May 02 '22 07:05 MehmedGIT

@MehmedGIT I see, thanks.

bertsky avatar May 02 '22 07:05 bertsky