
Additional parameter for custom resolution

Open M3ssman opened this issue 5 years ago • 6 comments

Hello,

Please add a DPI parameter that enforces a custom resolution when using Tesseract.

Tesseract CLI

--dpi 470

ocrd-tesserocr

dpi: 470
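A minimal sketch of the requested mapping, assuming a processor that shells out to the Tesseract CLI. The helper name `tesseract_command` and its arguments are hypothetical; only the `--dpi` option is taken from the request above.

```python
import shlex


def tesseract_command(image_path, output_base, dpi=None):
    """Build a Tesseract CLI argument list (hypothetical helper).

    --dpi is appended only when an override is given, so the default
    behaviour (resolution taken from the image) stays untouched.
    """
    cmd = ["tesseract", image_path, output_base]
    if dpi is not None:
        if dpi <= 0:
            raise ValueError("dpi must be positive")
        cmd += ["--dpi", str(int(dpi))]
    return cmd


# Show the command line that a parameter value of dpi: 470 would produce.
print(shlex.join(tesseract_command("page.tif", "page", dpi=470)))
```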

M3ssman avatar Dec 05 '19 14:12 M3ssman

I see the need, too. But we already rely on core's OcrdExif info to pass into Tesseract. Shouldn't this override be available to all processors, @kba?

bertsky avatar Dec 05 '19 14:12 bertsky

> Shouldn't this override be available to all processors, @kba?

Yes, but it's not trivial, I'm afraid. My idea would be to allow overriding the pixel density values in the OcrdExif constructor, but that constructor gets called in workspace methods and a model factory function. These would need to accept additional parameters to pass on to the OcrdExif constructor. Not complicated, just a bit convoluted, e.g. even more parameters to Workspace.image_from_page.

I do see the need though, so if the added complexity is fine by you, I'll create a PR in core.
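To illustrate why this gets convoluted, here is a rough sketch with stand-in names. `ImageInfo`, `image_info` and this `image_from_page` are hypothetical and are not core's actual OcrdExif or Workspace API; the fallback of 72 is an arbitrary placeholder. The point is only that every layer between the caller and the metadata constructor has to grow the same extra parameter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageInfo:
    """Hypothetical stand-in for an EXIF-like metadata object (not core's OcrdExif)."""
    x_resolution: int
    y_resolution: int


def image_info(metadata: dict, dpi_override: Optional[int] = None) -> ImageInfo:
    """Build ImageInfo from image metadata, letting an explicit override win."""
    if dpi_override is not None:
        return ImageInfo(dpi_override, dpi_override)
    # Placeholder fallback of 72 when the image carries no density (an assumption).
    x, y = metadata.get("dpi", (72, 72))
    return ImageInfo(int(x), int(y))


def image_from_page(metadata: dict, dpi_override: Optional[int] = None) -> ImageInfo:
    """Hypothetical workspace-level helper: it accepts the extra parameter
    only to hand it down to the metadata constructor."""
    return image_info(metadata, dpi_override=dpi_override)
```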

kba avatar Dec 05 '19 19:12 kba

> Yes, but it's not trivial I'm afraid. My idea would be to allow overriding the pixel density values in the OcrdExif constructor but that gets called in workspace methods and model factory function. These would need to accept additional parameters to pass on to the OcrdExif constructor. Not complicated, just a bit convoluted, e.g. even more parameters to Workspace.image_from_page.

But that only gets called by the processor again, so we are still where we started (adding a parameter for every single tool)!

Perhaps we should start adding other mechanisms that affect all processors equally (like the loglevel override):

  1. How about generic parameters (which are added to the tool json automatically)?
  2. Or extra CLI options (which are supported automatically when using ocrd.decorators)?
  3. Or even environment variables?
  4. Or even site-level configuration files (akin to ocrd_logging.py)?
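As a rough sketch of how such processor-independent mechanisms could be combined, the function below resolves a DPI value with a fixed precedence: CLI option, then environment variable, then site-level configuration, then the image's own metadata. The variable name `OCRD_DPI_OVERRIDE`, the `dpi` config key and the function itself are assumptions made for illustration; none of them exist in core.

```python
import os
from typing import Mapping, Optional, Tuple

ENV_VAR = "OCRD_DPI_OVERRIDE"  # hypothetical environment variable name


def resolve_dpi(
    image_dpi: int,
    cli_dpi: Optional[int] = None,
    environ: Optional[Mapping[str, str]] = None,
    site_config: Optional[Mapping[str, int]] = None,
) -> Tuple[int, str]:
    """Return (dpi, source), taking the first override that is set."""
    environ = os.environ if environ is None else environ
    site_config = {} if site_config is None else site_config
    if cli_dpi is not None:
        return int(cli_dpi), "cli"
    if environ.get(ENV_VAR):
        return int(environ[ENV_VAR]), "env"
    if "dpi" in site_config:
        return int(site_config["dpi"]), "config"
    # No override anywhere: fall back to what the image metadata says.
    return int(image_dpi), "image"
```

Because the lookup lives in one shared place, a processor would call it once instead of declaring its own `dpi` parameter.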

Besides the manual DPI override, this would also allow supporting DPI meta-data validation with different levels of strictness.

Or supporting automatic workspace validation with different levels/sets of checks.

Or supporting processing with --force/--overwrite.

Or supporting processing on multiple CPUs/GPUs with given scalefactor.

Or supporting time constraints on different hierarchy levels.

Just saying!

bertsky avatar Dec 05 '19 22:12 bertsky

Anyway, IMO this issue should be transferred to core or spec, since it involves/affects more people/projects.

bertsky avatar Dec 06 '19 09:12 bertsky

At least for the DPI override, another processor-independent mechanism could be to have a dedicated processor earlier in the pipeline writing /PcGts/Page/@imageXResolution as an override to the image metadata parsed by OcrdExif – a processor like the one proposed here, only with an additional manual override – together with the behavioural changes in core.
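A minimal sketch of that idea, using only the standard library on a toy PAGE document. The function names are hypothetical, the namespace is assumed to be that of the 2019-07-15 PAGE schema, and the unit value `PPI` is likewise an assumption about that schema. One function plays the role of the early processor that writes the override, the other shows a consumer preferring the annotation over the value parsed from the image.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the PAGE 2019-07-15 schema.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
PAGE_TAG = f"{{{PAGE_NS}}}Page"


def set_page_resolution(pcgts_xml: str, dpi: int) -> str:
    """Write a manual DPI override onto /PcGts/Page (hypothetical helper)."""
    root = ET.fromstring(pcgts_xml)
    page = root.find(PAGE_TAG)
    page.set("imageXResolution", str(dpi))
    page.set("imageYResolution", str(dpi))
    page.set("imageResolutionUnit", "PPI")  # assumed unit value
    return ET.tostring(root, encoding="unicode")


def effective_dpi(pcgts_xml: str, exif_dpi: float) -> float:
    """Prefer the PAGE annotation, else fall back to the image metadata."""
    page = ET.fromstring(pcgts_xml).find(PAGE_TAG)
    value = page.get("imageXResolution")
    return float(value) if value is not None else float(exif_dpi)


SAMPLE = (
    f'<PcGts xmlns="{PAGE_NS}">'
    '<Page imageFilename="page.tif" imageWidth="100" imageHeight="100"/>'
    "</PcGts>"
)
```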

@kba can you please transfer the issue?

bertsky avatar Dec 19 '19 08:12 bertsky

> Besides the manual DPI override, this would also allow supporting DPI meta-data validation with different levels of strictness.
>
> Or supporting automatic workspace validation with different levels/sets of checks.
>
> Or supporting processing with --force/--overwrite.
>
> Or supporting processing on multiple CPUs/GPUs with given scalefactor.
>
> Or supporting time constraints on different hierarchy levels.
>
> Just saying!

Or enabling/disabling METS caching (or even METS server).

bertsky avatar Jun 22 '22 12:06 bertsky