core
Additional parameter for custom resolution
Hello,
Please add a DPI parameter so that a custom resolution can be enforced when using Tesseract:
Tesseract CLI
--dpi 470
ocrd-tesserocr
dpi: 470
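For illustration, the Tesseract CLI half of this request can be sketched in Python as a small helper that appends `--dpi` to the command line. The helper name `tesseract_command` and the file names are made up for this example; only the `--dpi` option itself comes from the request above.

```python
from typing import List, Optional


def tesseract_command(image: str, outbase: str, dpi: Optional[int] = None) -> List[str]:
    """Build a tesseract command line, optionally forcing the resolution."""
    cmd = ["tesseract", image, outbase]
    if dpi is not None:
        if dpi <= 0:
            raise ValueError("dpi must be a positive integer")
        # --dpi makes Tesseract use this value instead of the image's own metadata
        cmd += ["--dpi", str(dpi)]
    return cmd


print(tesseract_command("page.tif", "page", dpi=470))
# ['tesseract', 'page.tif', 'page', '--dpi', '470']
```

The resulting list could be handed to `subprocess.run`; leaving `dpi` unset keeps Tesseract's default behaviour.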
I see the need, too. But we already rely on core's OcrdExif info to pass into Tesseract. Shouldn't this override be available to all processors, @kba?
> Shouldn't this override be available to all processors, @kba?
Yes, but it's not trivial I'm afraid. My idea would be to allow overriding the pixel density values in the OcrdExif constructor but that gets called in workspace methods and model factory function. These would need to accept additional parameters to pass on to the OcrdExif constructor. Not complicated, just a bit convoluted, e.g. even more parameters to Workspace.image_from_page.
I do see the need though, so if the added complexity is fine by you, I'll create a PR in core.
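A rough sketch of the plumbing described above, assuming a simplified stand-in class. `ImageInfo`, `from_metadata`, `image_info_for_page` and `dpi_override` are hypothetical names for illustration and do not reflect the actual `OcrdExif` or `Workspace` signatures in core.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ImageInfo:
    """Hypothetical stand-in for an EXIF wrapper; not the real OcrdExif."""
    width: int
    height: int
    x_resolution: int
    y_resolution: int

    @classmethod
    def from_metadata(cls, size: Tuple[int, int], dpi_in_file: Tuple[int, int],
                      dpi_override: Optional[int] = None) -> "ImageInfo":
        # A manual override wins over whatever the file declares.
        if dpi_override is not None:
            xres = yres = dpi_override
        else:
            xres, yres = dpi_in_file
        return cls(size[0], size[1], xres, yres)


def image_info_for_page(size: Tuple[int, int], dpi_in_file: Tuple[int, int],
                        dpi_override: Optional[int] = None) -> ImageInfo:
    """Hypothetical workspace-level helper: it must forward the extra
    parameter, which is the additional plumbing discussed above."""
    return ImageInfo.from_metadata(size, dpi_in_file, dpi_override=dpi_override)


info = image_info_for_page((2480, 3508), (72, 72), dpi_override=470)
print(info.x_resolution, info.y_resolution)  # 470 470
```

Every layer between the caller and the constructor has to grow the same optional argument, which is why the change is simple but spreads across several signatures.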
> Yes, but it's not trivial I'm afraid. My idea would be to allow overriding the pixel density values in the OcrdExif constructor but that gets called in workspace methods and model factory function. These would need to accept additional parameters to pass on to the OcrdExif constructor. Not complicated, just a bit convoluted, e.g. even more parameters to Workspace.image_from_page.
But that only gets called by the processor again, so we are still where we started (adding a parameter for every single tool)!
Perhaps we should start adding other mechanisms that affect all processors equally (like the loglevel override):
- How about generic parameters (which are added to the tool json automatically)?
- Or extra CLI options (which are supported automatically when using ocrd.decorators)?
- Or even environment variables?
- Or even site-level configuration files (akin to ocrd_logging.py)?
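To make the options in this list concrete, here is a minimal sketch of a processor-independent override with a fixed precedence: CLI option first, then environment variable, then image metadata. The function name `resolve_dpi` and the variable name `OCRD_DPI_OVERRIDE` are hypothetical and are not existing OCR-D settings.

```python
import os
from typing import Mapping, Optional


def resolve_dpi(metadata_dpi: Optional[int],
                cli_dpi: Optional[int] = None,
                env: Mapping[str, str] = os.environ,
                env_var: str = "OCRD_DPI_OVERRIDE") -> Optional[int]:
    """Pick the effective DPI: CLI option, then environment, then image metadata."""
    if cli_dpi is not None:
        return cli_dpi
    raw = env.get(env_var)
    if raw:
        value = int(raw)  # raises ValueError on non-numeric input
        if value <= 0:
            raise ValueError(f"{env_var} must be a positive integer")
        return value
    # No override anywhere: fall back to what the image file declares.
    return metadata_dpi


print(resolve_dpi(72, env={"OCRD_DPI_OVERRIDE": "470"}))  # 470
print(resolve_dpi(72, cli_dpi=300, env={"OCRD_DPI_OVERRIDE": "470"}))  # 300
print(resolve_dpi(72, env={}))  # 72
```

If such a lookup lived in one shared place, individual processors would not each need their own `dpi` parameter.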
Besides the manual DPI override, this would also allow supporting DPI meta-data validation with different levels of strictness.
Or supporting automatic workspace validation with different levels/sets of checks.
Or supporting processing with --force/--overwrite.
Or supporting processing on multiple CPUs/GPUs with given scalefactor.
Or supporting time constraints on different hierarchy levels.
Just saying!
Anyway, IMO this issue should be transferred to core or spec, since it involves/affects more people/projects.
At least for the DPI override, another processor-independent mechanism could be to have a dedicated processor earlier in the pipeline writing /PcGts/Page/@imageXResolution as an override to the image metadata parsed by OcrdExif – a processor like the one proposed here, only with an additional manual override – together with the behavioural changes in core.
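As a sketch of what such a dedicated processor might do to the PAGE-XML, the following uses only the standard library to set the resolution attributes on the `Page` element. The function name `set_page_resolution` and the sample document are made up for illustration. Setting `imageYResolution` and `imageResolutionUnit` alongside `imageXResolution`, and using the 2019-07-15 PAGE namespace, are assumptions of this example rather than something stated above.

```python
import xml.etree.ElementTree as ET

# Assumed PAGE namespace version (2019-07-15).
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def set_page_resolution(pcgts_xml: str, dpi: int) -> str:
    """Return the PAGE-XML document with the Page resolution set to `dpi`."""
    ET.register_namespace("pc", PAGE_NS)
    root = ET.fromstring(pcgts_xml)
    page = root.find(f"{{{PAGE_NS}}}Page")
    if page is None:
        raise ValueError("no Page element found")
    # Same value in both directions; unit given as pixels per inch.
    page.set("imageXResolution", str(dpi))
    page.set("imageYResolution", str(dpi))
    page.set("imageResolutionUnit", "PPI")
    return ET.tostring(root, encoding="unicode")


sample = (
    f'<PcGts xmlns="{PAGE_NS}">'
    '<Page imageFilename="page.tif" imageWidth="2480" imageHeight="3508"/>'
    '</PcGts>'
)
print(set_page_resolution(sample, 470))
```

Downstream processors would then still need the behavioural change in core so that the annotated value takes precedence over the image metadata.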
@kba can you please transfer the issue?
> Besides the manual DPI override, this would also allow supporting DPI meta-data validation with different levels of strictness.
> Or supporting automatic workspace validation with different levels/sets of checks.
> Or supporting processing with --force/--overwrite.
> Or supporting processing on multiple CPUs/GPUs with given scalefactor.
> Or supporting time constraints on different hierarchy levels.
> Just saying!
Or enabling/disabling METS caching (or even METS server).