workspace validator: non-URI path
Running ocrd workspace validate, I got:
<report valid="false">
<error>METS has no unique identifier</error>
<error>Validation aborted with exception: Traceback (most recent call last):
File "/data/ocr-d/ocrd_all/venv38/lib/python3.8/site-packages/ocrd_validators/workspace_validator.py", line 149, in _validate
self._validate_mets_files()
File "/data/ocr-d/ocrd_all/venv38/lib/python3.8/site-packages/ocrd_validators/workspace_validator.py", line 302, in _validate_mets_files
scheme = f.url[0:f.url.index(':')]
ValueError: substring not found
</error>
</report>
The URL in question was simply a relative file name. It contains no ":" and therefore no scheme, which makes the URI validator crash.
This is very problematic for two reasons:
- before ocrd differentiated between LOCTYPE=URL and OTHER, we created lots of data (including GT) with URL despite the references being local paths; this is now broken
- in this case, the data had just been created by current ocrd itself, via ocrd workspace add, because that implementation sets both local_filename and url to the local path
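For illustration, the failing expression from the traceback can be reproduced in isolation. This is only a minimal sketch, not the actual fix in ocrd_validators: the helper name get_scheme is hypothetical, and it assumes that a reference without a scheme should be reported as scheme-less rather than abort validation.

```python
from urllib.parse import urlsplit


def get_scheme(url):
    """Return the URI scheme of `url`, or '' if it has none.

    Hypothetical helper: unlike `url[0:url.index(':')]`, this does not
    raise on scheme-less references such as relative file names.
    """
    return urlsplit(url).scheme


relative = "OCR-D-IMG/2812988X_1862-09-02_001.jpg"

# The expression from workspace_validator.py raises on a relative path,
# because str.index() throws ValueError when ':' is absent.
try:
    relative[0:relative.index(':')]
except ValueError as err:
    print("crash reproduced:", err)

# The tolerant variant returns an empty scheme instead of raising.
print(repr(get_scheme(relative)))
print(repr(get_scheme("https://example.org/image.jpg")))
```

A caller could then treat an empty scheme as a validation error to report (or as a local path) instead of letting the exception end the whole run.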
Yeah, this happens with a workspace built with ocrd workspace itself...
add seems to make, for example, this:
<mets:file ID="XXX" MIMETYPE="image/jpeg">
<mets:FLocat xlink:href="OCR-D-IMG/2812988X_1862-09-02_001.jpg" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
<mets:FLocat xlink:href="OCR-D-IMG/2812988X_1862-09-02_001.jpg" LOCTYPE="URL"/>
</mets:file>
Removing the FLocats with LOCTYPE="URL" and making the image filename referenced in the PAGE XML consistent (I imported the XML) fixes the validation at least (in the sense that it doesn't choke on exceptions itself).
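The first part of that workaround could be scripted roughly as follows. This is a hedged sketch using only the Python standard library, not an ocrd API: the function name drop_non_uri_url_flocats is hypothetical, the surrounding fileSec/fileGrp wrapper in the sample is assumed for the sake of a complete document, and the PAGE XML filename still has to be fixed separately.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}
HREF = "{%s}href" % NS["xlink"]


def drop_non_uri_url_flocats(root):
    """Remove mets:FLocat elements with LOCTYPE="URL" whose href has no
    URI scheme. Returns the number of removed elements.
    (Hypothetical helper, not part of ocrd.)"""
    removed = 0
    for file_el in root.iter("{%s}file" % NS["mets"]):
        # Copy the list first, since we remove children while looping.
        for flocat in list(file_el.findall("mets:FLocat", NS)):
            if flocat.get("LOCTYPE") != "URL":
                continue
            if not urlsplit(flocat.get(HREF, "")).scheme:
                file_el.remove(flocat)
                removed += 1
    return removed


# Sample built around the mets:file element shown above; the wrapper
# elements are an assumption.
sample = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec><mets:fileGrp USE="OCR-D-IMG">
    <mets:file ID="XXX" MIMETYPE="image/jpeg">
      <mets:FLocat xlink:href="OCR-D-IMG/2812988X_1862-09-02_001.jpg" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
      <mets:FLocat xlink:href="OCR-D-IMG/2812988X_1862-09-02_001.jpg" LOCTYPE="URL"/>
    </mets:file>
  </mets:fileGrp></mets:fileSec>
</mets:mets>"""

root = ET.fromstring(sample)
print(drop_non_uri_url_flocats(root))
```

On a real workspace one would parse mets.xml from disk and write the result back, ideally after making a backup copy of the file.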
add also created this structMap:
That's also what I witnessed as the primary problem in https://github.com/OCR-D/ocrd_tesserocr/issues/201. We need more diagnostics on why and exactly when this is happening.
But this is a separate issue (has nothing to do with the validator).
True, I'll open another GitHub issue for that, if you didn't already.
No, please do!
Just for the sake of completeness: https://github.com/OCR-D/core/issues/1195