core workspace validator: non-URI path

From a workspace validate I got:

<report valid="false">
  <error>METS has no unique identifier</error>
  <error>Validation aborted with exception: Traceback (most recent call last):
  File "/data/ocr-d/ocrd_all/venv38/lib/python3.8/site-packages/ocrd_validators/workspace_validator.py", line 149, in _validate
    self._validate_mets_files()
  File "/data/ocr-d/ocrd_all/venv38/lib/python3.8/site-packages/ocrd_validators/workspace_validator.py", line 302, in _validate_mets_files
    scheme = f.url[0:f.url.index(':')]
ValueError: substring not found
</error>
</report>

The URL in question was simply a relative file name. It contains no ':', so the scheme extraction via f.url.index(':') raises the ValueError above and the validator crashes.
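To illustrate the failure mode, here is a minimal sketch that reproduces the ValueError from the traceback and shows one possible tolerant alternative. The helper name url_scheme is hypothetical and is not part of ocrd_validators. Using urllib.parse.urlsplit is my assumption about a reasonable approach, not the fix that core adopted.

```python
from urllib.parse import urlsplit


def url_scheme(url: str) -> str:
    """Return the URI scheme of `url`, or '' if there is none.

    Hypothetical helper for illustration only (not part of ocrd_validators).
    Unlike url[0:url.index(':')], urlsplit does not raise when the string
    contains no colon, as is the case for a plain relative path.
    """
    return urlsplit(url).scheme


relative = "OCR-D-IMG/2812988X_1862-09-02_001.jpg"

# The slicing from the traceback fails on a relative path:
try:
    relative[0:relative.index(':')]
except ValueError as exc:
    print("original slicing raised:", exc)

print(repr(url_scheme(relative)))                     # ''
print(repr(url_scheme("https://example.org/a.jpg")))  # 'https'
```

An empty scheme could then be reported as an ordinary validation error instead of aborting the run. Note that urlsplit would treat a Windows drive letter such as "C:" as a scheme, so a real check would need to handle that case separately.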

This is very problematic for two reasons:

Before ocrd differentiated between LOCTYPE=URL and OTHER, we created lots of data (including GT) with LOCTYPE=URL even though the references were local paths. That data is now broken.
In this case, the data was just created by current ocrd itself, via ocrd workspace add, because that implementation sets both local_filename and url to the local path.

Jan 30 '24 16:01 bertsky

Yeah, this happens with a workspace built with ocrd workspace itself...

ocrd workspace add seems to produce, for example, this:

      <mets:file ID="XXX" MIMETYPE="image/jpeg">
        <mets:FLocat xlink:href="OCR-D-IMG/2812988X_1862-09-02_001.jpg" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
        <mets:FLocat xlink:href="OCR-D-IMG/2812988X_1862-09-02_001.jpg" LOCTYPE="URL"/>
      </mets:file>

Feb 28 '24 18:02 mikegerber

Removing the FLocats with LOCTYPE="URL" and making the image filename referenced in the PAGE XML consistent (I imported the XML) fixes the validation at least (in the sense that the validator itself no longer chokes on an exception).
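The first part of that workaround can be sketched with the standard library as follows. This is an assumption-laden illustration, not an ocrd tool: the function name drop_schemeless_url_flocats is hypothetical, the fileSec/fileGrp wrapper in the sample is added only to make the snippet parse, and the sketch is narrower than what is described above because it removes a LOCTYPE="URL" FLocat only when its href has no URI scheme.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
HREF = f"{{{XLINK_NS}}}href"


def drop_schemeless_url_flocats(root):
    """Remove FLocat elements with LOCTYPE="URL" whose href has no scheme.

    Hypothetical helper for illustration. Returns the number of removed
    elements. FLocats with a real URL (e.g. https://...) are kept.
    """
    removed = 0
    for file_el in root.iter(f"{{{METS_NS}}}file"):
        # Copy the child list first, because we remove while looping.
        for flocat in list(file_el.findall(f"{{{METS_NS}}}FLocat")):
            href = flocat.get(HREF, "")
            if flocat.get("LOCTYPE") == "URL" and not urlsplit(href).scheme:
                file_el.remove(flocat)
                removed += 1
    return removed


# Sample built around the mets:file element quoted in this thread; the
# enclosing elements are assumed for the sake of a parseable document.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="XXX" MIMETYPE="image/jpeg">
        <mets:FLocat xlink:href="OCR-D-IMG/2812988X_1862-09-02_001.jpg" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
        <mets:FLocat xlink:href="OCR-D-IMG/2812988X_1862-09-02_001.jpg" LOCTYPE="URL"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""

root = ET.fromstring(SAMPLE)
removed_count = drop_schemeless_url_flocats(root)
remaining = [f.get("LOCTYPE") for f in root.iter(f"{{{METS_NS}}}FLocat")]
print(removed_count, remaining)
```

When trying this on a real workspace, it is safer to run it on a copy of mets.xml and validate the result afterwards.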

Feb 28 '24 18:02 mikegerber

add also created this structMap:

That's also what I witnessed as the main problem in https://github.com/OCR-D/ocrd_tesserocr/issues/201. We need more diagnostics on why and exactly when this happens.

But this is a separate issue (has nothing to do with the validator).

Feb 29 '24 13:02 bertsky

> But this is a separate issue (has nothing to do with the validator).

True, I'll open another GitHub issue for that, if you didn't already.

Mar 01 '24 12:03 mikegerber

> I'll open another GitHub issue for that, if you didn't already.

No, please do!

Mar 01 '24 12:03 bertsky

> No, please do!

Just for the sake of completeness: https://github.com/OCR-D/core/issues/1195

Mar 01 '24 16:03 mikegerber