workspace validator: non-URI path

Open bertsky opened this issue 1 year ago • 7 comments

From a workspace validate I got:

<report valid="false">
  <error>METS has no unique identifier</error>
  <error>Validation aborted with exception: Traceback (most recent call last):
  File "/data/ocr-d/ocrd_all/venv38/lib/python3.8/site-packages/ocrd_validators/workspace_validator.py", line 149, in _validate
    self._validate_mets_files()
  File "/data/ocr-d/ocrd_all/venv38/lib/python3.8/site-packages/ocrd_validators/workspace_validator.py", line 302, in _validate_mets_files
    scheme = f.url[0:f.url.index(':')]
ValueError: substring not found
</error>
</report>

The URL in question was simply a relative file name, which obviously makes the URI validator crash.

This is very problematic for two reasons:

  1. before ocrd differentiated between LOCTYPE=URL and OTHER, we created lots of data (including GT) with URL, despite being local paths – this is now broken
  2. in this case, the data was just created by current ocrd itself – via ocrd workspace add, because that implementation sets both local_filename and url to the local path
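The failing line in the traceback uses `str.index(':')`, which raises `ValueError` whenever the string has no colon. A minimal sketch of a more tolerant scheme lookup, using only the standard library, might look as follows. The function name `url_scheme` is hypothetical and not part of ocrd's API; how the validator should then treat a scheme-less value is left open here.

```python
from urllib.parse import urlsplit


def url_scheme(url):
    """Return the URI scheme of `url`, or None if it has none.

    Hypothetical helper for illustration, not taken from ocrd.
    A relative file name such as "OCR-D-IMG/page.jpg" has no scheme,
    so this returns None instead of raising ValueError.
    """
    scheme = urlsplit(url).scheme
    return scheme or None


# A relative path no longer aborts the check; the caller can decide
# whether a missing scheme is an error, a warning, or acceptable.
relative = url_scheme("OCR-D-IMG/2812988X_1862-09-02_001.jpg")
remote = url_scheme("https://example.org/2812988X_1862-09-02_001.jpg")
```

With this approach the caller can report a regular validation error for a scheme-less URL rather than aborting the whole validation with an exception.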

bertsky, Jan 30 '24 16:01

Yeah, this happens with a workspace built with ocrd workspace itself...

ocrd workspace add seems to produce, for example, this:

      <mets:file ID="XXX" MIMETYPE="image/jpeg">
        <mets:FLocat xlink:href="OCR-D-IMG/2812988X_1862-09-02_001.jpg" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
        <mets:FLocat xlink:href="OCR-D-IMG/2812988X_1862-09-02_001.jpg" LOCTYPE="URL"/>
      </mets:file>

mikegerber, Feb 28 '24 18:02

  1. Removing the FLocats with LOCTYPE="URL"
  2. and making the image filename referenced in the PAGE XML consistent (I imported the XML)

fixes the validation at least (in the sense that it doesn't choke on exceptions itself).
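The first step of this workaround can be sketched with the standard library. This is an illustration under assumptions, not an ocrd tool: the function name `drop_schemeless_url_flocats` is hypothetical, and unlike the manual fix described above it only removes LOCTYPE="URL" entries whose href has no URI scheme, leaving real URLs in place.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Sample modelled on the mets:file entry quoted above.
SAMPLE = """<mets:fileGrp xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink" USE="OCR-D-IMG">
  <mets:file ID="XXX" MIMETYPE="image/jpeg">
    <mets:FLocat xlink:href="OCR-D-IMG/2812988X_1862-09-02_001.jpg" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
    <mets:FLocat xlink:href="OCR-D-IMG/2812988X_1862-09-02_001.jpg" LOCTYPE="URL"/>
  </mets:file>
</mets:fileGrp>"""


def drop_schemeless_url_flocats(root):
    """Remove FLocat elements with LOCTYPE="URL" whose href lacks a scheme.

    Hypothetical helper; returns the number of removed elements.
    """
    removed = 0
    for file_el in root.iter(f"{{{METS_NS}}}file"):
        # findall() returns a list, so removing while looping is safe.
        for flocat in file_el.findall(f"{{{METS_NS}}}FLocat"):
            if flocat.get("LOCTYPE") != "URL":
                continue
            href = flocat.get(f"{{{XLINK_NS}}}href", "")
            if not urlsplit(href).scheme:
                file_el.remove(flocat)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed = drop_schemeless_url_flocats(root)
```

Work on a copy of the METS file when trying something like this, since the second step (making the image filename in the PAGE XML consistent) still has to be done separately.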

mikegerber, Feb 28 '24 18:02

> add also created this structMap:

That's also what I witnessed as the prime problem in https://github.com/OCR-D/ocrd_tesserocr/issues/201. We need more diagnostics on why and exactly when this is happening.

But this is a separate issue (has nothing to do with the validator).

bertsky, Feb 29 '24 13:02

> But this is a separate issue (has nothing to do with the validator).

True, I'll open another GitHub issue for that, if you didn't already.

mikegerber, Mar 01 '24 12:03

> I'll open another GitHub issue for that, if you didn't already.

No, please do!

bertsky, Mar 01 '24 12:03

> No, please do!

Just for the sake of completeness: https://github.com/OCR-D/core/issues/1195

mikegerber, Mar 01 '24 16:03