workspace.download_file: do not change basename

Open bertsky opened this issue 3 years ago • 1 comments
I think this is caused by a change in assets: https://github.com/OCR-D/assets/commit/b12e5ebc12450bd70e9ec7a9d7afeb48f6201773, which was supposed to fix https://github.com/OCR-D/assets/issues/87, but does not work. Here is a debug log of what actually happens when copying the workspace to a temporary location:

DEBUG    ocrd.resolver.workspace_from_url:resolver.py:164 workspace_from_url
mets_basename='mets.xml'
mets_url='/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml'
src_baseurl='/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data'
dst_dir='/tmp/test-ocrd-calamari'
DEBUG    ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
DEBUG    ocrd.resolver.download_to_directory:resolver.py:99 Copying file '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml' to '/tmp/test-ocrd-calamari/mets.xml'
DEBUG    ocrd.workspace.download_file:workspace.py:142 download_file <OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=OCR-D-IMG/INPUT_0017.tif, local_filename=OCR-D-IMG/INPUT_0017.tif]/>  [_recursion_count=0]
DEBUG    ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|OCR-D-IMG/INPUT_0017.tif| basename=|OCR-D-IMG_0001.tif| if_exists=|skip| subdir=|OCR-D-IMG|
DEBUG    ocrd.workspace.download_file:workspace.py:158 First run of resolver.download_to_directory(OCR-D-IMG/INPUT_0017.tif) failed, try prepending baseurl '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data': File path passed as 'url' to download_to_directory does not exist: OCR-D-IMG/INPUT_0017.tif
DEBUG    ocrd.workspace.download_file:workspace.py:142 download_file <OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif, local_filename=OCR-D-IMG/INPUT_0017.tif]/>  [_recursion_count=1]
DEBUG    ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif| basename=|OCR-D-IMG_0001.tif| if_exists=|skip| subdir=|OCR-D-IMG|
DEBUG    ocrd.resolver.download_to_directory:resolver.py:99 Copying file '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif' to '/tmp/test-ocrd-calamari/OCR-D-IMG/OCR-D-IMG_0001.tif'

So, essentially, Resolver.workspace_from_url undoes the non-standard path names when downloading, and subsequently the @imageFilename reference does not work (again).

@kba I suppose we could fix this in assets by using standard basenames, but it looks more like a bug in core to me.

Originally posted by @bertsky in https://github.com/OCR-D/ocrd_calamari/issues/73#issuecomment-1049788977

bertsky, Feb 24 '22 12:02

IOW, when you have a partial clone of a local workspace, and you attempt to download some of its files, the following happens:

  1. chdir to the clone's Workspace.directory (the only reference to the original workspace is in Workspace.baseurl now)
  2. resolving the relative local URL fails
  3. "downloading" it fails
  4. a recursive attempt is started with the absolute local URL (from baseurl + url)
  5. chdir to the same directory again
  6. resolving the absolute local URL fails
  7. downloading it into ID+ext succeeds ← this changes the relative local URL though
  8. further down the line, for Workspace.resolve_image_exif or Workspace.image_from_page, via a PAGE-XML's @imageFilename the old relative local URL is requested
  9. it cannot be found

My feeling is that step 7 is wrong – we should at least keep the old relative URL.
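To make the difference concrete, here is a minimal, self-contained sketch of the two naming strategies. It does not use the ocrd API: `copy_into_clone` and `demo` are hypothetical helpers, the subdirectory is simply taken from the relative URL, and the paths and ID are borrowed from the debug log above. It only illustrates why renaming to ID+ext breaks a later lookup by the old relative URL, while keeping the basename does not.

```python
import shutil
import tempfile
from pathlib import Path


def copy_into_clone(src_dir, dst_dir, rel_url, file_id, keep_basename):
    """Hypothetical helper (NOT the ocrd API): copy one workspace file into a clone.

    Returns the path of the copied file relative to the clone directory.
    """
    src = Path(src_dir) / rel_url
    subdir = Path(rel_url).parent  # assumption: subdir mirrors the relative URL
    if keep_basename:
        basename = src.name  # proposed behaviour: keep e.g. INPUT_0017.tif
    else:
        basename = file_id + src.suffix  # behaviour seen in the log: ID + extension
    dst = Path(dst_dir) / subdir / basename
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, dst)
    return (subdir / basename).as_posix()


def demo():
    """Clone one file with both strategies and check the old relative URL."""
    rel_url = "OCR-D-IMG/INPUT_0017.tif"  # what @imageFilename refers to
    file_id = "OCR-D-IMG_0001"
    results = {}
    with tempfile.TemporaryDirectory() as src_dir:
        src = Path(src_dir) / rel_url
        src.parent.mkdir(parents=True)
        src.write_bytes(b"not a real TIFF")  # placeholder content
        for keep_basename in (False, True):
            with tempfile.TemporaryDirectory() as dst_dir:
                new_rel = copy_into_clone(src_dir, dst_dir, rel_url, file_id, keep_basename)
                # Does the old relative URL still resolve inside the clone?
                still_resolves = (Path(dst_dir) / rel_url).exists()
                results[keep_basename] = (new_rel, still_resolves)
    return results


if __name__ == "__main__":
    for keep_basename, (new_rel, ok) in demo().items():
        print(f"keep_basename={keep_basename}: stored as {new_rel}, old URL resolves: {ok}")
```

With `keep_basename=False` the file lands at `OCR-D-IMG/OCR-D-IMG_0001.tif`, so a lookup via the old relative URL fails; with `keep_basename=True` the relative URL is unchanged and the lookup succeeds. How this would map onto `Workspace.download_file` and `Resolver.download_to_directory` in core is left open here.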

But what if some PAGE files in the workspace to be cloned even contain remote references for @imageFilename?

bertsky, Feb 24 '22 12:02