
workspace clone: always copy local-only file paths

Open bertsky opened this issue 2 years ago • 3 comments

When you run ocrd workspace clone /some/path/to/mets.xml (without the indiscriminate download option) on a workspace which contains local files, the following happens:

  1. a mets:file with remote FLocat will still keep its (now defunct) local FLocat
  2. a mets:file with only local path FLocat will not be copied

IMO, workspace clone from a relative path should either always copy all local files, or at least copy the ones in 2 (and remove the local refs in 1).
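
To make the second option concrete, here is a rough sketch of the copy policy. It deliberately does not use the ocrd API: `FileEntry` and `clone_local_files` are hypothetical names, and each entry is assumed to carry at most one remote URL and one local path relative to the METS directory.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FileEntry:
    """Hypothetical stand-in for a mets:file with up to two FLocat entries."""
    file_id: str
    url: Optional[str] = None         # remote FLocat, if any
    local_path: Optional[str] = None  # local FLocat, relative to the METS directory


def clone_local_files(entries, src_dir, dst_dir, copy_all=False):
    """Copy local files from src_dir to dst_dir.

    - local-only entries are always copied (case 2)
    - entries that also have a remote URL are copied only if copy_all is set;
      otherwise their local reference is dropped, since it would be
      defunct in the clone (case 1)
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    for entry in entries:
        if entry.local_path is None:
            continue  # remote-only: nothing to copy
        if entry.url is None or copy_all:
            source = src_dir / entry.local_path
            target = dst_dir / entry.local_path
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target)
        else:
            entry.local_path = None
    return entries
```

With `copy_all=True` this corresponds to the first option (always copy all local files).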

Copying of the content files itself could also attempt to do CoW (zero-cost) copies, in case the filesystem permits it.
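
A minimal sketch of what such a CoW copy could look like, assuming Linux and the `FICLONE` ioctl (supported on e.g. Btrfs and XFS with reflinks). `copy_cow` is a hypothetical helper, not something that exists in core, and it falls back to a regular copy wherever cloning is not possible:

```python
import shutil

try:
    import fcntl  # only available on Unix-like systems
except ImportError:
    fcntl = None

# Linux ioctl request number for FICLONE (see <linux/fs.h>)
FICLONE = 0x40049409


def copy_cow(src, dst):
    """Copy src to dst, sharing data blocks if the filesystem permits it.

    Returns True if a reflink (CoW) copy was made, False if a regular
    copy was used instead.
    """
    if fcntl is not None:
        try:
            with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
                # ask the kernel to let dst share all data blocks of src
                fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            shutil.copystat(src, dst)
            return True
        except OSError:
            # unsupported filesystem, cross-device copy, non-Linux kernel...
            pass
    # regular copy; overwrites the empty dst left by a failed attempt
    shutil.copy2(src, dst)
    return False
```

Either way the destination ends up with the same content, so callers do not need to care which path was taken.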

bertsky, Dec 11 '23 16:12

Also:

When you run ocrd workspace clone --download /some/path/to/mets.xml (with the download option) on a workspace which contains local files, the following happens:

  1. a mets:file with only local path FLocat will get an additional remote FLocat with an absolute path (combining the baseurl prefix with the relative path).

bertsky, May 24 '24 00:05

@kba this is a severe problem IMO.

bertsky, May 24 '24 10:05

Another example of this (trying to get ocrd_tesserocr tests to work on v3):

    @fixture
    def workspace_kant_binarized(tmpdir):
        initLogging()
        with pushd_popd(tmpdir):
>           yield Resolver().workspace_from_url(METS_KANT_BINARIZED, dst_dir=tmpdir, download=True)

test/conftest.py:15: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../core/src/ocrd/resolver.py:229: in workspace_from_url
    workspace.download_file(f)
../core/src/ocrd/workspace.py:222: in download_file
    f.local_filename = self.resolver.download_to_directory(self.directory, f.url, subdir=f.fileGrp, basename=basename)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
E               FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: 'OCR-D-GT-WORD/INPUT_0017.xml

Because METS_KANT_BINARIZED is only a local workspace to "download" from, the baseurl mechanism does not work: by the time the download is attempted, the information about the original absolute path is already lost.
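
One conceivable way around this is to resolve relative locations against the directory of the source METS while that directory is still known, i.e. before anything is interpreted relative to the destination. A rough sketch, where `resolve_source` is a hypothetical helper and not part of ocrd:

```python
from pathlib import Path
from urllib.parse import urlparse


def resolve_source(location, src_mets):
    """Return an absolute path or URL for a file location found in src_mets.

    Remote http(s) URLs are returned unchanged; relative local paths are
    anchored at the directory that contains the source METS.
    """
    if urlparse(location).scheme in ("http", "https"):
        return location
    path = Path(location)
    if not path.is_absolute():
        path = Path(src_mets).resolve().parent / path
    return str(path)
```

With this, a relative location like the one in the traceback would become an absolute path under the source workspace, which can then be copied regardless of the current working directory.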

bertsky, Aug 25 '24 01:08