core icon indicating copy to clipboard operation
core copied to clipboard

ocrd workspace: file paths should be made relative to self.directory

Open bertsky opened this issue 1 year ago • 3 comments

On the CLI, the user expects that paths are resolved once (on the input side) and from then onwards everything is automatically done relative to the workspace (whenever possible). So if a file path is absolute, then its workspace prefix gets removed regarding METS FLocat. Or if it is relative, then it is relative to the CWD, but gets resolved relative to the workspace.

Unfortunately, that's not what we have implemented yet, esp. if you combine with an overall -d path/to/workspace (so CWD is not equal workspace directory).

For example, in add_file, we currently have: https://github.com/OCR-D/core/blob/299301003688c14a2824d865804f3a69480cdbbd/src/ocrd/cli/workspace.py#L221-L222

Meaning, the workspace directory gets added, with no regard to the caller's CWD.

Another example is in bulk_add, where we do: https://github.com/OCR-D/core/blob/299301003688c14a2824d865804f3a69480cdbbd/src/ocrd/cli/workspace.py#L375-L381

Here, again, no difference is made between CWD and self.directory.

In the latter case (bulk-add), what makes things worse is that the glob pattern gets treated verbatim if empty: https://github.com/OCR-D/core/blob/299301003688c14a2824d865804f3a69480cdbbd/src/ocrd/cli/workspace.py#L333-L335

So if one tries to compensate for the above path resolving issue by passing a glob relative to the workspace, then that itself will break the METS with a verbatim asterisk path entry:

ocrd workspace -d nd1967-12-14_var1/ bulk-add -r "TEXTRACT_PAGE/(?P<page>.*)[.]xml" -G TEXTRACT_PAGE -g '{{ page }}' "TEXTRACT_PAGE/*.xml"
INFO ocrd.cli.workspace.bulk-add - [   1/1] TEXTRACT_PAGE/*.xml

bertsky avatar Apr 17 '24 15:04 bertsky

Same is true for the --mets-server-url argument: relative paths don't work (not relative to the workspace and not relative to CWD).

(Again, the user expectation is that whatever resolves at startup time from the user's CWD should be kept.)

bertsky avatar Sep 02 '24 10:09 bertsky

Same is true for the --mets-server-url argument: relative paths don't work (not relative to the workspace and not relative to CWD).

(Again, the user expectation is that whatever resolves at startup time from the user's CWD should be kept.)

This also just bit me in the Processing Server context: passing a relative path to the --mets argument of a /processor/run/... post yields a HTTP 500, caused by

File "/build/core/src/ocrd_network/processing_server.py", line 474, in validate_and_forward_job_to_network_agent
  mets_server_url = self.deployer.start_uds_mets_server(ws_dir_path=ws_dir_path)
File "/build/core/src/ocrd_network/runtime_data/deployer.py", line 157, in start_uds_mets_server
  pid = OcrdMetsServer.create_process(mets_server_url=str(mets_server_url), ws_dir_path=str(ws_dir_path), log_file=str(log_file))
File "/build/core/src/ocrd/mets_server.py", line 427, in create_process
  raise RuntimeError(f"Mets server starting failed. See {log_file} for errors")
RuntimeError: Mets server starting failed. See /data/logs/mets_servers/Sounding_Spirit_sample_pages.log for errors

And that log in turns says:

  File "/build/core/src/ocrd/cli/workspace.py", line 924, in workspace_serve_start
    workspace=Workspace(ctx.resolver, directory=ctx.directory, mets_basename=ctx.mets_basename),
  File "/build/core/src/ocrd/workspace.py", line 102, in __init__
    mets = OcrdMets(filename=self.mets_target)
  File "/build/core/src/ocrd_models/ocrd_mets.py", line 78, in __init__
    super().__init__(**kwargs)
  File "/build/core/src/ocrd_models/ocrd_xml_base.py", line 35, in __init__
    raise Exception('File does not exist: %s' % filename)
Exception: File does not exist: /data/Sounding-Spirit-sample-pages/Sounding-Spirit-sample-pages/mets.xml

So clearly, the directory name was added twice.

The call chain was: https://github.com/OCR-D/core/blob/8ac528f948d0888653117c07d2f5fdfb29769cad/src/ocrd_network/processing_server.py#L473-L474

https://github.com/OCR-D/core/blob/8ac528f948d0888653117c07d2f5fdfb29769cad/src/ocrd_network/runtime_data/deployer.py#L145-L157

https://github.com/OCR-D/core/blob/8ac528f948d0888653117c07d2f5fdfb29769cad/src/ocrd/mets_server.py#L418-L420

Pinging @kba @MehmedGIT because this severely affects deployments. (The only workaround would be using absolute path names.)

bertsky avatar Jul 04 '25 00:07 bertsky

Same is true for the --mets-server-url argument: relative paths don't work (not relative to the workspace and not relative to CWD).

Ideally, the internal low-level methods should still be absolute path names to avoid passing the workspace context around and keep testing of separate methods simple. I am unsure how difficult it will be to include both relative and absolute paths on the CLI level for the Processing Server and/or the Workspace methods. However, I am open to discussions to try to find a solution.

MehmedGIT avatar Jul 08 '25 09:07 MehmedGIT