core
core copied to clipboard
fix make_file_id
First we need to have a test for --overwrite that fails…
If I grok this correctly, the problem is that
make_file_idis not aware of theoverwrite_modeof the workspace and should behave differently. It should not try to avoid clashes because that would lead to inconsistent state, i.e. more files with different IDs instead of overwriting existing files.
Yes. But also, conversely, running against an existing grp without --overwrite should produce clashes when it involves the existing page IDs.
(Of course, that's nothing we can ensure at the level of make_file_id anymore – see my mirror review
One solution could be to add a new kwarg
overwriteto themake_file_idmethod. Not ideal because that requires all the procesors to be adapted and can lead to easy-to-miss inconsistencies across processors.
Yes. Possible but not a good design.
Since
overwriteis a state of theOcrdWorkspaceclass and there is no link back fromOcrdMetsto the workspace it represents, we cannot goOcrdFile->OcrdMets->OcrdWorkspace- but maybe this should be possible. We could add a reference from OcrdMets to OcrdWorkspace in the latter's__init__.Then we could check
OcrdFile.mets.workspace.overwrite_modeand skip the loop to create unique IDs.
Yes, but should OcrdMets really depend on OcrdWorkspace?
My best idea for the moment still is:
But we probably also have to enforce this outside of
make_file_id, probably inWorkspace.add_file: When adding a PAGE-XML file for a pageId that already exists in that fileGrp,
- if
overwrite, then re-use that fileId- otherwise raise an exception
But we probably also have to enforce this outside of
make_file_id, probably inWorkspace.add_file: When adding a PAGE-XML file for a pageId that already exists in that fileGrp,
- if
overwrite, then re-use that fileId- otherwise raise an exception
When OcrdWorkspace.add_file is called in a workspace in overwrite_mode, it will set the force kwarg when doing the call to OcrdMets.add_file. And that in turn checks whether a file with the same ID already exists and if it does, changes it to match the info on the file to add.
IIUC you would replace
mets_file = next(self.find_files(ID=ID), None)
with
mets_file = next(self.find_files(fileGrp=fileGrp, pageId=pageId)
and if that is found and force is True, then change the found file accordingly but retain its old ID.
IIUC you would replace
mets_file = next(self.find_files(ID=ID), None)with
mets_file = next(self.find_files(fileGrp=fileGrp, pageId=pageId)and if that is found and
forceisTrue, then change the found file accordingly but retain its old ID.
No. Like I said, keep OcrdMets.add_file unchanged, but before delegating to it, in Workspace.add_file, do:
if kwargs['mimetype'] == MIMETYPE_PAGE:
# only one PAGE per page
file_ = next(self.mets.find_files(fileGrp=file_grp, pageId=kwargs['page_id'], mimetype=kwargs['mimetype']), None)
if file_:
kwargs['ID'] = file_.ID # let overwrite_mode decide what to do (re-use / raise)
ret = self.mets.add_file(file_grp, **kwargs)
I'm not sure about file_.local_filename though: OcrdMets.add_file will force re-using the existing file path, but (unless we provide the content= anyway and don't care about the exact filename afterwards) we want our new file path to replace it.
Can we get this moving @kba? #825 is still quite a blow you know...
You said re-review, which I did, but now you already merged without my suggestions…
You said re-review, which I did, but now you already merged without my suggestions…
I'll amend.
You said re-review, which I did, but now you already merged without my suggestions…
I'll amend.
Done in https://github.com/OCR-D/core/tree/861-re-review. Many thanks for the documentation fixes.