core icon indicating copy to clipboard operation
core copied to clipboard

fix make_file_id

Open bertsky opened this issue 3 years ago • 4 comments
trafficstars

First we need to have a test for --overwrite that fails…

bertsky avatar May 13 '22 20:05 bertsky

If I grok this correctly, the problem is that make_file_id is not aware of the overwrite_mode of the workspace and should behave differently. It should not try to avoid clashes because that would lead to inconsistent state, i.e. more files with different IDs instead of overwriting existing files.

Yes. But also, conversely, running against an existing grp without --overwrite should produce clashes when it involves the existing page IDs.

(Of course, that's nothing we can ensure at the level of make_file_id anymore – see my mirror review

One solution could be to add a new kwarg overwrite to the make_file_id method. Not ideal because that requires all the procesors to be adapted and can lead to easy-to-miss inconsistencies across processors.

Yes. Possible but not a good design.

Since overwrite is a state of the OcrdWorkspace class and there is no link back from OcrdMets to the workspace it represents, we cannot go OcrdFile -> OcrdMets -> OcrdWorkspace - but maybe this should be possible. We could add a reference from OcrdMets to OcrdWorkspace in the latter's __init__.

Then we could check OcrdFile.mets.workspace.overwrite_mode and skip the loop to create unique IDs.

Yes, but should OcrdMets really depend on OcrdWorkspace?

My best idea for the moment still is:

But we probably also have to enforce this outside of make_file_id, probably in Workspace.add_file: When adding a PAGE-XML file for a pageId that already exists in that fileGrp,

  • if overwrite, then re-use that fileId
  • otherwise raise an exception

bertsky avatar May 17 '22 13:05 bertsky

But we probably also have to enforce this outside of make_file_id, probably in Workspace.add_file: When adding a PAGE-XML file for a pageId that already exists in that fileGrp,

  • if overwrite, then re-use that fileId
  • otherwise raise an exception

When OcrdWorkspace.add_file is called in a workspace in overwrite_mode, it will set the force kwarg when doing the call to OcrdMets.add_file. And that in turn checks whether a file with the same ID already exists and if it does, changes it to match the info on the file to add.

IIUC you would replace

mets_file = next(self.find_files(ID=ID), None)

with

mets_file = next(self.find_files(fileGrp=fileGrp, pageId=pageId)

and if that is found and force is True, then change the found file accordingly but retain its old ID.

kba avatar May 18 '22 15:05 kba

IIUC you would replace

mets_file = next(self.find_files(ID=ID), None)

with

mets_file = next(self.find_files(fileGrp=fileGrp, pageId=pageId)

and if that is found and force is True, then change the found file accordingly but retain its old ID.

No. Like I said, keep OcrdMets.add_file unchanged, but before delegating to it, in Workspace.add_file, do:

    if kwargs['mimetype'] == MIMETYPE_PAGE:
        # only one PAGE per page
        file_ = next(self.mets.find_files(fileGrp=file_grp, pageId=kwargs['page_id'], mimetype=kwargs['mimetype']), None)
        if file_:
            kwargs['ID'] = file_.ID # let overwrite_mode decide what to do (re-use / raise)
    ret = self.mets.add_file(file_grp, **kwargs)

I'm not sure about file_.local_filename though: OcrdMets.add_file will force re-using the existing file path, but (unless we provide the content= anyway and don't care about the exact filename afterwards) we want our new file path to replace it.

bertsky avatar May 18 '22 18:05 bertsky

Can we get this moving @kba? #825 is still quite a blow you know...

bertsky avatar Jun 10 '22 10:06 bertsky

You said re-review, which I did, but now you already merged without my suggestions…

bertsky avatar Oct 21 '22 11:10 bertsky

You said re-review, which I did, but now you already merged without my suggestions…

I'll amend.

kba avatar Oct 21 '22 12:10 kba

You said re-review, which I did, but now you already merged without my suggestions…

I'll amend.

Done in https://github.com/OCR-D/core/tree/861-re-review. Many thanks for the documentation fixes.

kba avatar Oct 21 '22 12:10 kba