
make_file_id: not correct for --overwrite

Open bertsky opened this issue 3 years ago • 2 comments

If neither the input fileGrp nor the page ID is directly contained in the file ID of the input file, then make_file_id determines the index of that file in the input fileGrp, and calculates a new ID based on that index and the output fileGrp. But if that ID already exists, the index is incremented until a free one becomes available.

Alas,

  1. the list returned by OcrdMets.find_files is not sorted by page ID (but by mets:file element order), so that index may deviate from the page ID.
  2. the increment strategy is wrong in combination with --overwrite, because it will create multiple files for the same page ID – only the first one of which will be considered by follow-up processors (so nothing is actually overwritten; and the new files will be ignored entirely)
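To illustrate, here is a minimal sketch of the index-and-increment fallback as described above. It does not use the actual OCR-D code: `FileEntry`, `index_based_id` and the `OUT_0001` ID pattern are hypothetical stand-ins, and only the branch where the file ID contains neither the fileGrp nor the page ID is modelled.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    """Hypothetical stand-in for a mets:file entry (not the OCR-D class)."""
    file_id: str
    page_id: str


def index_based_id(input_file, input_grp_files, output_grp, existing_ids):
    """Derive an output ID from the file's list position, skipping taken IDs."""
    # 1-based position in the order the files are listed (not page order)
    n = [f.file_id for f in input_grp_files].index(input_file.file_id) + 1
    candidate = f"{output_grp}_{n:04d}"
    # increment until the candidate is unused
    while candidate in existing_ids:
        n += 1
        candidate = f"{output_grp}_{n:04d}"
    return candidate


# Input fileGrp listed out of page order; the file IDs contain neither
# the fileGrp name nor the page ID.
inputs = [FileEntry("scan_b", "PHYS_0002"), FileEntry("scan_a", "PHYS_0001")]

# First run: the output fileGrp is still empty.
first_run = [index_based_id(f, inputs, "OUT", set()) for f in inputs]

# Second run (as with --overwrite): the IDs from the first run already exist.
taken = set(first_run)
second_run = []
for f in inputs:
    new_id = index_based_id(f, inputs, "OUT", taken)
    taken.add(new_id)
    second_run.append(new_id)
```

In this sketch the first run assigns `OUT_0001` to page `PHYS_0002` (issue 1), and the second run produces `OUT_0003` and `OUT_0004` instead of reusing the first run's IDs, so each page ends up with two output files (issue 2).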

To address both issues, I suggest calculating the output ID based on the (output fileGrp and) page ID of the input file.
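A rough sketch of what I mean, under assumptions: `page_based_id` is a hypothetical helper, not an existing OCR-D function, and the `<fileGrp>_<pageId>` pattern is only one possible way to combine the two. A plain dict stands in for the output fileGrp.

```python
def page_based_id(output_grp, page_id):
    """Hypothetical helper: build the output ID from output fileGrp and page ID."""
    return f"{output_grp}_{page_id}"


# Page IDs in mets:file element order, which need not be page order.
pages = ["PHYS_0002", "PHYS_0001"]

# Stand-in for the output fileGrp: file ID -> content.
output_files = {}

# Two processor runs over the same pages; the second models --overwrite.
for run in ("first", "second"):
    for page_id in pages:
        output_files[page_based_id("OUT", page_id)] = f"{run} result for {page_id}"
```

Because the ID depends only on the output fileGrp and the page ID, it is independent of list order, and a second run arrives at the same ID and can replace the existing file instead of adding another one for the same page.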

bertsky avatar Mar 22 '22 22:03 bertsky

@kba this is a very nasty bug that prevents --overwrite for me in a lot of cases (and makes repairing the METS afterwards very hard). RFC

bertsky avatar May 13 '22 14:05 bertsky

> RFC

Sorry, this one got lost in the shuffle.

> To address both issues, I suggest calculating the output ID based on the (output fileGrp and) page ID of the input file.

Agreed in general, though we need a fallback for the case that a file has no pageId. That should not happen in real-life data, but a pageId is not strictly required.
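One conceivable shape for such a fallback, purely as an assumption for discussion: when there is no pageId, derive the output ID from the input file ID, which is at least unique within the METS. The function name `output_id` and the ID pattern are hypothetical.

```python
from typing import Optional


def output_id(output_grp: str, input_id: str, page_id: Optional[str]) -> str:
    """Hypothetical sketch: prefer the page ID, fall back to the input file ID."""
    if page_id:
        return f"{output_grp}_{page_id}"
    # Assumed fallback: the input file ID is unique, so the result stays
    # stable across repeated runs even without a pageId.
    return f"{output_grp}_{input_id}"


with_page = output_id("OUT", "scan_a", "PHYS_0001")
without_page = output_id("OUT", "scan_a", None)
```

Either way the result is deterministic for a given input file, so a repeated run with --overwrite would target the same ID.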

I'll prepare a PR.

kba avatar May 13 '22 16:05 kba

Was hopefully finally fixed in #861 and released in 2.39.0.

kba avatar Oct 25 '22 14:10 kba