make_file_id: not correct for --overwrite
If neither the input fileGrp nor the page ID is directly contained in the file ID of the input file, then make_file_id determines the index of that file in the input fileGrp, and calculates a new ID based on that index and the output fileGrp. But if that ID already exists, the index is incremented until a free one becomes available.
Alas,
- the list returned by OcrdMets.find_files is not sorted by page ID (but by mets:file element order), so that index may deviate from the page ID.
- the increment strategy is wrong in combination with --overwrite, because it will create multiple files for the same page ID – only the first one of which will be considered by follow-up processors (so nothing is actually overwritten, and the new files will be ignored entirely)
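To make the second point concrete, here is a minimal sketch of the index-plus-increment behaviour described above. It is not the actual make_file_id code: the function names, the `GRP_0001` ID format, and the sample file IDs are all assumptions for illustration.

```python
def index_based_id(input_id, input_ids, output_grp, existing_ids):
    """Hypothetical stand-in for the index-based fallback, not the real make_file_id."""
    # Position in the fileGrp listing (mets:file order, not page order), 1-based.
    n = input_ids.index(input_id) + 1
    candidate = "%s_%04d" % (output_grp, n)
    # Increment until the ID is free, even if the taken ID belongs to the same page.
    while candidate in existing_ids:
        n += 1
        candidate = "%s_%04d" % (output_grp, n)
    return candidate


def run(input_ids, output_grp, existing_ids):
    """Simulate one processor run; every new ID is recorded as existing."""
    new_ids = []
    for input_id in input_ids:
        file_id = index_based_id(input_id, input_ids, output_grp, existing_ids)
        existing_ids.add(file_id)
        new_ids.append(file_id)
    return new_ids


inputs = ["img_a", "img_b"]  # IDs containing neither the fileGrp nor the page ID
existing = set()
first_run = run(inputs, "OCR-D-BIN", existing)
second_run = run(inputs, "OCR-D-BIN", existing)  # rerun, as with --overwrite
print(first_run)   # ['OCR-D-BIN_0001', 'OCR-D-BIN_0002']
print(second_run)  # ['OCR-D-BIN_0003', 'OCR-D-BIN_0004']
```

The rerun produces fresh IDs instead of reusing the old ones, so each page ends up with two files in the output fileGrp and nothing is replaced.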
To address both issues, I suggest calculating the output ID based on the (output fileGrp and) page ID of the input file.
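Roughly what I have in mind, as a sketch only: the function name and the exact ID format are hypothetical, and the real implementation may well differ.

```python
def page_based_id(output_grp, page_id):
    """Hypothetical helper: derive the output file ID from fileGrp and page ID only."""
    # No index and no increment: the same page always maps to the same ID.
    return "%s_%s" % (output_grp, page_id)


first = page_based_id("OCR-D-BIN", "PHYS_0001")
again = page_based_id("OCR-D-BIN", "PHYS_0001")  # rerun with --overwrite
print(first)           # OCR-D-BIN_PHYS_0001
print(first == again)  # True
```

Because the ID is deterministic, a rerun targets the existing file for that page rather than creating a second one, and mets:file order no longer matters.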
@kba this is a very nasty bug that prevents --overwrite for me in a lot of cases (and makes repairing the METS afterwards very hard). RFC
Sry, this one got lost in the shuffle.
> To address both issues, I suggest calculating the output ID based on the (output fileGrp and) page ID of the input file.
Agreed in general, though we need a fallback for the case that a file has no pageId. That should not happen in real-life data, but a pageId is not strictly required.
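One possible shape for that fallback, purely as a sketch: the function name, the ID formats, and the choice to fall back to the file's position in the input fileGrp are assumptions, not a description of what the PR will do.

```python
def output_file_id(output_grp, page_id, input_id, input_ids):
    """Hypothetical helper: prefer the page ID, fall back to the input file's index."""
    if page_id:
        # Normal case: deterministic ID per page, so reruns hit the same file.
        return "%s_%s" % (output_grp, page_id)
    # Fallback for files without a pageId: 1-based position in the input fileGrp.
    return "%s_%04d" % (output_grp, input_ids.index(input_id) + 1)


inputs = ["img_a", "img_b"]
print(output_file_id("OCR-D-BIN", "PHYS_0002", "img_b", inputs))  # OCR-D-BIN_PHYS_0002
print(output_file_id("OCR-D-BIN", None, "img_b", inputs))         # OCR-D-BIN_0002
```

The fallback stays free of the increment loop, so it is still stable across reruns as long as the input fileGrp itself does not change.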
I'll prepare a PR.
Was hopefully finally fixed in #861 and released in 2.39.0.