core ocrd workspace bulk-add: auto fileid should include filegrp

Open bertsky opened this issue 3 years ago • 0 comments

The normal convention for new file IDs in OCR-D is (due to the make_file_id implementation) the pattern grp + '_' + page. But the current bulk-add behaviour derives the automatic file_id directly from the input pattern:

https://github.com/OCR-D/core/blob/71d295ac1fccbeb4164e230bd584e1920b9ab3c8/ocrd/ocrd/cli/workspace.py#L304-L305

Not only does this usually omit the fileGrp, it also needlessly restricts IDs to only use ASCII letters (where in fact all Unicode letters should be allowed).

Also, in case the file_path is distinct from src_path (i.e. if the input expression is not a filename glob but a complex pattern), I believe the ID should be derived from src_path.

(The same is true for auto MIME type BTW.)
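
For illustration, here is a minimal sketch of the ID scheme suggested above. It is not the code in workspace.py; the helper name auto_file_id and the choice to replace disallowed characters with '_' are assumptions made for this example. It prefixes the fileGrp, takes the name from src_path, and keeps all Unicode letters and digits rather than only ASCII ones.

```python
import re
from pathlib import Path


def auto_file_id(file_grp: str, src_path: str) -> str:
    """Hypothetical helper: build an ID as fileGrp + '_' + sanitized stem of src_path."""
    # Use the source file name without directory and extension.
    stem = Path(src_path).stem
    # For str patterns, \w matches Unicode letters and digits (plus '_'),
    # so non-ASCII letters such as umlauts are kept as they are.
    # Everything else (except '.' and '-') becomes '_'.
    safe = re.sub(r"[^\w.-]", "_", stem)
    return file_grp + "_" + safe


if __name__ == "__main__":
    print(auto_file_id("OCR-D-IMG", "images/Seite 0001.tif"))
    print(auto_file_id("OCR-D-IMG", "scans/\u00dcbersicht.png"))
```

Prefixing the fileGrp also keeps the ID from starting with a digit when the file name does, which matters because XML IDs may not begin with one.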

Oct 12 '22 09:10 bertsky
