ocrd workspace bulk-add: auto fileid should include filegrp
By convention (a consequence of how make_file_id is implemented), new file IDs in OCR-D follow the pattern grp + '_' + page. The current bulk-add behaviour, however, assigns file_id automatically from the input pattern directly:
https://github.com/OCR-D/core/blob/71d295ac1fccbeb4164e230bd584e1920b9ab3c8/ocrd/ocrd/cli/workspace.py#L304-L305
Not only does this usually omit the fileGrp, it also needlessly restricts IDs to ASCII letters, when in fact all Unicode letters should be allowed.
Also, when file_path differs from src_path (i.e. when the input expression is not a filename glob but a complex pattern), I believe the ID should be derived from src_path.
(The same is true for auto MIME type BTW.)
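
To make the proposal concrete, here is a minimal sketch of what the automatic fallbacks could look like. It is not the current code in workspace.py: the helper names auto_file_id and auto_mimetype are hypothetical, and the exact set of characters kept in the ID is an assumption for illustration only.

```python
import mimetypes
import re
from pathlib import PurePath
from typing import Optional


def auto_file_id(file_grp: str, src_path: str) -> str:
    """Hypothetical helper: build an ID as fileGrp + '_' + sanitized source stem."""
    # Take the basename of the *source* path, without its extension.
    stem = PurePath(src_path).stem
    # For str patterns in Python 3, \w is Unicode-aware, so non-ASCII letters
    # are kept; anything else (spaces, slashes, ...) becomes an underscore.
    # Which characters to keep is an assumption, not OCR-D's actual rule.
    safe_stem = re.sub(r"[^\w.-]", "_", stem)
    return f"{file_grp}_{safe_stem}"


def auto_mimetype(src_path: str) -> Optional[str]:
    """Hypothetical helper: guess the MIME type from the source path."""
    return mimetypes.guess_type(src_path)[0]
```

With this sketch, auto_file_id("OCR-D-IMG", "scans/Füße 0001.tif") keeps the fileGrp prefix and the non-ASCII letters, and both fallbacks look only at src_path, never at file_path.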