Unicode Normalization?

Open · edsu opened this issue 10 years ago · 4 comments

The BagIt specification lets you specify that UTF-8 encoding be used in tag manifests. But it doesn't appear to assume a particular normalization form.

I have a problem where files are bagged on an OS X filesystem (which uses NFD) and then copied to Linux (where filenames are typically NFC). During validation, the NFC-normalized path from the filesystem is compared against the NFD-normalized path from the manifest, and validation fails.

Should a particular normalization form (NFC?) be assumed for Unicode encodings?

edsu, Jan 21 '16 19:01
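
The mismatch described above can be reproduced with a minimal Python sketch. The filename is a made-up example, not taken from a real bag: the same visible name has different code points, and therefore different UTF-8 bytes, in NFC and NFD, so a plain string comparison fails.

```python
import unicodedata

# Hypothetical filename containing a precomposed "é" (U+00E9).
name = "caf\u00e9.txt"

nfc = unicodedata.normalize("NFC", name)  # "é" stays one code point
nfd = unicodedata.normalize("NFD", name)  # "é" becomes "e" + U+0301

print(nfc == nfd)           # False
print(len(nfc), len(nfd))   # 8 9
print(nfc.encode("utf-8"))  # b'caf\xc3\xa9.txt'
print(nfd.encode("utf-8"))  # b'cafe\xcc\x81.txt'
```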

Perhaps the spec doesn't need to specify as long as applications normalize both the path from the manifest and the path from the filesystem when comparing?

edsu, Jan 21 '16 19:01
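
The idea of normalizing both sides before comparing could look roughly like the following Python sketch. The function names `normalize_path` and `compare_paths` are hypothetical and do not come from the spec or from any existing BagIt library. NFC is only an assumed default here; the point is that both sides use the same form.

```python
import unicodedata


def normalize_path(path, form="NFC"):
    """Return the path in a single Unicode normalization form."""
    return unicodedata.normalize(form, path)


def compare_paths(manifest_paths, filesystem_paths, form="NFC"):
    """Compare manifest entries with on-disk paths after normalizing both.

    Returns a tuple (missing, unexpected):
      missing    - paths listed in the manifest but not found on disk
      unexpected - paths found on disk but not listed in the manifest
    """
    manifest = {normalize_path(p, form) for p in manifest_paths}
    on_disk = {normalize_path(p, form) for p in filesystem_paths}
    return manifest - on_disk, on_disk - manifest


# Manifest written with NFD names, filesystem reporting NFC names.
manifest_paths = [unicodedata.normalize("NFD", "data/caf\u00e9.txt")]
filesystem_paths = [unicodedata.normalize("NFC", "data/caf\u00e9.txt")]

missing, unexpected = compare_paths(manifest_paths, filesystem_paths)
print(missing, unexpected)  # set() set()
```

This only normalizes for the comparison. A real validator would still need to open the file under the name the filesystem actually reports in order to compute its checksum.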

I think trying to mandate a normalization form would be hard but perhaps there should be a prominent guide for implementors? We could follow with enhancement requests for the known open source projects.

acdha, Jan 21 '16 19:01

:+1: mandating seems hard (fruitless), but a note to implementors to pick one when comparing the filesystem paths against the manifest filenames seems like a good idea?

edsu, Jan 21 '16 19:01

@edsu how does the proposed recommendation in https://github.com/loc-rdc/bagitspec/pull/1/ and especially https://github.com/loc-rdc/bagitspec/pull/1/commits/f898aff4ee89c441ee6931f708d942551ad549a4 sound?

acdha, Feb 02 '17 21:02