zarr-specs
zarr-specs copied to clipboard
Node name case sensitivity
Currently the v3.0 core protocol draft section on node names states the following:
Note that node names are used to form storage keys, and that some storage systems will perform a case-insensitive comparison of storage keys during retrieval. Therefore, within a hierarchy, all nodes within a set of sibling nodes must have a name that is unique under case-insensitive comparison. E.g., the names “foo” and “FOO” are not allowed for sibling nodes.
This constraint, however, is problematic in scenarios where nodes are being created in parallel. This is because requiring that sibling nodes have a unique name under case-insensitive comparison would require that a check is made before node creation to ensure no name collisions, and any such check would require synchronisation of node creation at least within the same group.
Currently a design goal for the spec is to avoid constraints which require operations to be synchronised, thus allowing multiple parallel processes to work simultaneously, including creating nodes within the same hierarchy. This avoids implementation complexity (no need for locks or other synchronisation mechanisms) and also works better on stores with eventual consistency behaviour.
I suggest we remove this constraint from the core protocol. I.e., implementations of the core protocol are not expected to check for case-insensitive node name collisions.
However, instead we add a usage note, which provides some guidance for implementing applications using a zarr protocol implementation, including recommending that applications avoid creating sibling nodes with case-insensitive name collisions wherever that is possible.
I.e., we push this to the application layer.
However, instead we add a usage note, which provides some guidance for implementing applications using a zarr protocol implementation, including recommending that applications avoid creating sibling nodes with case-insensitive name collisions wherever that is possible.
Then do you want to add on the spec that stores must be able to handle case sensitive names ? In the spec it says Note that node names are used to form storage keys it does not say how, and that might push for example filesystems to not just use a key but another scheme.
Hi @Carreau,
The subsection on storage keys describes how keys are constructed from node paths.
It is then up to a store spec to decide how to use these keys and store data.
For example, the file system store spec defines how to translate keys into file paths. I expect this will be a commonly-used store, and so the question about case (in)sensitivity mainly applies to this store. But it is not a fixed choice, alternative store specs could be proposed which takes a different approach.
Hope that makes sense.
Yes, thanks that what I understood, I just want to make sure decisions are not taken on implementations leaking. The fact that the directory store happens to be case (in)sensitive and how to do key retrieval can as far as I can tell be worked around pretty easily).
from hashlib import md5
from base64 import b32encode
import os
def f(path):
return os.sep.join(
x+'-'+md5(x.encode()).hexdigest()[:5] for x in path.split('/')
).lower()
print(f('FOO/bar')) #foo-90189/bar-37b51
print(f('foo/bar')) #foo-acbd1/bar-37b51
def g(path):
return os.sep.join(
x+'-'+''.join(['1' if c.isupper() else '0' for c in x]) for x in path.split('/')
).lower()
print(g('FoO/bar')) #foo-101/bar-000
print(g('foo/bar')) #foo-000/bar-000
def h(path):
return os.sep.join(
b32encode(x.encode()).decode() for x in path.split('/')
).lower()
print(h('FoO/bar')) #izxu6===/mjqxe===
print(h('foo/bar')) #mzxw6===/mjqxe===
Now each of those have different tradeoff on readability/length of path, key enumeration... etc.
Now the problem I see with requiring avoiding if possible, is that you may end up with an code implementation on one system that happened to rely on some case insensitivity of the underlying store and will be broken on another system.
If the choice were mine, I would probably impose case sensitivity for a store spec.
This is also not a complete unrelated discussion from unicode as casefolding for non-ascii and case folding it complex (it may even depends on user locale, i'm not sure) (https://unicode.org/faq/casemap_charprop.html have some crazy things.)
from a performance perspective you may even be able to just do bytes comparison as well, which I'm guessing compiled languages would be happy(ier) about.
Thanks @Carreau, very helpful.
If the choice were mine, I would probably impose case sensitivity for a store spec.
FWIW I think this is the most natural thing to do.
To put it into the terms of the v3 protocol spec, I think this would mean stating in the section on the store interface that we expect stores to perform a case sensitive comparison of keys for the get, set and delete operations.
What then should the file system store spec do? That spec defines a translation from storage keys to file paths that relies on the underlying file system being case sensitive, and so should not be used on a case insensitive file system. Maybe that's OK, and we just add a big red note saying "only use on case sensitive file systems".
I guess we could always define another file system storage spec which is designed to work on case insensitive file systems, and so carries some of the extra overhead of longer and/or more obscure file paths.
The two storage specs could live alongside each other, and people could choose to implement and use which they think best.
Thoughts welcome.
To put it into the terms of the v3 protocol spec, I think this would mean stating in the section on the store interface that we expect stores to perform a case sensitive comparison of keys for the get, set and delete operations.
probably list keys as well.
What then should the file system store spec do? That spec defines a translation from storage keys to file paths that relies on the underlying file system being case sensitive, and so should not be used on a case insensitive file system.
Maybe that's OK, and we just add a big red note saying "only use on case sensitive file systems".
I think that might be tough.
And yes, I would go with a different file storage spec, and is also one of the reasons I opened the issue about detecting the kind of store independently of the zarr version in order to dispatch.
Now I'm' pretty new to the project, I'm trying to give my best advice with what I believe is the direction of the project you are envisaging. I might be completely wrong.
Jim Pivarski suggest we escape uppercase characters, and this would keep readability.
The v3 core spec currently states that
Node names are case sensitive, e.g., the names “foo” and “FOO” are not identical
which should resolve that at least on the spec level, right? Should this issue still be kept open? I guess it will become important once stores which are case insensitive are added, then escaping uppercase characters might be the proper solution.
This requirement would preclude use of the filesystem store (without escaping) with normal local filesystems on macos and windows. Instead it should be defined by the store/implementation.
See also #175, where I added notes about case insensitivity for Windows and MacOS. However, escaping might still be an option. Should that be added as an option to the current FileSystemStore? (There's no such thing as to specify configs for a store in the metadata atm, will open an issue about that soon).
Escaping could be a solution, but it would essentially amount to a separate incompatible store, and would sacrifice readability of the keys.
it would essentially amount to a separate incompatible store
Good point, so it could also be added later as an additional store. IMO this is good enough to go forward with v3.
PS: Just noticed that escaping might be done via a storage transformer.
See also #201.
@jstriebel: what was the status here?
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#node-names
We basically leave it up to the underlying store.