zarr-specs Node name case sensitivity

Currently the v3.0 core protocol draft section on node names states the following:

Note that node names are used to form storage keys, and that some storage systems will perform a case-insensitive comparison of storage keys during retrieval. Therefore, within a hierarchy, all nodes within a set of sibling nodes must have a name that is unique under case-insensitive comparison. E.g., the names “foo” and “FOO” are not allowed for sibling nodes.

This constraint, however, is problematic in scenarios where nodes are being created in parallel. This is because requiring that sibling nodes have a unique name under case-insensitive comparison would require that a check is made before node creation to ensure no name collisions, and any such check would require synchronisation of node creation at least within the same group.

Currently a design goal for the spec is to avoid constraints which require operations to be synchronised, thus allowing multiple parallel processes to work simultaneously, including creating nodes within the same hierarchy. This avoids implementation complexity (no need for locks or other synchronisation mechanisms) and also works better on stores with eventual consistency behaviour.

I suggest we remove this constraint from the core protocol. I.e., implementations of the core protocol are not expected to check for case-insensitive node name collisions.

However, instead we add a usage note, which provides some guidance for implementing applications using a zarr protocol implementation, including recommending that applications avoid creating sibling nodes with case-insensitive name collisions wherever that is possible.

I.e., we push this to the application layer.

Mar 31 '20 22:03 alimanfoo

However, instead we add a usage note, which provides some guidance for implementing applications using a zarr protocol implementation, including recommending that applications avoid creating sibling nodes with case-insensitive name collisions wherever that is possible.

Then do you want to add on the spec that stores must be able to handle case sensitive names ? In the spec it says Note that node names are used to form storage keys it does not say how, and that might push for example filesystems to not just use a key but another scheme.

May 05 '20 18:05 Carreau

Hi @Carreau,

The subsection on storage keys describes how keys are constructed from node paths.

It is then up to a store spec to decide how to use these keys and store data.

For example, the file system store spec defines how to translate keys into file paths. I expect this will be a commonly-used store, and so the question about case (in)sensitivity mainly applies to this store. But it is not a fixed choice, alternative store specs could be proposed which takes a different approach.

Hope that makes sense.

May 06 '20 12:05 alimanfoo

Yes, thanks that what I understood, I just want to make sure decisions are not taken on implementations leaking. The fact that the directory store happens to be case (in)sensitive and how to do key retrieval can as far as I can tell be worked around pretty easily).

from hashlib import md5
from base64 import b32encode
import os

def f(path):
    return os.sep.join( 
        x+'-'+md5(x.encode()).hexdigest()[:5] for x in path.split('/')
    ).lower()

print(f('FOO/bar')) #foo-90189/bar-37b51
print(f('foo/bar')) #foo-acbd1/bar-37b51

def g(path):
    return os.sep.join( 
        x+'-'+''.join(['1' if c.isupper() else '0' for c in x]) for x in path.split('/')
    ).lower()

print(g('FoO/bar')) #foo-101/bar-000
print(g('foo/bar')) #foo-000/bar-000


def h(path):
    return os.sep.join( 
        b32encode(x.encode()).decode() for x in path.split('/')
    ).lower()

print(h('FoO/bar')) #izxu6===/mjqxe===
print(h('foo/bar')) #mzxw6===/mjqxe===

Now each of those have different tradeoff on readability/length of path, key enumeration... etc.

Now the problem I see with requiring avoiding if possible, is that you may end up with an code implementation on one system that happened to rely on some case insensitivity of the underlying store and will be broken on another system.

If the choice were mine, I would probably impose case sensitivity for a store spec.

This is also not a complete unrelated discussion from unicode as casefolding for non-ascii and case folding it complex (it may even depends on user locale, i'm not sure) (https://unicode.org/faq/casemap_charprop.html have some crazy things.)

from a performance perspective you may even be able to just do bytes comparison as well, which I'm guessing compiled languages would be happy(ier) about.

May 06 '20 22:05 Carreau

Thanks @Carreau, very helpful.

If the choice were mine, I would probably impose case sensitivity for a store spec.

FWIW I think this is the most natural thing to do.

To put it into the terms of the v3 protocol spec, I think this would mean stating in the section on the store interface that we expect stores to perform a case sensitive comparison of keys for the get, set and delete operations.

What then should the file system store spec do? That spec defines a translation from storage keys to file paths that relies on the underlying file system being case sensitive, and so should not be used on a case insensitive file system. Maybe that's OK, and we just add a big red note saying "only use on case sensitive file systems".

I guess we could always define another file system storage spec which is designed to work on case insensitive file systems, and so carries some of the extra overhead of longer and/or more obscure file paths.

The two storage specs could live alongside each other, and people could choose to implement and use which they think best.

Thoughts welcome.

May 07 '20 12:05 alimanfoo

To put it into the terms of the v3 protocol spec, I think this would mean stating in the section on the store interface that we expect stores to perform a case sensitive comparison of keys for the get, set and delete operations.

probably list keys as well.

What then should the file system store spec do? That spec defines a translation from storage keys to file paths that relies on the underlying file system being case sensitive, and so should not be used on a case insensitive file system.

Maybe that's OK, and we just add a big red note saying "only use on case sensitive file systems".

I think that might be tough.

And yes, I would go with a different file storage spec, and is also one of the reasons I opened the issue about detecting the kind of store independently of the zarr version in order to dispatch.

Now I'm' pretty new to the project, I'm trying to give my best advice with what I believe is the direction of the project you are envisaging. I might be completely wrong.

May 07 '20 15:05 Carreau

Jim Pivarski suggest we escape uppercase characters, and this would keep readability.

Oct 09 '20 14:10 Carreau

The v3 core spec currently states that

Node names are case sensitive, e.g., the names “foo” and “FOO” are not identical

which should resolve that at least on the spec level, right? Should this issue still be kept open? I guess it will become important once stores which are case insensitive are added, then escaping uppercase characters might be the proper solution.

Nov 24 '22 13:11 jstriebel

This requirement would preclude use of the filesystem store (without escaping) with normal local filesystems on macos and windows. Instead it should be defined by the store/implementation.

Nov 24 '22 14:11 jbms

See also #175, where I added notes about case insensitivity for Windows and MacOS. However, escaping might still be an option. Should that be added as an option to the current FileSystemStore? (There's no such thing as to specify configs for a store in the metadata atm, will open an issue about that soon).

Nov 24 '22 14:11 jstriebel

Escaping could be a solution, but it would essentially amount to a separate incompatible store, and would sacrifice readability of the keys.

Nov 24 '22 14:11 jbms

it would essentially amount to a separate incompatible store

Good point, so it could also be added later as an additional store. IMO this is good enough to go forward with v3.

Nov 24 '22 14:11 jstriebel

PS: Just noticed that escaping might be done via a storage transformer.

Nov 24 '22 15:11 jstriebel

zarr-specs zarr-specs copied to clipboard

Node name case sensitivity

zarr-specs
zarr-specs copied to clipboard