zarr-python
Requirements of store data
Raising this issue to get an idea of what our requirements are of stores and what can be placed in them.
For instance, in many cases we require Arrays to have an object_codec in order to store object types, and many stores would have difficulty with such data without explicit conversion to some sort of bytes-like object; yet we appear to be placing raw objects in a store in at least one test. We also seem to expect stores to be easily comparable, which doesn't work if the store holds NumPy ndarrays ( https://github.com/zarr-developers/zarr/issues/348 ).
Should we set some explicit requirements about what stores require? If so, what would those requirements be? Also how would we enforce them?
Thanks @jakirkham, very good idea to raise this.
FWIW I think as a minimum, a store:
- must implement the `MutableMapping` interface
- must support keys that are ASCII strings (provided as `str` on PY3 and also as `str` on PY2)
- must normalise keys as described in the zarr format spec v2
- must reject keys that are invalid as defined in the zarr format spec v2
- must support setting values that are any buffer-like object
- must return values as any buffer-like object
- must be pickleable
- must (should?) implement equals comparison (`__eq__`) and compare true to another store of the same type holding the same data

...where a "buffer-like object" is an object exporting (PY2 and PY3) the new-style buffer interface or (PY2 only) the old-style buffer interface. (See the sketch after this list.)
Optionally, a store:
- may reject any key that is not an ASCII string
- may reject a value that is not a buffer-like object
- may (should?) reject a value that is buffer-like but of object dtype
- may implement case-insensitive comparison of keys (not ideal, but hard to avoid for storage on some file-systems)
- should implement `listdir()`, `rmdir()`, `rename()`, `getsize()` methods where possible (sketched below)
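The optional methods could be layered on top of the same sketch. The semantics here are what I would expect rather than anything normative, and `rename()` is omitted for brevity:

```python
class SketchStoreWithExtras(SketchStore):
    """Adds the optional introspection methods to the sketch above."""

    def _prefix(self, path):
        return self._normalize(path) + '/' if path else ''

    def listdir(self, path=''):
        # names of the immediate children below a hierarchy path
        prefix = self._prefix(path)
        children = {key[len(prefix):].split('/')[0]
                    for key in self._data if key.startswith(prefix)}
        return sorted(children)

    def rmdir(self, path=''):
        # remove every key below a hierarchy path
        prefix = self._prefix(path)
        for key in [k for k in self._data if k.startswith(prefix)]:
            del self._data[key]

    def getsize(self, path=''):
        # total number of stored bytes below a hierarchy path
        prefix = self._prefix(path)
        return sum(len(v) for k, v in self._data.items() if k.startswith(prefix))
```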
I know this doesn't directly answer your question about tests involving object arrays, but maybe gives a bit more context to that discussion.
At least this helps to clarify for me that we shouldn't really be using dict as a store class, as this does not handle the requirements regarding key normalisation. So your proposed move to use DictStore instead of dict as the default store class for arrays seems good to me.
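For example, the kind of key normalisation the spec asks for (which I believe zarr 2.x exposes as zarr.storage.normalize_storage_path) is exactly what a plain dict will not do:

```python
from zarr.storage import normalize_storage_path  # present in zarr 2.x, I believe

normalize_storage_path('foo/bar')     # -> 'foo/bar'
normalize_storage_path('/foo//bar/')  # -> 'foo/bar'

# A plain dict happily keeps differently spelled paths as distinct keys:
d = {}
d['foo/bar'] = b'x'
d['/foo//bar/'] = b'y'
len(d)  # 2 -- the normalisation requirement is not met
```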
Thanks for outlining this so clearly.
Based on our discussion here and in ( https://github.com/zarr-developers/zarr/issues/348 ) am leaning towards hardening the requirement that DictStore must hold bytes ( https://github.com/zarr-developers/zarr/pull/350 ) (or at least bytes-like data) and that Array should use DictStore for in-memory storage to enforce this requirement ( https://github.com/zarr-developers/zarr/pull/351 ).
Something to think about in the larger context is how we validate a store. Should we have a function that is able to run through a store and make sure it is spec conforming?
+1
Interesting question. We have a class zarr.tests.test_storage.StoreTests which can be sub-classed to create a set of unit tests for a store class. Is that enough, or do we want something that could be run more dynamically?
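For a store developer the pattern is roughly the following (a sketch against the zarr 2.x test layout; the exact base classes and the create_store hook may differ between versions, and MyStore is a hypothetical store under test):

```python
import unittest

from zarr.tests.test_storage import StoreTests

from mypackage import MyStore  # hypothetical custom store under test


class TestMyStore(StoreTests, unittest.TestCase):

    def create_store(self):
        # StoreTests drives the shared suite (mapping behaviour, pickling,
        # etc.) against whatever fresh, empty store instance this returns.
        return MyStore()
```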
That's an interesting idea. Was thinking about it in the context of validating stores for use with Array (though maybe this could/should be thought of more generally). How dynamic probably depends on when/how it is run. For instance we could register valid stores, in which case this would happen once when they are registered. Stores could be pre-registered as well (i.e. builtin ones). Alternatively we could do it whenever the store is used (maybe caching valid store types). Could also just leave this as a user facing function and trust users have tested their store with it. This latter case may lead to defensive coding on our part though.
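As a rough sketch of the "register once, then trust" variant (purely hypothetical; none of these names exist in zarr):

```python
_KNOWN_GOOD_STORE_TYPES = set()  # hypothetical cache of already-validated store types

_REQUIRED_METHODS = ('__getitem__', '__setitem__', '__delitem__', '__iter__', '__len__')


def validate_store(store):
    """Cheap structural check, performed once per store type."""
    store_type = type(store)
    if store_type in _KNOWN_GOOD_STORE_TYPES:
        return store
    missing = [name for name in _REQUIRED_METHODS if not hasattr(store, name)]
    if missing:
        raise TypeError(f'{store_type.__name__} is missing required methods: {missing}')
    _KNOWN_GOOD_STORE_TYPES.add(store_type)
    return store
```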
FWIW I'd be happy if we provided developer support so store class developers can thoroughly test a store class implementation, but then at runtime trust that users provide something sensible as a store. We're already pretty defensive, e.g., we normalise all storage paths above the storage layer. We could also check that the result of the chunk encoding pipeline is an object supporting the buffer protocol before passing it on to storage. In other words, we'd guarantee that we provide valid keys and values to the storage layer, but after that I think we can just trust stores to do something reasonable with them.
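That buffer-protocol check could be as simple as the following (a sketch only, not existing zarr code):

```python
def ensure_buffer_like(value):
    # memoryview() accepts exactly the objects exporting the (new-style)
    # buffer protocol, so this rejects e.g. arbitrary Python objects early,
    # before the encoded chunk is handed to the storage layer.
    try:
        memoryview(value)
    except TypeError as e:
        raise TypeError(
            f'chunk encoding produced a non-buffer-like object: {type(value)!r}'
        ) from e
    return value
```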
I just reviewed this thread. I'm thinking ahead towards questions of language inter-operability, and I'm concerned that our definition of a store is too python-centric.
While it should always be possible to implement a custom store by following the requirements above, perhaps we should also define a spec for a store that does not depend on python concepts such as mutable-mapping, pickleable, etc. This would make it easier to implement zarr in other languages.
Hi Ryan, FWIW the format spec does try to remain language-agnostic and talks in an abstract way about key/value stores. I think the discussion in this particular issue has been scoped more to the specifics of how store classes are implemented in Python and what is expected of them there.
Focusing just on the format spec for the moment, do you think that needs to be made more language-agnostic, or otherwise needs any improvement or clarifications?
Thanks for the clarification. I see how this thread is specific to python implementations.
I guess I worry that the spec is too vague with regard to the implementation of the key/value store and the methods that can be used to query it:
> A Zarr array can be stored in any storage system that provides a key/value interface, where a key is an ASCII string and a value is an arbitrary sequence of bytes, and the supported operations are read (get the sequence of bytes associated with a given key), write (set the sequence of bytes associated with a given key) and delete (remove a key/value pair).
In terms of operations, "read", "write", and "delete" don't seem like a complete enumeration of the operations a store must support. When implementing a store, you also need at least some form of "list" operation; otherwise zarr can't discover what is in the store. (The exception is stores with consolidated metadata.) In fact, you have to implement a `MutableMapping`, which has five abstract methods: `__getitem__`, `__setitem__`, `__delitem__`, `__iter__`, and `__len__`.
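For reference, those five are exactly the abstract methods declared by Python's own ABC:

```python
from collections.abc import MutableMapping

sorted(MutableMapping.__abstractmethods__)
# ['__delitem__', '__getitem__', '__iter__', '__len__', '__setitem__']
```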
More generally, how do we ensure that DirectoryStore, ZipStore, or any of the myriad cloud stores that have been developed can truly be read by different implementations of zarr? I wonder if it would be worth explicitly defining a spec for certain commonly used stores that gives more detail about the implementation choices that have already been made in the zarr-python code.
Thanks Ryan, good points. We certainly could be more explicit about the set of operations that a storage system must support, and make sure we include everything (e.g., listing all keys). We could also state the optional operations, which are not strictly necessary but allow for some optimisations or additional features, like being able to list all the keys that are children of some hierarchy path (the listdir() method in Python implementations).
We could do this in a language-independent way but still make it clear and concrete how this corresponds to specific operations supported by a file system or a cloud object service or whatever.
I think we could also do this as an update to the format spec, without requiring a new spec version, as these would be clarifications of the existing spec.
In PR ( https://github.com/zarr-developers/zarr-python/pull/789 ) we added a BaseStore class, which addresses some of these basic needs of stores.
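For illustration, a custom store can subclass it roughly like this (a sketch assuming zarr-python 2.11+, where I believe BaseStore is importable from zarr.storage; the store class itself is hypothetical):

```python
from zarr.storage import BaseStore  # available in zarr-python >= 2.11, I believe


class MemoryBackedStore(BaseStore):  # hypothetical example, not shipped with zarr
    """Custom store inheriting the shared behaviour added in #789."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = bytes(memoryview(value))

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)
```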
Subsequent discussion around the v3 spec and storing standardized data from libraries addresses the other concerns raised here.
Were there any other things still needing to be addressed here?
cc @joshmoore @grlee77
Closing now that the v3 spec goes into much more detail on the subject.