nixpy
Variable or fixed length string data
I've been looking into supporting string type data in DataArrays (see old issue #82). I think this is doable but there's one decision to consider and I would like some feedback.
We can support variable length or fixed length strings as data in DataArrays, or maybe both in some special way, and here are the arguments I can think of for each.
Variable length
- Obviously this is more flexible. Once a DataArray is created it can be edited freely.
- We already use variable length strings in Property for string type metadata and Dimension for labels.
- Potential issue: The special dtype we create to support variable length strings is simply of type `object` (`np.dtype('O')`). This isn't an issue now since no other data type we use matches this type. We could just assume that any data we read back that's of type `O` is a variable length string.
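To make the potential issue concrete, here is a minimal sketch using plain NumPy only (no nixpy or h5py calls; the array contents are made up for illustration). It shows that an object array of strings and an object array of arbitrary values report the same dtype, so the dtype alone cannot tell them apart.

```python
import numpy as np

# Variable length strings held as Python objects.
labels = np.array(["a", "spike train", "lfp"], dtype=object)

# Arbitrary non-string objects end up with the very same dtype.
mixed = np.array([1, None, {"k": "v"}], dtype=object)

# Both arrays report np.dtype('O').
same_dtype = labels.dtype == mixed.dtype

# Without extra information, inspecting the elements is the only way
# to distinguish the two cases.
labels_all_str = all(isinstance(x, str) for x in labels)
mixed_all_str = all(isinstance(x, str) for x in mixed)
```

Assuming that every `O` array is a string array is therefore safe only as long as nothing else is ever stored with that dtype.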
Fixed length
- We could restrict strings in DataArray to be fixed length, the same way any other type of DataArray can be reshaped but not resized. A string in this instance would be useful for storing vectors of character data and not necessarily text.
- The stored data type would not be lost on read. We would get back a `np.dtype('|Sn')`, where `n` is the length of the string.
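For comparison, a small sketch of how fixed length strings behave in plain NumPy (again, the values are illustrative and nothing here is nixpy API). The dtype carries the length, but values longer than that length are silently truncated on assignment.

```python
import numpy as np

# NumPy infers a fixed width from the longest element: 5 bytes here.
words = np.array([b"abc", b"de", b"fghij"])

dtype_str = words.dtype.str        # '|S5'
item_size = words.dtype.itemsize   # 5 bytes per element

# Assigning a longer value does not resize the array;
# the value is cut to the fixed width instead.
words[1] = b"toolongvalue"
truncated = words[1]
```

This is the trade-off behind the two options: the type survives a round trip, but the width has to be chosen up front.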
I'd love to hear everyone's thoughts on this.
Dunno, my gut tells me it should be variable length strings, just for the sake of memory saving. Actually, we could use this feature in the relacs context.
NIX (C++; not sure about the Python bindings) should already support vlen strings.
So vlen it is.
I also found the proper solution to the potential issue. The base type of the special dtype is stored in the `metadata` attribute.
```python
>>> import h5py
>>> vlenstr = h5py.special_dtype(vlen=str)
>>> vlenstr
dtype('O')
>>> vlenstr.metadata
mappingproxy({'vlen': str})
>>> vlenstr.metadata["vlen"]
str
```
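Building on that, a read-side check could look roughly like the sketch below. `is_vlen_str` is a hypothetical helper name, not something nixpy provides, and the dtype is built with NumPy's `metadata` argument as a stand-in for `h5py.special_dtype(vlen=str)` so the example runs without h5py. h5py also ships `h5py.check_dtype(vlen=...)` for the same purpose, which may be the better choice in practice.

```python
import numpy as np


def is_vlen_str(dtype):
    """Hypothetical helper: True if dtype is an object dtype tagged as vlen str."""
    meta = dtype.metadata  # None for dtypes without metadata
    return dtype.kind == "O" and meta is not None and meta.get("vlen") is str


# Stand-in for h5py.special_dtype(vlen=str): an object dtype with vlen metadata.
vlenstr = np.dtype("O", metadata={"vlen": str})

plain_obj = np.dtype("O")   # object dtype without any metadata
fixed = np.dtype("S8")      # fixed length byte string

results = [is_vlen_str(vlenstr), is_vlen_str(plain_obj), is_vlen_str(fixed)]
```

With a check like this, a plain `O` dtype is no longer assumed to be a string type; only dtypes that carry the `vlen` marker are treated as variable length strings.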