
Variable or fixed length string data

Open achilleas-k opened this issue 8 years ago • 3 comments

I've been looking into supporting string type data in DataArrays (see old issue #82). I think this is doable but there's one decision to consider and I would like some feedback.

We can support variable length or fixed length strings as data in DataArrays, or maybe both in some special way, and here are the arguments I can think of for each.

Variable length

  • Obviously this is more flexible. Once a DataArray is created it can be edited freely.
  • We already use variable length strings in Property for string type metadata and Dimension for labels.
  • Potential issue: The special dtype we create to support variable length strings is a plain object dtype (np.dtype('O')), so the fact that it holds strings is not visible from the dtype itself. This isn't a problem right now, since no other data type we use maps to object. We could simply assume that any data we read back with dtype O is a variable length string.
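To illustrate the potential issue, here is a minimal sketch using plain NumPy (the array contents are made up for the example). It shows that an array of strings and an array of arbitrary objects end up with the same dtype, so the dtype alone does not say "string":

```python
import numpy as np

# Python strings stored with the generic object dtype.
strings = np.array(["spike", "burst", "a much longer label"], dtype=object)

# Arbitrary Python objects get exactly the same dtype.
mixed = np.array([1, None, {"key": "value"}], dtype=object)

# Both comparisons hold: nothing in the dtype distinguishes the two arrays.
same_as_object = strings.dtype == np.dtype("O")
same_as_mixed = strings.dtype == mixed.dtype
print(same_as_object, same_as_mixed)
```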

Fixed length

  • We could restrict strings in DataArray to be fixed length, the same way any other type of DataArray can be reshaped but not resized. A string in this instance would be useful for storing vectors of character data and not necessarily text.
  • The stored data type would not be lost on read. We would get back a np.dtype('|Sn'), where n is the length of the string.
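For comparison, a small sketch of the fixed length behaviour in plain NumPy (the values are illustrative only). The length is part of the dtype, and anything longer assigned afterwards is truncated to fit:

```python
import numpy as np

# Fixed length byte strings: every element occupies exactly 3 bytes.
fixed = np.array([b"abc", b"de"], dtype="S3")

kind = fixed.dtype.kind          # 'S' marks a byte string dtype
itemsize = fixed.dtype.itemsize  # the n in '|Sn'

# Assigning a longer value does not resize the array; it is cut to 3 bytes.
fixed[1] = b"too long"
truncated = fixed[1]
print(kind, itemsize, truncated)
```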

I'd love to hear everyone's thoughts on this.

achilleas-k avatar Nov 21 '16 15:11 achilleas-k

Dunno, my gut tells me it should be variable length strings, just for the sake of memory saving. Actually, we could use this feature in the relacs context.

jgrewe avatar Nov 21 '16 16:11 jgrewe

NIX (C++, not sure about the Python bindings) should support vlen strings already.

gicmo avatar Nov 21 '16 16:11 gicmo

So vlen it is.

I also found the proper solution to the potential issue with the object dtype: the base type of the special dtype is stored in its metadata attribute.

>>> vlenstr = h5py.special_dtype(vlen=str)
>>> vlenstr
dtype('O')

>>> vlenstr.metadata
mappingproxy({'vlen': str})

>>> vlenstr.metadata["vlen"]
str
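Building on that, a check on read could look roughly like the sketch below. The helper name is_vlen_str is hypothetical, and the tagged dtype is built directly with NumPy's metadata argument to mimic what the session above shows, so the example does not need h5py. If h5py is available, h5py.check_dtype(vlen=dt) should give the same information.

```python
import numpy as np

def is_vlen_str(dt):
    """Hypothetical helper: True if dt is an object dtype tagged as vlen str."""
    meta = dt.metadata  # None for dtypes created without metadata
    return dt.kind == "O" and meta is not None and meta.get("vlen") is str

# Assumption: an object dtype carrying {'vlen': str}, like the one shown above.
tagged = np.dtype("O", metadata={"vlen": str})
plain = np.dtype("O")

print(is_vlen_str(tagged))  # True
print(is_vlen_str(plain))   # False
```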

achilleas-k avatar Nov 22 '16 10:11 achilleas-k