tskit icon indicating copy to clipboard operation
tskit copied to clipboard

String encoding in structured metadata arrays

Open benjeffery opened this issue 9 months ago • 5 comments

#3091 introduced returning a numpy structured array from a compatible metadata buffer. The StructCodec allows the specification of string encoding in the schema, however numpy only supports bytes and utf-32 in S and U dtypes. #3091 therefore returns structured arrays with only the S dtype for each string field.

At the cost of a copy and some shuffling, it would be possible to decode these to the users specification in the schema using numpy.char.decode then reassigning the encoded string array back into the structured array. This could be implemented as an option with a boolean flag decode_strings to structured_array_from_buffer, however as ts.X_metadata is a property this couldn't be set when retrieving the array, so either an additional property on the ts (ts.X_metadata_string_decode?) or leaving the user to do this via the lower-level code.

benjeffery avatar Mar 07 '25 11:03 benjeffery

Does the numpy 2.0 StringDtype alter this logic much? I'm considering implementing access to the ancestral_state etc arrays via some low-level C code using the StringDtype (which we can conditionally compile for numpy 2). I figure it's OK to have new features depend on numpy 2, once old code continues to work, and string handling is something that numpy 2 should be significantly better at.

jeromekelleher avatar Mar 07 '25 11:03 jeromekelleher

From the docs I can't see that StringDtype changes anything about the encoding.

benjeffery avatar Mar 07 '25 12:03 benjeffery

OK, can we park strings for a bit so and just raise an error if the schema as a string field? I want to play with this myself on the SC2 data to make sure it all scales.

jeromekelleher avatar Mar 07 '25 12:03 jeromekelleher

I'm not sure we need to error - the current code deals with strings, it's just that they have the S dtype and give a bytes object when you do something like table.node_metadata['sting_name'][26].

benjeffery avatar Mar 07 '25 12:03 benjeffery

Right but we'd like that to be a string rather than bytes. Raising an error makes sure we don't forget this and release accidentally.

jeromekelleher avatar Mar 07 '25 13:03 jeromekelleher