String encoding in structured metadata arrays
#3091 introduced returning a numpy structured array from a compatible metadata buffer. The StructCodec allows the string encoding to be specified in the schema, but numpy's fixed-width string dtypes only support raw bytes (S) and UTF-32 (U). #3091 therefore returns structured arrays with the S dtype for every string field.
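For reference, a quick illustration of the constraint (a toy array, not the tskit API):

```python
import numpy as np

# numpy's fixed-width string dtypes: S holds raw bytes, U holds UTF-32 (UCS-4).
b = np.array([b"acgt"], dtype="S4")
u = np.array(["acgt"], dtype="U4")

assert isinstance(b[0], bytes)  # S fields come back as bytes objects
assert isinstance(u[0], str)    # U fields come back as str
assert u.itemsize == 16         # UTF-32: 4 characters * 4 bytes each
```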
At the cost of a copy and some shuffling, it would be possible to decode these to the encoding the user specified in the schema using numpy.char.decode, then reassign the decoded string array back into the structured array. This could be implemented behind a boolean flag decode_strings on structured_array_from_buffer. However, as ts.X_metadata is a property, the flag couldn't be passed when retrieving the array, so we'd need either an additional property on the ts (ts.X_metadata_string_decode?) or to leave the user to do this via the lower-level code.
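A minimal sketch of that copy-and-decode approach, assuming utf-8 in the schema; the helper and field names here are made up for illustration:

```python
import numpy as np

def decode_string_fields(arr, encoding="utf-8"):
    # Widen each S field to a U field with the same character capacity;
    # copy non-string fields through unchanged.
    fields = [
        (name, f"U{arr.dtype[name].itemsize}")
        if arr.dtype[name].kind == "S"
        else (name, arr.dtype[name])
        for name in arr.dtype.names
    ]
    out = np.empty(arr.shape, dtype=fields)
    for name in arr.dtype.names:
        if arr.dtype[name].kind == "S":
            out[name] = np.char.decode(arr[name], encoding)
        else:
            out[name] = arr[name]
    return out

raw = np.array([(1, b"caf\xc3\xa9")], dtype=[("id", "i4"), ("name", "S8")])
decoded = decode_string_fields(raw)
assert decoded["name"][0] == "café"
```

This is the copy the flag would pay for: one pass to build the widened dtype, then one np.char.decode per string field.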
Does the numpy 2.0 StringDType alter this logic much? I'm considering implementing access to the ancestral_state etc. arrays via some low-level C code using the StringDType (which we can conditionally compile for numpy 2). I figure it's OK to have new features depend on numpy 2, as long as old code continues to work, and string handling is something that numpy 2 should be significantly better at.
From the docs I can't see that StringDType changes anything about the encoding.
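For what it's worth, at the Python level StringDType is variable-width, UTF-8 internally, and always hands back str, so there is no per-field encoding choice to plumb through (numpy >= 2 only):

```python
import numpy as np

if np.lib.NumpyVersion(np.__version__) >= "2.0.0":
    a = np.array(["A", "ACGT"], dtype=np.dtypes.StringDType())
    assert isinstance(a[0], str)         # str, never bytes
    a[0] = "no fixed width to overflow"  # unlike S/U, no truncation
```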
OK, can we park strings for a bit and just raise an error if the schema has a string field? I want to play with this myself on the SC2 data to make sure it all scales.
I'm not sure we need to error - the current code deals with strings; it's just that they have the S dtype and give a bytes object when you do something like table.node_metadata['string_name'][26].
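i.e. the element-level workaround today is just a decode at the point of use (stand-in array and field name for illustration):

```python
import numpy as np

# Stand-in for what table.node_metadata currently returns.
node_metadata = np.array([(b"sample-26",)], dtype=[("string_name", "S16")])

raw = node_metadata["string_name"][0]      # b"sample-26", a bytes object
assert raw.decode("utf-8") == "sample-26"  # decode per the schema's encoding
```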
Right, but we'd like that to be a string rather than bytes. Raising an error makes sure we don't forget this and accidentally release it.