TileDB-Py
Improving Usability of ASCII Strings
- Previously, `TILEDB_STRING_ASCII` data was inconsistently displayed as `bytes`.
- There is a need to coerce to `str` everywhere because (1) previously the resulting dataframe displayed ASCII as bytes with Pyarrow disabled but as str with Pyarrow enabled, and (2) this fix would remove the need to copy large amounts of data to convert back and forth in the TileDB-SingleCell Python API.
- Warning now emitted to the user to pass `dtype="ascii"` for string dim types in lieu of `np.bytes_` or `np.str_` for clarity. All three still work and under the hood use `np.str_` and `TILEDB_STRING_ASCII`.
- `repr` of string dimensions is now always displayed as `dtype="ascii"`. Calling `.dtype()` will return `np.dtype('U')` as the return signature of `dtype` requires a Numpy dtype.
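To make the dtype handling described above concrete, here is a minimal sketch of the intended behavior. It is not the TileDB-Py implementation: `normalize_string_dim_dtype` and `reported_numpy_dtype` are hypothetical names used only for illustration, and the warning text is an assumption.

```python
import warnings

import numpy as np


def normalize_string_dim_dtype(dtype):
    """Hypothetical helper: map every accepted string-dim spelling to "ascii".

    "ascii" passes through silently; np.str_ and np.bytes_ still work but
    trigger a warning recommending the explicit spelling.
    """
    if isinstance(dtype, str) and dtype == "ascii":
        return "ascii"
    if dtype is np.str_ or dtype is np.bytes_:
        warnings.warn(
            'pass dtype="ascii" for string dimensions instead of '
            "np.str_ or np.bytes_",  # assumed wording, not the real message
            UserWarning,
            stacklevel=2,
        )
        return "ascii"
    raise TypeError(f"not a string dimension dtype: {dtype!r}")


def reported_numpy_dtype():
    """Hypothetical helper: the NumPy dtype reported for an ASCII dimension."""
    # A NumPy dtype has to be returned, so the Unicode dtype stands in
    # for the ASCII storage type.
    return np.dtype("U")


# The explicit spelling is accepted without a warning.
assert normalize_string_dim_dtype("ascii") == "ascii"

# The legacy spellings resolve to the same thing but warn.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert normalize_string_dim_dtype(np.bytes_) == "ascii"
assert len(caught) == 1
```

The point of the sketch is that all three spellings converge on one storage type, while only the explicit `"ascii"` spelling is silent.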
This pull request has been linked to Shortcut Story #20965: Unicode-to-ASCII and ASCII-to-Unicode in TileDB-Py.
@nguyenv testing now -- thanks! :)
@nguyenv using SOMA tiledb://johnkerl-tiledb/Krasnow, obs array
https://cloud.tiledb.com/arrays/details/johnkerl-tiledb/351316c2-a1e4-400f-828d-758517517cfb/schema
The obs_id dim is STRING_ASCII but the string attrs show up in that web-UI schema presentation as CHAR
Also, I see:
>>> obs_uri = 'tiledb://johnkerl-tiledb/351316c2-a1e4-400f-828d-758517517cfb'
>>> O = tiledb.open(obs_uri)
>>> O.df[:]
nGene nUMI channel region ... sex tissue ethnicity development_stage
obs_id ...
P1_1_AAACCTGAGCGATAGC 959 3583 b'P1_1' b'normal' ... b'male' b'blood' b'unknown' b'75-year-old human stage'
P1_1_AAACCTGAGGCAAAGA 1168 3480 b'P1_1' b'normal' ... b'male' b'blood' b'unknown' b'75-year-old human stage'
P1_1_AAACCTGGTATTCGTG 2053 7838 b'P1_1' b'normal' ... b'male' b'blood' b'unknown' b'75-year-old human stage'
P1_1_AAACCTGGTGATGTCT 1099 3976 b'P1_1' b'normal' ... b'male' b'blood' b'unknown' b'75-year-old human stage'
P1_1_AAACGGGAGACAATAC 1225 4403 b'P1_1' b'normal' ... b'male' b'blood' b'unknown' b'75-year-old human stage'
... ... ... ... ... ... ... ... ... ...
P3_8_TTTGTCAGTCGGCATC 1782 6362 b'P3_8' b'normal' ... b'female' b'blood' b'unknown' b'51-year-old human stage'
P3_8_TTTGTCAGTCTAGTGT 1241 3505 b'P3_8' b'normal' ... b'female' b'blood' b'unknown' b'51-year-old human stage'
P3_8_TTTGTCATCACATACG 1234 3442 b'P3_8' b'normal' ... b'female' b'blood' b'unknown' b'51-year-old human stage'
P3_8_TTTGTCATCCTAGTGA 1150 2469 b'P3_8' b'normal' ... b'female' b'blood' b'unknown' b'51-year-old human stage'
P3_8_TTTGTCATCTTGCATT 2364 8348 b'P3_8' b'normal' ... b'female' b'blood' b'unknown' b'51-year-old human stage'
^^^^ strings ^^^^^^ still bytes
[65662 rows x 29 columns]
This is (as a reminder) in reference to https://github.com/single-cell-data/TileDB-SingleCell/issues/99 -- please see there for the reason that we cannot store these attributes as Unicode within the TileDB storage.
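For illustration, this is roughly the caller-side conversion that is needed while attribute values still come back as bytes. It is a sketch in plain Python rather than anything from TileDB-Py or TileDB-SingleCell: `decode_ascii_cells` is a hypothetical helper, and the sample row reuses a few values from the output above.

```python
def decode_ascii_cells(rows):
    """Hypothetical helper: return a copy of rows with bytes values decoded as ASCII.

    Non-bytes values (ints, str, ...) are passed through unchanged.
    """
    return [
        {
            key: value.decode("ascii") if isinstance(value, bytes) else value
            for key, value in row.items()
        }
        for row in rows
    ]


# One row shaped like the obs output above: the dim is already str,
# while the string attributes are still bytes.
rows = [
    {
        "obs_id": "P1_1_AAACCTGAGCGATAGC",
        "nGene": 959,
        "channel": b"P1_1",
        "tissue": b"blood",
    }
]

decoded = decode_ascii_cells(rows)
print(decoded[0]["channel"], decoded[0]["tissue"])
```

Note that this builds a second copy of every string value, which is the kind of back-and-forth conversion the PR description hopes to make unnecessary.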
We have resolved the original problem that prompted this PR.
However, this branch still contains several features that may be important for usability, such as consistent presentation of ASCII as str, explicit display of dtype="ascii" in Dim.repr and Attr.repr, and a preference for passing dtype="ascii" instead of np.str_ or np.bytes_ for dimensions.