Improving Usability of ASCII Strings

Open nguyenv opened this issue 3 years ago • 4 comments

  • Previously, TILEDB_STRING_ASCII data was displayed inconsistently, sometimes as bytes and sometimes as str.
  • ASCII data needs to be coerced to str everywhere, for two reasons: (1) the resulting dataframe previously displayed ASCII as bytes with Pyarrow disabled but as str with Pyarrow enabled, and (2) this fix removes the need to copy large amounts of data to convert back and forth in the TileDB-SingleCell Python API.
  • A warning is now emitted asking the user to pass dtype="ascii" for string dim types, in lieu of np.bytes_ or np.str_, for clarity. All three still work, and under the hood all of them use np.str_ and TILEDB_STRING_ASCII.
  • The repr of string dimensions now always displays dtype="ascii". Calling .dtype() will return np.dtype('U'), because the return signature of dtype requires a Numpy dtype.
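
To make the last two bullets concrete, here is a minimal sketch of the described behavior using only NumPy and the standard library. The helper name `normalize_string_dim_dtype` is hypothetical and is not part of TileDB-Py; it only illustrates the idea that all three spellings are accepted, that the non-"ascii" spellings trigger a warning, and that the reported NumPy dtype is `np.dtype('U')`.

```python
import warnings

import numpy as np


def normalize_string_dim_dtype(dtype):
    """Hypothetical helper (not TileDB-Py API).

    Returns a (display_name, numpy_dtype) pair for a string dimension dtype.
    """
    # The preferred spelling: accepted silently.
    if isinstance(dtype, str) and dtype == "ascii":
        return "ascii", np.dtype("U")

    # np.bytes_ has kind "S" and np.str_ has kind "U".
    kind = np.dtype(dtype).kind
    if kind in ("S", "U"):
        # Still accepted, but nudge the caller toward the clearer spelling.
        warnings.warn(
            'Pass dtype="ascii" for string dimensions instead of '
            "np.bytes_ or np.str_.",
            UserWarning,
            stacklevel=2,
        )
        return "ascii", np.dtype("U")

    raise TypeError(f"not a string dimension dtype: {dtype!r}")
```

Under these assumptions, `normalize_string_dim_dtype("ascii")`, `normalize_string_dim_dtype(np.bytes_)`, and `normalize_string_dim_dtype(np.str_)` all return the same pair, and only the last two emit a warning.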

nguyenv, Aug 30 '22 15:08

@nguyenv testing now -- thanks! :)

johnkerl, Aug 30 '22 17:08

@nguyenv using SOMA tiledb://johnkerl-tiledb/Krasnow, obs array https://cloud.tiledb.com/arrays/details/johnkerl-tiledb/351316c2-a1e4-400f-828d-758517517cfb/schema

The obs_id dim is STRING_ASCII, but the string attrs show up as CHAR in that web UI's schema presentation.

Also, I see:

>>> obs_uri = 'tiledb://johnkerl-tiledb/351316c2-a1e4-400f-828d-758517517cfb'
>>> O = tiledb.open(obs_uri)
>>> O.df[:]
                       nGene  nUMI  channel     region  ...        sex    tissue   ethnicity           development_stage
obs_id                                                  ...
P1_1_AAACCTGAGCGATAGC    959  3583  b'P1_1'  b'normal'  ...    b'male'  b'blood'  b'unknown'  b'75-year-old human stage'
P1_1_AAACCTGAGGCAAAGA   1168  3480  b'P1_1'  b'normal'  ...    b'male'  b'blood'  b'unknown'  b'75-year-old human stage'
P1_1_AAACCTGGTATTCGTG   2053  7838  b'P1_1'  b'normal'  ...    b'male'  b'blood'  b'unknown'  b'75-year-old human stage'
P1_1_AAACCTGGTGATGTCT   1099  3976  b'P1_1'  b'normal'  ...    b'male'  b'blood'  b'unknown'  b'75-year-old human stage'
P1_1_AAACGGGAGACAATAC   1225  4403  b'P1_1'  b'normal'  ...    b'male'  b'blood'  b'unknown'  b'75-year-old human stage'
...                      ...   ...      ...        ...  ...        ...       ...         ...                         ...
P3_8_TTTGTCAGTCGGCATC   1782  6362  b'P3_8'  b'normal'  ...  b'female'  b'blood'  b'unknown'  b'51-year-old human stage'
P3_8_TTTGTCAGTCTAGTGT   1241  3505  b'P3_8'  b'normal'  ...  b'female'  b'blood'  b'unknown'  b'51-year-old human stage'
P3_8_TTTGTCATCACATACG   1234  3442  b'P3_8'  b'normal'  ...  b'female'  b'blood'  b'unknown'  b'51-year-old human stage'
P3_8_TTTGTCATCCTAGTGA   1150  2469  b'P3_8'  b'normal'  ...  b'female'  b'blood'  b'unknown'  b'51-year-old human stage'
P3_8_TTTGTCATCTTGCATT   2364  8348  b'P3_8'  b'normal'  ...  b'female'  b'blood'  b'unknown'  b'51-year-old human stage'
^^^^ strings                                 ^^^^^^ still bytes
[65662 rows x 29 columns]
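
For reference, the client-side conversion that output like the above currently forces looks roughly like the following. This is a sketch that assumes pandas and an object-dtype column holding only bytes values; `decode_bytes_columns` is a hypothetical helper, not a TileDB-Py or TileDB-SingleCell function. It makes a copy of the dataframe, which is exactly the cost the coercion to str is meant to avoid.

```python
import pandas as pd


def decode_bytes_columns(df, encoding="ascii"):
    """Hypothetical helper: return a copy of df with bytes columns decoded to str."""
    out = df.copy()  # this copy is the overhead discussed in this thread
    for name in out.columns:
        col = out[name]
        # Only touch non-empty object columns whose values are all bytes.
        is_bytes_column = (
            col.dtype == object
            and len(col) > 0
            and col.map(lambda v: isinstance(v, bytes)).all()
        )
        if is_bytes_column:
            out[name] = col.str.decode(encoding)
    return out
```

Numeric columns such as nGene and nUMI pass through unchanged, and the input dataframe is not modified.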

This is (as a reminder) in reference to https://github.com/single-cell-data/TileDB-SingleCell/issues/99 -- please see there for the reason that we cannot store these attributes as Unicode within the TileDB storage.

johnkerl, Aug 30 '22 17:08

We have resolved the original problem that prompted this PR.

However, this branch still contains several features that may be important for usability, such as consistent presentation of ASCII as str, explicit display of dtype="ascii" in Dim.__repr__ and Attr.__repr__, and a preference for passing dtype="ascii" instead of np.str_ or np.bytes_ for dimensions.

nguyenv, Aug 30 '22 20:08