TileDB-Py
sparse from_pandas ambiguity
Hey there
I'm using tiledb.from_pandas to create a TileDB array from a pandas dataframe. My question concerns the sparse parameter of the from_pandas function.
My dataframe mostly consists of 0 values. Should I convert them to np.nan?
Hi @royassis,
I would drop the rows containing the 0 values:
import tiledb
import numpy as np
import pandas as pd

# dataframe with a lot of zero values
data = pd.DataFrame(np.random.randint(0, 2, size=(10,)))
print("original data with zeros")
print(data)

# drop the rows with zero values
data = data[data[0] != 0]

uri = "example_pd_sparse.tdb"

# remove the array if it already exists
vfs = tiledb.VFS()
if vfs.is_dir(uri):
    vfs.remove_dir(uri)

# create a TileDB array from the pandas dataframe
tiledb.from_pandas(uri, data, sparse=True)

# read the TileDB array back
with tiledb.open(uri, "r") as A:
    print("resulting sparse array")
    print(A.df[:])
    print(data.equals(A.df[:]))
Example run:
(tiledb-3.10) vivian@mangonada:~/tiledb-bugs$ python pd-df-zeros.py
original data with zeros
   0
0  0
1  0
2  0
3  0
4  1
5  0
6  1
7  0
8  1
9  1
resulting sparse array
   0
4  1
6  1
8  1
9  1
True
Let me know if this answers your question.
Thanks.
Hey :)
Actually my df has multiple columns.
I think I understand now. tiledb.from_pandas does recognize pandas nullable dtypes (note the pd.UInt8Dtype()) and will accordingly set the TileDB attribute to be nullable (note the resulting schema). You can set nullable values with np.nan, pd.NA, or None in the dataframe, and that will be reflected in the TileDB array.
import tiledb
import numpy as np
import pandas as pd

print("original data with zeros")
data = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
print(data)

# convert to a pandas nullable dtype and replace the 0s with a nullable value
data = data.astype(pd.UInt8Dtype())
data = data.replace(0, None)

uri = "example_pd_sparse.tdb"

# remove the array if it already exists
vfs = tiledb.VFS()
if vfs.is_dir(uri):
    vfs.remove_dir(uri)

# create a TileDB array from the pandas dataframe
tiledb.from_pandas(uri, data, sparse=True)

# read the TileDB array back
with tiledb.open(uri, "r") as A:
    print("resulting array")
    print(A.schema)
    print(A.df[:])
    print(data.equals(A.df[:]))
original data with zeros
   0  1  2
0  1  0  1
1  1  0  0
2  0  1  0
3  0  0  0
4  0  0  1
5  1  0  1
6  0  1  1
7  1  1  1
8  1  0  0
9  1  0  0
resulting array
ArraySchema(
  domain=Domain(*[
    Dim(name='__tiledb_rows', domain=(0, 9), tile=9, dtype='int64', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='0', dtype='uint8', var=False, nullable=True, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='1', dtype='uint8', var=False, nullable=True, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='2', dtype='uint8', var=False, nullable=True, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=True,
  allows_duplicates=True,
)
      0     1     2
0     1  <NA>     1
1     1  <NA>  <NA>
2  <NA>     1  <NA>
3  <NA>  <NA>  <NA>
4  <NA>  <NA>     1
5     1  <NA>     1
6  <NA>     1     1
7     1     1     1
8     1  <NA>  <NA>
9     1  <NA>  <NA>
True
However, since all your coordinates (__tiledb_rows) contain data, you won't get any of the benefits of a sparse array, and you might even be better off creating this as a dense array.
If all the columns in your dataframe are the same datatype, I'm curious whether you are looking for something more like this? It does not use tiledb.from_pandas. Instead, we create a schema ourselves with two dimensions and one attribute, convert the dataframe to a NumPy array, and grab the nonzero coordinates and data to write to the TileDB array. You end up with an array that is sparsely populated.
import tiledb
import numpy as np
import pandas as pd

print("original data with zeros")
original_data = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
print(original_data)

uri = "example_normal_sparse.tdb"

# remove the array if it already exists
vfs = tiledb.VFS()
if vfs.is_dir(uri):
    vfs.remove_dir(uri)

arr = original_data.to_numpy()

# np.nonzero returns (row_indices, col_indices) for the (10, 3) array,
# so the first dimension spans the 10 rows and the second the 3 columns
dom = tiledb.Domain(
    tiledb.Dim("row", domain=(0, 9), dtype=np.uint8),
    tiledb.Dim("col", domain=(0, 2), dtype=np.uint8),
)
att = tiledb.Attr(dtype=np.uint8)
schema = tiledb.ArraySchema(domain=dom, attrs=(att,), sparse=True)
tiledb.Array.create(uri, schema)

# write only the nonzero coordinates and their values
with tiledb.open(uri, "w") as A:
    A[np.nonzero(arr)] = arr[np.nonzero(arr)]

with tiledb.open(uri, "r") as A:
    print("resulting array")
    print(A.schema)
    print(A.df[:])

original data with zeros
   0  1  2
0  0  0  0
1  1  0  1
2  0  1  0
3  0  0  0
4  0  1  1
5  0  1  0
6  0  1  0
7  1  1  0
8  0  0  1
9  0  0  1
resulting array
ArraySchema(
  domain=Domain(*[
    Dim(name='row', domain=(0, 9), tile=10, dtype='uint8'),
    Dim(name='col', domain=(0, 2), tile=3, dtype='uint8'),
  ]),
  attrs=[
    Attr(name='', dtype='uint8', var=False, nullable=False),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=True,
  allows_duplicates=False,
)
    row  col
0     1    0  1
1     1    2  1
2     2    1  1
3     4    1  1
4     4    2  1
5     5    1  1
6     6    1  1
7     7    0  1
8     7    1  1
9     8    2  1
10    9    2  1
Ohh, I understand now. I was hoping from_pandas would handle the conversion by itself.
I've got genomic data with thousands of columns.
Thanks @royassis for checking TileDB out. If you could provide some information about your use case and the schema of the raw data, we can get back to you with an optimized array schema and optimized ingestion scripts. We have a lot of experience with genomics data. Cheers!
Hey @stavrospapadopoulos I would love that; I'll check first with my colleagues on what data I can share.
Hey again @stavrospapadopoulos
We work with many formats, but mainly with .h5ad. We have datasets with up to 30,000 genes and up to millions of cell barcodes. Our data is usually very sparse.
We have many projects; one of them is a small Streamlit app that reads data from .h5ad files and does some visualizations with Scanpy. As time went by and we added a few larger datasets, we looked for a solution that would give better performance, and we found TileDB.
A usual query will pull the values of all cell barcodes, but only for a small number of genes (no more than 10). The query will also pull some feature data (e.g. cell type), do some aggregation per gene for each cell type, and visualize the results.
When working with pandas, a usual dataframe has genes plus some added features as the column names and cell barcodes as the index of the dataframe. The data is mostly sparse (many zeros, but all rows are present), and we want to do aggregations on a small subset of genes but for all barcodes.
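Once such a query comes back as a dataframe, the per-cell-type aggregation is plain pandas; a sketch with made-up gene and feature column names (GENE_A, GENE_B, and cell_type are hypothetical, not from the thread):

```python
import numpy as np
import pandas as pd

# hypothetical query result: a few gene columns plus a cell_type
# feature column, with cell barcodes as the index
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "GENE_A": rng.integers(0, 5, size=8),
        "GENE_B": rng.integers(0, 5, size=8),
        "cell_type": ["T cell", "B cell"] * 4,
    },
    index=[f"barcode_{i}" for i in range(8)],
)

# mean expression of each gene per cell type
summary = df.groupby("cell_type")[["GENE_A", "GENE_B"]].mean()
print(summary)
```

The heavy lifting is pulling only the wanted gene columns out of storage; the groupby on the small result is cheap.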
Oh, you are working with single-cell data. You are in luck! We are working closely with the Chan Zuckerberg Initiative to define a unified single-cell data model and API that is interoperable with Seurat, Bioconductor, and Scanpy. Please check the API spec and the ongoing TileDB implementation of the spec below:
- https://github.com/single-cell-data/SOMA
- https://github.com/single-cell-data/TileDB-SingleCell
@nguyenv @stavrospapadopoulos Thank you both.
I have more questions to ask; what is the best place for those?
You can join our Slack community or post questions on our forum.