
sparse from_pandas ambiguity

royassis opened this issue on Jun 11, 2022 · 10 comments

Hey there

I'm using tiledb.from_pandas to create a tiledb array from a pandas dataframe. My question regards the sparse parameter of the from_pandas function.

My dataframe mostly consists of 0 values. Should I convert them to np.nan?

royassis avatar Jun 11 '22 09:06 royassis

Hi @royassis,

I would drop the rows containing the 0 values:

import tiledb
import numpy as np
import pandas as pd

# dataframe with a lot of zero values
print("original data with zeros")
data = pd.DataFrame(np.random.randint(0, 2, size=(10)))
print(data)

# drop the rows containing zero values
data = data[data[0] != 0]

uri = "example_pd_sparse.tdb"

# remove the array if it already exists
vfs = tiledb.VFS()
if vfs.is_dir(uri):
    vfs.remove_dir(uri)

# create a tiledb array from a pandas dataframe
tiledb.from_pandas(uri, data, sparse=True)

# read the tiledb array
with tiledb.open(uri, "r") as A:
    print("resulting sparse array")
    print(A.df[:])
    print(data.equals(A.df[:]))

Example run:

(tiledb-3.10) vivian@mangonada:~/tiledb-bugs$ python pd-df-zeros.py
original data with zeros
   0
0  0
1  0
2  0
3  0
4  1
5  0
6  1
7  0
8  1
9  1
resulting sparse array
   0
4  1
6  1
8  1
9  1
True

Let me know if this answers your question.

Thanks.

nguyenv avatar Jun 13 '22 13:06 nguyenv

Hey :)

Actually my df has multiple columns.

royassis avatar Jun 13 '22 14:06 royassis

I think I understand now. tiledb.from_pandas does recognize Pandas nullable dtypes (note the pd.UInt8Dtype()) and will accordingly set the TileDB attribute to be nullable (note the resulting schema). You can set nullable values with np.nan, pd.NA, or None in the dataframe and that will be reflected in the TileDB array.

import tiledb
import numpy as np
import pandas as pd

print("original data with zeros")
data = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
print(data)

# convert to a pandas nullable dtype and replace the 0s with pd.NA
data = data.astype(pd.UInt8Dtype())
data = data.replace(0, pd.NA)

uri = "example_pd_sparse.tdb"

# remove the array if it already exists
vfs = tiledb.VFS()
if vfs.is_dir(uri):
    vfs.remove_dir(uri)

# create a tiledb array from a pandas dataframe
tiledb.from_pandas(uri, data, sparse=True)

# read the tiledb array
with tiledb.open(uri, "r") as A:
    print("resulting array")
    print(A.df[:])
    print(data.equals(A.df[:]))

original data with zeros
   0  1  2
0  1  0  1
1  1  0  0
2  0  1  0
3  0  0  0
4  0  0  1
5  1  0  1
6  0  1  1
7  1  1  1
8  1  0  0
9  1  0  0
resulting array
ArraySchema(
  domain=Domain(*[
    Dim(name='__tiledb_rows', domain=(0, 9), tile=9, dtype='int64', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='0', dtype='uint8', var=False, nullable=True, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='1', dtype='uint8', var=False, nullable=True, filters=FilterList([ZstdFilter(level=-1), ])),
    Attr(name='2', dtype='uint8', var=False, nullable=True, filters=FilterList([ZstdFilter(level=-1), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=True,
  allows_duplicates=True,
)
      0     1     2
0     1  <NA>     1
1     1  <NA>  <NA>
2  <NA>     1  <NA>
3  <NA>  <NA>  <NA>
4  <NA>  <NA>     1
5     1  <NA>     1
6  <NA>     1     1
7     1     1     1
8     1  <NA>  <NA>
9     1  <NA>  <NA>
True

However, since all your coordinates (__tiledb_rows) contain data, you won't have any of the benefits of a sparse array, and you might even be better off creating this as dense.

If all the columns in your dataframe are the same datatype, I'm curious whether you are looking for something more like the following. It does not use tiledb.from_pandas: we create the schema ourselves with two dimensions and one attribute, convert the dataframe to a NumPy array, and write only the nonzero coordinates and data to the TileDB array. The result is a sparsely populated array.

import tiledb
import numpy as np
import pandas as pd

print("original data with zeros")
original_data = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
print(original_data)

uri = "example_normal_sparse.tdb"
# remove the array if it already exists
vfs = tiledb.VFS()
if vfs.is_dir(uri):
    vfs.remove_dir(uri)

arr = original_data.to_numpy()

# np.nonzero returns (row indices, column indices), so the first
# dimension spans the dataframe rows and the second its columns
dom = tiledb.Domain(
    tiledb.Dim("row", domain=(0, 9), dtype=np.uint8),
    tiledb.Dim("col", domain=(0, 2), dtype=np.uint8),
)
att = tiledb.Attr(dtype=np.uint8)
schema = tiledb.ArraySchema(domain=dom, attrs=(att,), sparse=True)

tiledb.Array.create(uri, schema)

with tiledb.open(uri, "w") as A:
    A[np.nonzero(arr)] = arr[np.nonzero(arr)]

with tiledb.open(uri, "r") as A:
    print("resulting array")
    print(A.schema)
    print(A.df[:])

original data with zeros
   0  1  2
0  0  0  0
1  1  0  1
2  0  1  0
3  0  0  0
4  0  1  1
5  0  1  0
6  0  1  0
7  1  1  0
8  0  0  1
9  0  0  1
resulting array
ArraySchema(
  domain=Domain(*[
    Dim(name='row', domain=(0, 9), tile=10, dtype='uint8'),
    Dim(name='col', domain=(0, 2), tile=3, dtype='uint8'),
  ]),
  attrs=[
    Attr(name='', dtype='uint8', var=False, nullable=False),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=10000,
  sparse=True,
  allows_duplicates=False,
)

    row  col
0     1    0  1
1     1    2  1
2     2    1  1
3     4    1  1
4     4    2  1
5     5    1  1
6     6    1  1
7     7    0  1
8     7    1  1
9     8    2  1
10    9    2  1

nguyenv avatar Jun 13 '22 16:06 nguyenv

Ohh, I understand now. I was hoping from_pandas would handle the conversion by itself.

I have genomic data with thousands of columns.

royassis avatar Jun 13 '22 17:06 royassis

Thanks @royassis for checking TileDB out. If you could provide some information about your use case and the schema of the raw data, we can get back to you with an optimized array schema and optimized ingestion scripts. We have a lot of experience with genomics data. Cheers!

stavrospapadopoulos avatar Jun 13 '22 17:06 stavrospapadopoulos

Hey @stavrospapadopoulos, I would love that. I'll check first with my colleagues about what data I can share.

royassis avatar Jun 13 '22 19:06 royassis

Hey again @stavrospapadopoulos

We work with many formats, but mainly with .h5ad. We have datasets with up to 30,000 genes and up to millions of cell barcodes. Our data is usually very sparse.

We have many projects; one of them is a small Streamlit app that reads data from .h5ad files and does some visualizations with Scanpy. As time went by and we added a few larger datasets, we looked for a solution that would give better performance, and we found TileDB.

A usual query will pull the values of all cell barcodes but only for a small number of genes (no more than 10). In the query we will also pull some feature data (e.g. cell type), do some aggregation per gene for each cell type, and visualize the results.

When working with pandas, a usual dataframe has genes plus some added features as the column names and cell barcodes as the index. The data is mostly sparse (many zeros, but all rows are populated), and we want to do aggregations on a small subset of genes but for all barcodes.

royassis avatar Jun 13 '22 19:06 royassis

Oh, you are working with single-cell data. You are in luck! We are working closely with the Chan Zuckerberg Initiative to define a unified single-cell data model and API that is interoperable with Seurat, Bioconductor, and ScanPy. Please check the API spec and ongoing TileDB implementation of the spec below:

  • https://github.com/single-cell-data/SOMA
  • https://github.com/single-cell-data/TileDB-SingleCell

stavrospapadopoulos avatar Jun 13 '22 22:06 stavrospapadopoulos

@nguyenv @stavrospapadopoulos Thank you both.

I have more questions to ask; what is the best place for that?

royassis avatar Jun 14 '22 05:06 royassis

You can join our Slack community or post questions on our forum.

stavrospapadopoulos avatar Jun 14 '22 12:06 stavrospapadopoulos