Indexing bugs
Hi @ihnorton, sorry to bother once more, but I think I found a couple of bugs in conjunction with indexing.
Setup:
# +
import json
import tiledb
import numpy as np
import pandas as pd
import random
# -
import json
test_df = pd.DataFrame.from_records(json.loads('{"chrom":{"0":"chr1","1":"chr1","2":"chr2","3":"chr2","4":"chr1","5":"chr1","8":"chr1","9":"chr1"},"log10_len":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":0,"9":0},"start":{"0":10108,"1":10108,"2":10108,"3":10108,"4":10108,"5":10108,"8":10143,"9":10143},"end":{"0":10114,"1":10114,"2":10114,"3":10114,"4":10114,"5":10114,"8":10144,"9":10144},"ref":{"0":"AACCCT","1":"AACCCT","2":"AACCCT","3":"AACCCT","4":"AACCCT","5":"AACCCT","8":"T","9":"T"},"alt":{"0":"A","1":"A","2":"A","3":"A","4":"A","5":"A","8":"C","9":"C"},"sample_id":{"0":"A","1":"B","2":"C","3":"D","4":"E","5":"F","8":"A","9":"B"},"GT":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":1,"9":1},"GQ":{"0":79,"2":60,"3":99,"4":26,"5":62,"8":22,"9":65},"DP":{"0":12,"1":9,"2":39,"3":26,"4":9,"5":9,"8":35,"9":34}}'))
test_df
output_path="test.tdb"
ctx = tiledb.default_ctx()
ctx
# +
genotype_domain = tiledb.Domain(
tiledb.Dim(name="chrom", domain=(None,None), tile=1, dtype=np.bytes_, ctx=ctx),
tiledb.Dim(name="log10_len", domain=(0, np.iinfo(np.int8).max), tile=1, dtype=np.int8, ctx=ctx),
tiledb.Dim(name="start", domain=(0, np.iinfo(np.int32).max), tile=100000, dtype=np.int32, ctx=ctx),
tiledb.Dim(name="alt", domain=(None,None), tile=None, dtype=np.bytes_, ctx=ctx),
# tiledb.Dim(name="end", domain=(1, np.iinfo(np.int32).max), dtype=np.int32, ctx=ctx),
tiledb.Dim(name="sample_id", domain=(None,None), tile=None, dtype=np.bytes_, ctx=ctx),
ctx=ctx,
)
string_filters = tiledb.FilterList([tiledb.ZstdFilter(level=-1),])
int_filters = tiledb.FilterList([tiledb.ByteShuffleFilter(), tiledb.ZstdFilter(level=-1),])
attrs = [
tiledb.Attr(name='end', dtype='int32', var=False, nullable=False, filters=int_filters),
tiledb.Attr(name='ref', dtype='S', nullable=False, filters=string_filters),
tiledb.Attr(name='GT', dtype='int8', var=False, nullable=False, filters=int_filters),
tiledb.Attr(name='GQ', dtype='int32', var=False, nullable=True, filters=int_filters),
tiledb.Attr(name='DP', dtype='int32', var=False, nullable=True, filters=int_filters),
]
# -
schema = tiledb.ArraySchema(
domain=genotype_domain,
attrs=attrs,
sparse=True,
cell_order="hilbert",
# capacity=10000,
ctx=ctx,
)
schema
if tiledb.array_exists(output_path):
print("Deleting array at '%s'..." % output_path)
import shutil
shutil.rmtree(output_path)
print("Creating array at '%s'..." % output_path)
tiledb.array.SparseArray.create(output_path, schema, ctx=ctx)
tiledb.from_dataframe(
output_path,
test_df.astype({
'end': 'int32',
'ref': 'S',
'GT': 'int8',
'GQ': 'Int32',
'DP': 'Int32',
}),
sparse=True,
mode="append"
)
tiledb.open_dataframe(output_path, use_arrow=True)
A = tiledb.open(output_path, ctx=ctx, mode='r')
A
Now my trials:
# works
A.query(use_arrow=True, coords=True).df[:]
# works
A["chr2"]
# +
# kernel breaks with `realloc(): invalid pointer`
# A.df["chr2"]
# kernel breaks with `realloc(): invalid pointer`
# A.query(use_arrow=True, coords=True).df["chr2"]
# -
# empty result
A.multi_index["chr2"]
# works
A["chr1", 0]
# works
A.multi_index[:]
# empty result
A.multi_index[:, 0]
# empty result
A.multi_index["chr2", 0]
# ---------------------------------------------------------------------------
# IndexError Traceback (most recent call last)
# <ipython-input-22-20675d6deb0c> in <module>
# 12
# 13 # IndexError: invalid index type: <class 'list'>
# ---> 14 A[[("chr1", 0), ("chr1", 1),]]
#
# tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.__getitem__()
#
# tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.subarray()
#
# tiledb/libtiledb.pyx in tiledb.libtiledb.index_domain_subarray()
#
# IndexError: invalid index type: <class 'list'>
A[[
("chr1", 0),
("chr1", 1),
]]
- Is there a way to keep the python kernel from crashing, even when having invalid input to tiledb?
- How to do key-based indexing? (
A[[("chr1", 0), ("chr1", 1),]]) - How to do
multi_indexindexing with DataFrame as return type?
Is there a way to keep the python kernel from crashing, even when having invalid input to tiledb?
kernel breaks with
realloc(): invalid pointer
Not sure what is going on there, I will take a look and debug tomorrow.
How to do key-based indexing? (A[[("chr1", 0), ("chr1", 1),]])
I think what you want is: A.multi_index["chr1", [0,1]] which will select "chr1" on first dim, and select 0 and 1 on 2nd dim. [edit: use A.multi_index["chr1", [0,1], :, :, :]]
How to do multi_index indexing with DataFrame as return type?
A.df[<selection>] is for that -- same semantics as .multi_index (and shared implementation for the selection parsing).
For these:
# empty result
A.multi_index[:, 0]
# empty result
A.multi_index["chr2", 0]
# empty result
A.multi_index["chr2"]
It is a bug; I think we have a fix in progress in libtiledb core, but I will see if we can apply the substance of the following work around automatically in TileDB-Py.
RIght now you can work around it like this (place-holder : for each non-indexed dimension):
In [16]: A.multi_index["chr2", :, :, :, :]
retries: 0
Out[16]:
OrderedDict([('chrom', array([b'chr2', b'chr2'], dtype=object)),
('log10_len', array([1, 1], dtype=int8)),
('start', array([10108, 10108], dtype=int32)),
('alt', array([b'A', b'A'], dtype=object)),
('sample_id', array([b'C', b'D'], dtype=object)),
('end', array([10114, 10114], dtype=int32)),
('ref', array([b'AACCCT', b'AACCCT'], dtype=object)),
('GT', array([1, 1], dtype=int8)),
('GQ', array([60, 99], dtype=int32)),
('DP', array([39, 26], dtype=int32))])
Also, this works for .df:
In [11]: A.query(use_arrow=True, coords=True).df["chr2", :, :, :,:]
retries: 0
Out[11]:
chrom log10_len start alt sample_id end ref GT GQ DP
0 chr2 1 10108 A C 10114 b'AACCCT' 1 60 39
1 chr2 1 10108 A D 10114 b'AACCCT' 1 99 26
and this (for my suggestion):
In [14]: A.df["chr1", [0,1], :, :, :]
retries: 0
Out[14]:
chrom log10_len start alt sample_id end ref GT GQ DP
0 chr1 1 10108 A A 10114 b'AACCCT' 1 79.0 12
1 chr1 1 10108 A B 10114 b'AACCCT' 1 NaN 9
2 chr1 0 10143 C B 10144 b'T' 1 65.0 34
3 chr1 0 10143 C A 10144 b'T' 1 22.0 35
4 chr1 1 10108 A E 10114 b'AACCCT' 1 26.0 9
5 chr1 1 10108 A F 10114 b'AACCCT' 1 62.0 9
and this one (avoids the crash):
In [15]: A.query(use_arrow=True, coords=True).df["chr2", :, :, :, :]
retries: 0
Out[15]:
chrom log10_len start alt sample_id end ref GT GQ DP
0 chr2 1 10108 A C 10114 b'AACCCT' 1 60 39
1 chr2 1 10108 A D 10114 b'AACCCT' 1 99 26
Alright, thanks!
I also did another test for the key-value type access:
30it/s is not too great, compared to RocksDB where we reach up to 40,000 it/s.
Is there something I can do about?
Hi @hoeze, are you able to share any more details about how you are using RocksDB? I’m not very familiar with it yet, and we’d like to dig in to the comparison a bit more.
Hi @Hoeze, a couple more notes on key-value queries:
- Not sure how you use RocksDB, but I would suggest you use a single string dimension in TileDB for such queries.
- You may want to tweak the tile capacity and compression for key-value queries (e.g., shrink the tile size, use no compression for local deployments).
- Issuing all key-value queries using
multi_rangewill be significantly faster than single-point queries in a loop. - We have some upcoming optimizations in the read algorithm that will benefit significantly even key-value queries.
- We have not optimized for (local) key-value performance yet, there are a couple of extra improvements we can do if this becomes critical (e.g., use
mmap, optimize the tile format and read algorithm for key-value queries using hashing, etc.)
If you provided with more details, we could take a closer look and see if there are any low-hanging fruits for optimizations here.
Hi, I set up a RocksDB example for you. To run it, you need to download the database attached here.
from pathlib import Path
import rocksdb
class VariantDB:
def __init__(self, path, rocksdb_options=None):
if rocksdb_options is None:
rocksdb_options = rocksdb.Options(
create_if_missing=True,
max_open_files=100,
)
self.db = rocksdb.DB(
path,
rocksdb_options,
read_only=True
)
@staticmethod
def _variant_to_byte(variant):
return bytes(str(variant), 'utf-8')
def _type(self, value):
raise NotImplementedError()
def _get(self, variant):
if not variant.startswith('chr'):
variant = 'chr%s' % variant
return self.db.get(self._variant_to_byte(variant))
def __getitem__(self, variant):
maf = self._get(variant)
if maf:
return self._type(maf)
else:
raise KeyError('This variant is not in the db')
def __contains__(self, variant):
return self._get(variant) is not None
def get(self, variant, default=None):
try:
return self[variant]
except KeyError:
return default
class VariantMafDB(VariantDB):
def _type(self, value):
return float(value)
mafdb_path = "maf.db"
mafdb = VariantMafDB(mafdb_path)
variants = [
'1:10468:T>TAA',
'1:10468:TCGCGG>T',
"10:107494853:C>A",
"10:107494857:C>A",
"10:107494858:T>C",
"10:107494873:C>T",
"10:107494874:G>A",
"10:107494905:GAGAA>G",
"10:107494908:A>G",
"10:107494929:T>C",
"10:107494933:T>C",
"10:107494935:G>A",
"10:107494937:C>G",
"10:107494941:CTTG>C",
"10:107494942:T>A",
"10:107494943:T>C",
"10:107494960:G>T",
"10:107494964:C>A",
"10:107494979:G>A",
"10:10749497:A>G",
"10:107494988:T>C",
"10:107494989:C>T",
"10:10749498:C>T",
"10:107494998:T>C",
"10:10749499:G>A",
"10:107495002:T>C",
]
Benchmark:
print(len(variants))
# 26
print([mafdb.get(v, 0) for v in variants])
# [0, 0, 0.00149653, 3.18451e-05, 3.18492e-05, 0.000223029, 0.014212, 3.18471e-05, 3.18573e-05, 3.18451e-05, 3.18634e-05, 3.18573e-05, 6.37349e-05, 3.18552e-05, 3.18431e-05, 0.000127502, 0.00350877, 3.18573e-05, 3.18471e-05, 3.18431e-05, 3.18492e-05, 0.0316323, 0.00012747, 3.18431e-05, 6.37105e-05, 3.18451e-05]
%timeit [mafdb.get(v, 0) for v in variants]
# 273 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
- This example database is only 3.6GB of size, but we are working with database >100GB of size.
-
This RocksDB is very simple: its key is a string and the value a float. However, this also means that RocksDB has no idea about the data structure.1. Not sure how you use RocksDB, but I would suggest you use a single string dimension in TileDB for such queries.- Using the variant schema from my first post (without
sample_id,log10_lenand only a single attribute), TileDB should have a great chance to beat RocksDB since ordererd requests are always hitting the same fragment. - Also, keeping the different dimensions allows for range queries ("get all variants on chr1", etc.)
- Using the variant schema from my first post (without
-
Thanks for the hint. I tried without any compression filters but I'm still reaching only ~50it/s.2. You may want to tweak the tile capacity and compression for key-value queries (e.g., shrink the tile size, use no compression for local deployments). -
Yes, but I think this is only possible with a single key, right? I.e. querying at once3. Issuing all key-value queries using `multi_range` will be significantly faster than single-point queries in a loop.[("chr1", 0, 10108, "A"), ("chr1", 1, 10143, "C"),]will not work I believe?
Having a very fast TileDB solution with a single String dimension would already be a great improvement to us because:
- RocksDB is completely schema-free
- RocksDB does not support parallel writes
However, I believe that TileDB should also be able to improve on RocksDB speed when it is provided with some well-defined dimensions.
I hope my benchmark is of some use for you :)
Thanks for for the additional information @Hoeze, this is very valuable! We do have a lot of ideas on how to boost performance for this use case, as it is much simpler than the range queries we are currently performing and our current algorithms are an overkill. We will hopefully push them to the next couple of releases. Thanks again!
The bugs listed have been fixed as of 0.11.0 (switching out tiledb.from_dataframe for tiledb.from_pandas).
(tiledb-3.9) vivian@mangonada:~/TileDB-Py/tiledb$ python ~/tiledb-bugs/indexing_bugs.py
Deleting array at 'test.tdb'...
Creating array at 'test.tdb'...
OrderedDict([('end', array([10114, 10114, 10114, 10144, 10144, 10114, 10114, 10114],
dtype=int32)), ('ref', array([b'AACCCT', b'AACCCT', b'AACCCT', b'T', b'T', b'AACCCT', b'AACCCT',
b'AACCCT'], dtype=object)), ('GT', array([1, 1, 1, 1, 1, 1, 1, 1], dtype=int8)), ('GQ', array([79, 0, 60, 65, 22, 99, 26, 62], dtype=int32)), ('DP', array([12, 9, 39, 34, 35, 26, 9, 9], dtype=int32)), ('chrom', array([b'chr1', b'chr1', b'chr2', b'chr1', b'chr1', b'chr2', b'chr1',
b'chr1'], dtype=object)), ('log10_len', array([1, 1, 1, 0, 0, 1, 1, 1], dtype=int8)), ('start', array([10108, 10108, 10108, 10143, 10143, 10108, 10108, 10108],
dtype=int32)), ('alt', array([b'A', b'A', b'A', b'C', b'C', b'A', b'A', b'A'], dtype=object)), ('sample_id', array([b'A', b'B', b'C', b'B', b'A', b'D', b'E', b'F'], dtype=object))])
We will be benchmarking the code soon.