TileDB-Py Indexing bugs

Hi @ihnorton, sorry to bother once more, but I think I found a couple of bugs in conjunction with indexing.

Setup:

# +
import json

import tiledb
import numpy as np
import pandas as pd
import random
# -

import json
test_df = pd.DataFrame.from_records(json.loads('{"chrom":{"0":"chr1","1":"chr1","2":"chr2","3":"chr2","4":"chr1","5":"chr1","8":"chr1","9":"chr1"},"log10_len":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":0,"9":0},"start":{"0":10108,"1":10108,"2":10108,"3":10108,"4":10108,"5":10108,"8":10143,"9":10143},"end":{"0":10114,"1":10114,"2":10114,"3":10114,"4":10114,"5":10114,"8":10144,"9":10144},"ref":{"0":"AACCCT","1":"AACCCT","2":"AACCCT","3":"AACCCT","4":"AACCCT","5":"AACCCT","8":"T","9":"T"},"alt":{"0":"A","1":"A","2":"A","3":"A","4":"A","5":"A","8":"C","9":"C"},"sample_id":{"0":"A","1":"B","2":"C","3":"D","4":"E","5":"F","8":"A","9":"B"},"GT":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":1,"9":1},"GQ":{"0":79,"2":60,"3":99,"4":26,"5":62,"8":22,"9":65},"DP":{"0":12,"1":9,"2":39,"3":26,"4":9,"5":9,"8":35,"9":34}}'))
test_df

output_path="test.tdb"

ctx = tiledb.default_ctx()
ctx

# +
genotype_domain = tiledb.Domain(
    tiledb.Dim(name="chrom", domain=(None,None), tile=1, dtype=np.bytes_, ctx=ctx),
    tiledb.Dim(name="log10_len", domain=(0, np.iinfo(np.int8).max), tile=1, dtype=np.int8, ctx=ctx),
    tiledb.Dim(name="start", domain=(0, np.iinfo(np.int32).max), tile=100000, dtype=np.int32, ctx=ctx),
    tiledb.Dim(name="alt", domain=(None,None), tile=None, dtype=np.bytes_, ctx=ctx),
#     tiledb.Dim(name="end", domain=(1, np.iinfo(np.int32).max), dtype=np.int32, ctx=ctx),
    tiledb.Dim(name="sample_id", domain=(None,None), tile=None, dtype=np.bytes_, ctx=ctx),
    ctx=ctx,
)

string_filters = tiledb.FilterList([tiledb.ZstdFilter(level=-1),])
int_filters = tiledb.FilterList([tiledb.ByteShuffleFilter(), tiledb.ZstdFilter(level=-1),])
attrs = [
    tiledb.Attr(name='end', dtype='int32', var=False, nullable=False, filters=int_filters),
    tiledb.Attr(name='ref', dtype='S', nullable=False, filters=string_filters),
    tiledb.Attr(name='GT', dtype='int8', var=False, nullable=False, filters=int_filters),
    tiledb.Attr(name='GQ', dtype='int32', var=False, nullable=True, filters=int_filters),
    tiledb.Attr(name='DP', dtype='int32', var=False, nullable=True, filters=int_filters),
]
# -

schema = tiledb.ArraySchema(
    domain=genotype_domain,
    attrs=attrs,
    sparse=True,
    cell_order="hilbert",
#     capacity=10000,
    ctx=ctx,
)
schema

if tiledb.array_exists(output_path):
    print("Deleting array at '%s'..." % output_path)
    import shutil
    shutil.rmtree(output_path)
print("Creating array at '%s'..." % output_path)
tiledb.array.SparseArray.create(output_path, schema, ctx=ctx)

tiledb.from_dataframe(
    output_path,
    test_df.astype({
        'end': 'int32',
        'ref': 'S',
        'GT': 'int8',
        'GQ': 'Int32',
        'DP': 'Int32',
    }),
    sparse=True,
    mode="append"
)

tiledb.open_dataframe(output_path, use_arrow=True)

A = tiledb.open(output_path, ctx=ctx, mode='r')
A

Now my trials:

# works
A.query(use_arrow=True, coords=True).df[:]

# works
A["chr2"]

# +
# kernel breaks with `realloc(): invalid pointer`
# A.df["chr2"]
# kernel breaks with `realloc(): invalid pointer`
# A.query(use_arrow=True, coords=True).df["chr2"]
# -

# empty result
A.multi_index["chr2"]

# works
A["chr1", 0]

# works
A.multi_index[:]

# empty result
A.multi_index[:, 0]

# empty result
A.multi_index["chr2", 0]

# ---------------------------------------------------------------------------
# IndexError                                Traceback (most recent call last)
# <ipython-input-22-20675d6deb0c> in <module>
#      12 
#      13 # IndexError: invalid index type: <class 'list'>
# ---> 14 A[[("chr1", 0), ("chr1", 1),]]
#
# tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.__getitem__()
#
# tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.subarray()
#
# tiledb/libtiledb.pyx in tiledb.libtiledb.index_domain_subarray()
#
# IndexError: invalid index type: <class 'list'>
A[[
    ("chr1", 0), 
    ("chr1", 1),
]]

Is there a way to keep the python kernel from crashing, even when having invalid input to tiledb?
How to do key-based indexing? (A[[("chr1", 0), ("chr1", 1),]])
How to do multi_index indexing with DataFrame as return type?

Jun 13 '21 21:06 Hoeze

Is there a way to keep the python kernel from crashing, even when having invalid input to tiledb?

kernel breaks with realloc(): invalid pointer

Not sure what is going on there, I will take a look and debug tomorrow.

How to do key-based indexing? (A[[("chr1", 0), ("chr1", 1),]])

I think what you want is: A.multi_index["chr1", [0,1]] which will select "chr1" on first dim, and select 0 and 1 on 2nd dim. [edit: use A.multi_index["chr1", [0,1], :, :, :]]

How to do multi_index indexing with DataFrame as return type?

A.df[<selection>] is for that -- same semantics as .multi_index (and shared implementation for the selection parsing).

Jun 14 '21 01:06 ihnorton

For these:

# empty result
A.multi_index[:, 0]

# empty result
A.multi_index["chr2", 0]

# empty result
A.multi_index["chr2"]

It is a bug; I think we have a fix in progress in libtiledb core, but I will see if we can apply the substance of the following work around automatically in TileDB-Py.

RIght now you can work around it like this (place-holder : for each non-indexed dimension):

In [16]: A.multi_index["chr2", :, :, :, :]
retries: 0
Out[16]:
OrderedDict([('chrom', array([b'chr2', b'chr2'], dtype=object)),
             ('log10_len', array([1, 1], dtype=int8)),
             ('start', array([10108, 10108], dtype=int32)),
             ('alt', array([b'A', b'A'], dtype=object)),
             ('sample_id', array([b'C', b'D'], dtype=object)),
             ('end', array([10114, 10114], dtype=int32)),
             ('ref', array([b'AACCCT', b'AACCCT'], dtype=object)),
             ('GT', array([1, 1], dtype=int8)),
             ('GQ', array([60, 99], dtype=int32)),
             ('DP', array([39, 26], dtype=int32))])

Also, this works for .df:

In [11]: A.query(use_arrow=True, coords=True).df["chr2", :, :, :,:]
retries: 0
Out[11]:
  chrom  log10_len  start alt sample_id    end        ref  GT  GQ  DP
0  chr2          1  10108   A         C  10114  b'AACCCT'   1  60  39
1  chr2          1  10108   A         D  10114  b'AACCCT'   1  99  26

and this (for my suggestion):

In [14]: A.df["chr1", [0,1], :, :, :]
retries: 0
Out[14]:
  chrom  log10_len  start alt sample_id    end        ref  GT    GQ  DP
0  chr1          1  10108   A         A  10114  b'AACCCT'   1  79.0  12
1  chr1          1  10108   A         B  10114  b'AACCCT'   1   NaN   9
2  chr1          0  10143   C         B  10144       b'T'   1  65.0  34
3  chr1          0  10143   C         A  10144       b'T'   1  22.0  35
4  chr1          1  10108   A         E  10114  b'AACCCT'   1  26.0   9
5  chr1          1  10108   A         F  10114  b'AACCCT'   1  62.0   9

and this one (avoids the crash):

In [15]: A.query(use_arrow=True, coords=True).df["chr2", :, :, :, :]
retries: 0
Out[15]:
  chrom  log10_len  start alt sample_id    end        ref  GT  GQ  DP
0  chr2          1  10108   A         C  10114  b'AACCCT'   1  60  39
1  chr2          1  10108   A         D  10114  b'AACCCT'   1  99  26

Jun 14 '21 01:06 ihnorton

Alright, thanks!

I also did another test for the key-value type access: grafik 30it/s is not too great, compared to RocksDB where we reach up to 40,000 it/s. Is there something I can do about?

Jun 14 '21 09:06 Hoeze

Hi @hoeze, are you able to share any more details about how you are using RocksDB? I’m not very familiar with it yet, and we’d like to dig in to the comparison a bit more.

Jun 14 '21 14:06 ihnorton

Hi @Hoeze, a couple more notes on key-value queries:

Not sure how you use RocksDB, but I would suggest you use a single string dimension in TileDB for such queries.
You may want to tweak the tile capacity and compression for key-value queries (e.g., shrink the tile size, use no compression for local deployments).
Issuing all key-value queries using multi_range will be significantly faster than single-point queries in a loop.
We have some upcoming optimizations in the read algorithm that will benefit significantly even key-value queries.
We have not optimized for (local) key-value performance yet, there are a couple of extra improvements we can do if this becomes critical (e.g., use mmap, optimize the tile format and read algorithm for key-value queries using hashing, etc.)

If you provided with more details, we could take a closer look and see if there are any low-hanging fruits for optimizations here.

Jun 14 '21 18:06 stavrospapadopoulos

Hi, I set up a RocksDB example for you. To run it, you need to download the database attached here.

from pathlib import Path
import rocksdb


class VariantDB:

    def __init__(self, path, rocksdb_options=None):
        if rocksdb_options is None:
            rocksdb_options = rocksdb.Options(
                create_if_missing=True,
                max_open_files=100,
            )

        self.db = rocksdb.DB(
            path,
            rocksdb_options,
            read_only=True
        )

    @staticmethod
    def _variant_to_byte(variant):
        return bytes(str(variant), 'utf-8')

    def _type(self, value):
        raise NotImplementedError()

    def _get(self, variant):
        if not variant.startswith('chr'):
            variant = 'chr%s' % variant
        return self.db.get(self._variant_to_byte(variant))

    def __getitem__(self, variant):
        maf = self._get(variant)
        if maf:
            return self._type(maf)
        else:
            raise KeyError('This variant is not in the db')

    def __contains__(self, variant):
        return self._get(variant) is not None

    def get(self, variant, default=None):
        try:
            return self[variant]
        except KeyError:
            return default


class VariantMafDB(VariantDB):

    def _type(self, value):
        return float(value)


mafdb_path = "maf.db"
mafdb = VariantMafDB(mafdb_path)
variants = [
        '1:10468:T>TAA',
        '1:10468:TCGCGG>T',
        "10:107494853:C>A",
        "10:107494857:C>A",
        "10:107494858:T>C",
        "10:107494873:C>T",
        "10:107494874:G>A",
        "10:107494905:GAGAA>G",
        "10:107494908:A>G",
        "10:107494929:T>C",
        "10:107494933:T>C",
        "10:107494935:G>A",
        "10:107494937:C>G",
        "10:107494941:CTTG>C",
        "10:107494942:T>A",
        "10:107494943:T>C",
        "10:107494960:G>T",
        "10:107494964:C>A",
        "10:107494979:G>A",
        "10:10749497:A>G",
        "10:107494988:T>C",
        "10:107494989:C>T",
        "10:10749498:C>T",
        "10:107494998:T>C",
        "10:10749499:G>A",
        "10:107495002:T>C",
    ]

Benchmark:

print(len(variants))
# 26
print([mafdb.get(v, 0) for v in variants])
# [0, 0, 0.00149653, 3.18451e-05, 3.18492e-05, 0.000223029, 0.014212, 3.18471e-05, 3.18573e-05, 3.18451e-05, 3.18634e-05, 3.18573e-05, 6.37349e-05, 3.18552e-05, 3.18431e-05, 0.000127502, 0.00350877, 3.18573e-05, 3.18471e-05, 3.18431e-05, 3.18492e-05, 0.0316323, 0.00012747, 3.18431e-05, 6.37105e-05, 3.18451e-05]
%timeit [mafdb.get(v, 0) for v in variants]
# 273 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This example database is only 3.6GB of size, but we are working with database >100GB of size.
```
1. Not sure how you use RocksDB, but I would suggest you use a single string dimension in TileDB for such queries.
```
This RocksDB is very simple: its key is a string and the value a float. However, this also means that RocksDB has no idea about the data structure.
- Using the variant schema from my first post (without sample_id, log10_len and only a single attribute), TileDB should have a great chance to beat RocksDB since ordererd requests are always hitting the same fragment.
- Also, keeping the different dimensions allows for range queries ("get all variants on chr1", etc.)

2. You may want to tweak the tile capacity and compression for key-value queries (e.g., shrink the tile size, use no compression for local deployments).

Thanks for the hint. I tried without any compression filters but I'm still reaching only ~50it/s.

```
3. Issuing all key-value queries using `multi_range` will be significantly faster than single-point queries in a loop.
```
Yes, but I think this is only possible with a single key, right? I.e. querying at once [("chr1", 0, 10108, "A"), ("chr1", 1, 10143, "C"),] will not work I believe?

Having a very fast TileDB solution with a single String dimension would already be a great improvement to us because:

RocksDB is completely schema-free
RocksDB does not support parallel writes

However, I believe that TileDB should also be able to improve on RocksDB speed when it is provided with some well-defined dimensions.

I hope my benchmark is of some use for you :)

Jun 15 '21 22:06 Hoeze

Thanks for for the additional information @Hoeze, this is very valuable! We do have a lot of ideas on how to boost performance for this use case, as it is much simpler than the range queries we are currently performing and our current algorithms are an overkill. We will hopefully push them to the next couple of releases. Thanks again!

Jun 16 '21 19:06 stavrospapadopoulos

The bugs listed have been fixed as of 0.11.0 (switching out tiledb.from_dataframe for tiledb.from_pandas).

(tiledb-3.9) vivian@mangonada:~/TileDB-Py/tiledb$ python ~/tiledb-bugs/indexing_bugs.py
Deleting array at 'test.tdb'...
Creating array at 'test.tdb'...
OrderedDict([('end', array([10114, 10114, 10114, 10144, 10144, 10114, 10114, 10114],
      dtype=int32)), ('ref', array([b'AACCCT', b'AACCCT', b'AACCCT', b'T', b'T', b'AACCCT', b'AACCCT',
       b'AACCCT'], dtype=object)), ('GT', array([1, 1, 1, 1, 1, 1, 1, 1], dtype=int8)), ('GQ', array([79,  0, 60, 65, 22, 99, 26, 62], dtype=int32)), ('DP', array([12,  9, 39, 34, 35, 26,  9,  9], dtype=int32)), ('chrom', array([b'chr1', b'chr1', b'chr2', b'chr1', b'chr1', b'chr2', b'chr1',
       b'chr1'], dtype=object)), ('log10_len', array([1, 1, 1, 0, 0, 1, 1, 1], dtype=int8)), ('start', array([10108, 10108, 10108, 10143, 10143, 10108, 10108, 10108],
      dtype=int32)), ('alt', array([b'A', b'A', b'A', b'C', b'C', b'A', b'A', b'A'], dtype=object)), ('sample_id', array([b'A', b'B', b'C', b'B', b'A', b'D', b'E', b'F'], dtype=object))])

We will be benchmarking the code soon.

Nov 17 '21 14:11 nguyenv

Indexing bugs

kernel breaks with realloc(): invalid pointer

kernel breaks with `realloc(): invalid pointer`