TileDB-Py
Documentation about multi_index and query
I can't find any mention of multi_index or the query() method in the official docs. I've been using multi_index, but it outputs a lot more information than I need (positions in the array as well as the values themselves). Is there a parameter to output just a list of results containing only the values, following the order of the slices? And what is the purpose of .query: is there any more to it than just another way to read results instead of using A[:]?
Is there any efficiency benefit to using multi_index vs looping through intervals using query or the normal [] read mode?
Hi @michael-imbeault, I will be taking a pass through the API docs this week to add some missing items, as well as fix a rendering issue preventing some docstrings from displaying. We also have documentation of multi_index specifically at https://docs.tiledb.com/main/api-usage/reading-arrays/multi-range-subarrays
Here is a summary for multi_index and query:
multi_index:
- supports multiple sub-range queries per dimension and returns the cross-product of the specified ranges. Here is an example from the doc link above:
# slice subarrays [1,2]x[1,4] and [4,4]x[1,4]
A.multi_index[ [slice(1,2), 4], 1:4 ]
- to expand on this: multi_index accepts a range (start:end), slice(start,end), or a list of slice objects or scalar indices. For example:
A.multi_index[ [slice(1,2), 4], [slice(3,4), slice(5,6), 8] ]
- multi_index operates over the full, inclusive domain of the array:
  - results are endpoint inclusive, like TileDB core and unlike standard Python slicing (TileDB arrays may be defined with dimensions that have arbitrary float or int start/end points, and multi_index makes it possible to query such intervals)
  - this also means that there is no wrap-around for negative indexes to access the "last" element in the array
- multi_index returns result coordinates for all dimensions, as separate named arrays (corresponding to the Dimension name)
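To make the inclusive cross-product semantics above concrete, here is a small pure-Python sketch that enumerates which cells a selection such as [slice(1,2), 4] x 1:4 would cover. It does not call TileDB at all: `inclusive_cells` is a hypothetical helper written only for illustration, and it models inclusive ranges as (start, end) tuples on integer dimensions.

```python
from itertools import product


def inclusive_cells(*dims):
    """Enumerate the cells covered by a cross product of inclusive integer ranges.

    Each argument describes one dimension as a list whose items are either a
    scalar index or an inclusive (start, end) tuple. This is an illustrative
    model of the semantics only, not part of TileDB-Py.
    """
    per_dim = []
    for selection in dims:
        coords = []
        for item in selection:
            if isinstance(item, tuple):
                start, end = item
                # end + 1 because the end point is included, unlike Python slicing
                coords.extend(range(start, end + 1))
            else:
                coords.append(item)
        per_dim.append(coords)
    return list(product(*per_dim))


# Models A.multi_index[ [slice(1,2), 4], 1:4 ]: rows 1, 2 and 4, columns 1 through 4
cells = inclusive_cells([(1, 2), 4], [(1, 4)])
print(len(cells))  # 3 rows x 4 columns
print(cells[:4])   # the first row: columns 1 to 4, end point included
```

Note that row 3 is absent from the result, and that column 4 is present even though a plain Python slice 1:4 would stop at 3.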
.query:
- the main purpose is to allow sub-selection on attributes, by passing a list of attributes and only querying those attributes. For example, this query will only return values for a and b, excluding any other attributes:
A.query(attrs=['a','b']).multi_index[...]
Is there any efficiency benefit to using multi_index vs looping through intervals using query or the normal [] read mode?
For large multi-ranged queries, there can be a significant benefit to using multi_index, because TileDB is designed to efficiently fulfill such a query even for a very large number of ranges (parallelizing operations across multiple threads; storing range bounding boxes for tiles to optimize retrieval; selectively decompressing tiles; and other optimizations).
There can be an efficiency benefit to using .query if you know that some attribute results will not be needed, because core TileDB will not retrieve data for those attributes at all, reducing I/O and memory usage.
OK, that's helpful. I did find https://docs.tiledb.com/main/api-usage/reading-arrays/multi-range-subarrays but it's a little barebones at the moment, and there is no mention of either multi_index or query in https://tiledb-inc-tiledb-py.readthedocs-hosted.com/en/stable/python-api.html.
I'll be using multi_index. My initial expectation was that it would return a list of numpy arrays corresponding to the slices, not a dict with a single array encompassing all the slices that I have to parse using the coordinate arrays. Are there plans to include a simple, already parsed output? The current way makes sense for sparse arrays but seems suboptimal for dense arrays: creating those (potentially very large) coord arrays and keeping them in memory seems wasteful for some use cases.
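For reference, the parsing step I mean looks roughly like the sketch below for a one-dimensional case. The `result` dict is a hand-written stand-in for what multi_index returns (one flat value array plus a coordinate array), the names x and a are made up, and `split_by_ranges` is a hypothetical helper, not something TileDB-Py provides.

```python
import numpy as np


def split_by_ranges(values, coords, ranges):
    """Split a flat value array into one array per inclusive (start, end) range."""
    values = np.asarray(values)
    coords = np.asarray(coords)
    # A boolean mask per range; both end points are included
    return [values[(coords >= start) & (coords <= end)] for start, end in ranges]


# Stand-in for a multi_index result on a 1-D array with dimension "x" and attribute "a"
result = {
    "x": np.array([1, 2, 4, 7, 8, 9]),
    "a": np.array([10, 20, 40, 70, 80, 90]),
}

# The same inclusive ranges that were passed to the query
ranges = [(1, 2), (4, 4), (7, 9)]
parts = split_by_ranges(result["a"], result["x"], ranges)
for (start, end), part in zip(ranges, parts):
    print(start, end, part)
```

This works, but it scans the full coordinate array once per range, which is part of why it feels wasteful for dense data.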