
efficient iteration over key-value store

Open fcollman opened this issue 9 months ago • 5 comments

If I want to iterate over all the values in a large key-value store, is there an established pattern for doing that in tensorstore? My immediate application is iterating over large-scale precomputed annotations. I thought that passing a KeyRange to the .list method would enable chunking and therefore provide a route toward an efficient iterator; however, it takes 5 minutes to run over a large dataset even with a pretty restrictive range (see example below).

import tensorstore as ts
import numpy as np
import os

source = "gs://neuroglancer-20191211_fafbv14_buhmann2019_li20190805"
ts_spec = {
    "driver": "json",
    "kvstore": os.path.join(source, "info"),
}
info_ts = ts.open(ts_spec).result()
info = info_ts.read().result().item()
by_id_info = info["by_id"]
ts_spec = {
    "base": os.path.join(source, by_id_info["key"])
}
if "sharding" in by_id_info:
    ts_spec["driver"] = "neuroglancer_uint64_sharded"
    ts_spec["metadata"] = by_id_info["sharding"]
else:
    ts_spec["driver"] = "neuroglancer_precomputed"

ts_by_id = ts.KvStore.open(ts_spec).result()

# Encode the numeric IDs as 8-byte big-endian keys.
start_bytes = np.ascontiguousarray(17317160 - 5000, dtype=">u8").tobytes()
end_bytes = np.ascontiguousarray(17317160 + 5000, dtype=">u8").tobytes()

key_range = ts.KvStore.KeyRange(inclusive_min=start_bytes, exclusive_max=end_bytes)
keys = ts_by_id.list(range=key_range).result()

fcollman avatar Mar 30 '25 20:03 fcollman

Currently listing for neuroglancer_uint64_sharded is not optimized --- internally it is only used for unit tests so it just reads all of the keys (which requires reading all minishard indices) and then filters based on the range.

It could be optimized in certain cases, e.g. if the key range is guaranteed to be within a single minishard, or within a single shard.

More generally, our list API could use improvement --- for example, support for efficiently partitioning ranges so they can be listed in parallel. For neuroglancer_uint64_sharded, though, even though partitioning by shard and minishard is always possible, those partitions don't necessarily correspond to lexicographic key ranges, so they would be tricky to represent.
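In the meantime, because fixed-width big-endian encoding preserves numeric order lexicographically, a caller can partition a numeric ID range client-side and list the resulting sub-ranges concurrently. A minimal sketch (`partition_id_range` is a hypothetical helper, not a tensorstore API):

```python
import struct


def partition_id_range(start_id, end_id, num_parts):
    """Split the numeric ID range [start_id, end_id) into num_parts
    sub-ranges and encode each boundary as an 8-byte big-endian key.
    Since big-endian fixed-width encoding preserves numeric order
    lexicographically, each pair is a valid lexicographic key range."""
    bounds = [
        start_id + (end_id - start_id) * i // num_parts
        for i in range(num_parts + 1)
    ]
    return [
        (struct.pack(">Q", lo), struct.pack(">Q", hi))
        for lo, hi in zip(bounds, bounds[1:])
    ]


# Partition the 10000-ID window from the original example into 4 ranges.
ranges = partition_id_range(17317160 - 5000, 17317160 + 5000, 4)
```

Each `(inclusive_min, exclusive_max)` pair could then be wrapped in a `ts.KvStore.KeyRange` and listed in a separate call. Note this only parallelizes the requests; as described above, it does not avoid the underlying cost of reading all minishard indices for this driver.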

Note: the `ts_spec['driver'] = 'neuroglancer_precomputed'` line (not used) is not correct.

jbms avatar Apr 02 '25 20:04 jbms

Ah yes, I realized that after posting this and corrected it.

fcollman avatar Apr 02 '25 21:04 fcollman

I think I would be happy with just about any iterator API, so if for neuroglancer_uint64_sharded it ended up returning things in "all IDs in a minishard, all minishards in a shard, all shards in the dataset" style order, that would be fine.

fcollman avatar Apr 02 '25 21:04 fcollman

Perhaps related: If I want to write hundreds of millions of precomputed annotations, it would be helpful to use multiple transactions (one per shard). But predicting which annotation goes to which shard is not trivial (depending on the shard spec), so at the moment I just write them all in one big transaction. That increases my RAM requirements considerably (hundreds of GB).

Is there a (Python) function in tensorstore that can accept a sharding spec and a list of keys, and then tell me which shard each key would be grouped into? Then I could plan my transactions accordingly.
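As far as I can tell, no such helper is exposed, but the shard assignment rule from the neuroglancer precomputed sharded format spec can be computed client-side. Here is a sketch for the `"identity"` hash case only (the spec also allows `"murmurhash3_x86_128"`, which this does not implement; `shard_number` is a made-up name):

```python
def shard_number(chunk_id, preshift_bits, minishard_bits, shard_bits):
    """Predict (shard, minishard) for a uint64 chunk ID under the
    neuroglancer precomputed sharded format, assuming hash == "identity".
    Per the spec: the ID is right-shifted by preshift_bits, then hashed;
    the low minishard_bits of the hash select the minishard and the next
    shard_bits select the shard."""
    hashed = chunk_id >> preshift_bits  # identity hash after the pre-shift
    minishard = hashed & ((1 << minishard_bits) - 1)
    shard = (hashed >> minishard_bits) & ((1 << shard_bits) - 1)
    return shard, minishard


# Example: with 3 minishard bits and 4 shard bits, ID 86 (0b1010110)
# has minishard 0b110 = 6 and shard 0b1010 = 10.
result = shard_number(86, preshift_bits=0, minishard_bits=3, shard_bits=4)
```

With something like this, you could bucket your keys by predicted shard and open one transaction per bucket instead of one giant transaction.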

stuarteberg avatar Aug 01 '25 14:08 stuarteberg

I have use cases similar to those described above:

I'd like to be able to pull data iterating over shards or minishards as @fcollman said (and I don't care too much about which order, whether shard or minishard, etc). I'm not sure how to accomplish this with the current API.

And, ideally I'd also like to be able to predict which of my keys will be in each shard as @stuarteberg said. Would it be possible to expose the hashing function a KvStore uses to the Python API, based on all of the spec parameters?

bdpedigo avatar Aug 28 '25 20:08 bdpedigo