
Individual shard_key when adding points

raulcarlomagno opened this issue 1 year ago • 4 comments

When uploading data to Qdrant using custom sharding on a collection, all the available functions (upsert(), upload_records(), upload_collection(), upload_points(), etc.) apply shard_key_selector to the whole request, so before uploading I have to batch the data by custom shard key myself; a sketch of that workaround follows.
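A minimal sketch of that workaround, with made-up records and a collection name ("my_collection", "office_a", etc. are not from this issue): group the points by shard key first, then issue one upsert() per group with a request-level shard_key_selector.

import uuid
from collections import defaultdict

from qdrant_client import QdrantClient, models

client = QdrantClient("localhost", port=6333)

# Toy records: (shard_key, vector, payload); in this issue's setting they
# arrive unordered from a mixed stream.
records = [
    ("office_a", [0.1, 0.2, 0.3], {"tmview_id": 1}),
    ("office_b", [0.4, 0.5, 0.6], {"tmview_id": 2}),
    ("office_a", [0.7, 0.8, 0.9], {"tmview_id": 3}),
]

# Group by shard key first, because shard_key_selector applies to the
# whole request, not to individual points.
groups = defaultdict(list)
for shard_key, vector, payload in records:
    groups[shard_key].append((vector, payload))

for shard_key, items in groups.items():
    client.upsert(
        collection_name="my_collection",
        points=[
            models.PointStruct(id=str(uuid.uuid4()), vector=vector, payload=payload)
            for vector, payload in items
        ],
        shard_key_selector=shard_key,  # one selector per whole request
    )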

I think it would be more useful to set the custom shard key (aka shard_key_selector) at the point level, not globally.

Like this:

qdrant_client.upsert(
    collection_name=COLLECTION_NAME,
    points=qdrant_models.Batch(
        ids=[str(uuid.uuid4()) for _ in range(len(batch_keys))],
        payloads=[dict(office=office_key, tmview_id=batch_key) for batch_key in batch_keys],
        vectors=[features_by_office[office_key][batch_key] for batch_key in batch_keys],
        shards_keys=[custom_shard_key for custom_shard_key in blablalblab]  # this would be useful
    ),
    # shard_key_selector=office_key  # not this
)

Also, ids should be optional: if not set, the Qdrant server side should generate them. Otherwise I am creating lots of random ids on the client side and sending them over, which inflates the size of the final payload.

raulcarlomagno · Jan 23 '24

This could be a great feature for custom sharding.

raulcarlomagno · Feb 06 '24

Hi @raulcarlomagno,

One of our concerns is that if we support providing a shard key per point, it will be easier to shoot yourself in the foot in terms of performance.

E.g. with shard_keys=["a", "b", "c", "d", "a", "b", "c", "d"] and a batch size of 2, we would need to send 8 requests, one per point.

It also makes the batching mechanism much more complex, and would probably lead to performance degradation.
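A rough illustration of that worry, assuming batches are cut from the stream in order and each request can target only one shard key:

# joein's example made concrete: consecutive batches of size 2 over
# alternating shard keys each split into one request per distinct key.
shard_keys = ["a", "b", "c", "d", "a", "b", "c", "d"]
batch_size = 2

requests = 0
for i in range(0, len(shard_keys), batch_size):
    batch = shard_keys[i : i + batch_size]
    requests += len(set(batch))  # one request per distinct key in the batch

print(requests)  # 8 -> one request per point in the worst case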

joein · Feb 06 '24

Just in case: you don't need to construct the final batches; as you've said, you only need to split the data by shard keys.

So it is one call to upload_collection per shard key; the final batching is still handled internally.
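A minimal sketch of that pattern, with made-up keys and vectors, assuming the data has already been split by shard key (the shard_key_selector parameter on upload_collection is taken from the issue text above):

from qdrant_client import QdrantClient

client = QdrantClient("localhost", port=6333)

# Data already split by shard key (made-up values).
vectors_by_key = {
    "office_a": [[0.1, 0.2, 0.3], [0.7, 0.8, 0.9]],
    "office_b": [[0.4, 0.5, 0.6]],
}
payloads_by_key = {
    "office_a": [{"tmview_id": 1}, {"tmview_id": 3}],
    "office_b": [{"tmview_id": 2}],
}

for shard_key, vectors in vectors_by_key.items():
    client.upload_collection(
        collection_name="my_collection",
        vectors=vectors,
        payload=payloads_by_key[shard_key],
        shard_key_selector=shard_key,  # the whole call targets one shard key
        batch_size=64,                 # chunking into requests stays internal
    )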

Regarding the situation with ids: it is done this way to keep the API idempotent from the server side (client-supplied ids mean a retried upload overwrites the same points instead of duplicating them).
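One generic way to square client-side ids with idempotency, not something prescribed in this thread: derive each id deterministically from the record content, so a retried upload maps to the same point.

import uuid

# Deterministic ids: the same record always yields the same UUID, so a
# retried upload overwrites the point instead of creating a duplicate.
# The namespace value is arbitrary but must stay fixed across runs.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")

def point_id(office: str, tmview_id: int) -> str:
    return str(uuid.uuid5(NAMESPACE, f"{office}:{tmview_id}"))

print(point_id("office_a", 1))  # stable across runs and retries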

joein · Feb 07 '24

The use case: I have a lot of data (60 million records with more than 300 dimensions, unordered, not previously sharded) and I can't split it beforehand. I read it as a stream with mixed shard keys, so when processing one batch I might have 2 vectors for one shard key and 1500 for another. I therefore have to send very small batches for the rare shard keys, instead of sending everything to the Qdrant server and letting Qdrant split the points across the shards.
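A sketch of that pain point, with made-up key names and counts: buffering the mixed stream per shard key leaves rare keys stuck in small, inefficient batches.

from collections import defaultdict

BATCH_SIZE = 1000
buffers = defaultdict(list)

def flush(shard_key, points):
    # stand-in for an upsert with shard_key_selector=shard_key
    print(f"flushing {len(points)} points for shard key {shard_key!r}")

# A skewed stream: one shard key dominates, another is rare
# (interleaving omitted for brevity; only the counts matter here).
stream = [("big_office", i) for i in range(1500)]
stream += [("small_office", 0), ("small_office", 1)]

for shard_key, point in stream:
    buffers[shard_key].append(point)
    if len(buffers[shard_key]) >= BATCH_SIZE:
        flush(shard_key, buffers[shard_key])
        buffers[shard_key].clear()

# Final drain: rare keys end up as tiny batches.
for shard_key, points in buffers.items():
    if points:
        flush(shard_key, points)
# -> big_office flushes 1000, then 500; small_office flushes just 2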

raulcarlomagno · Feb 07 '24