milvus-sdk-java

GetCollStatResponseWrapper randomly returns 0 size for collections in 2.3.x

Open r0x07k opened this issue 1 year ago • 4 comments

Hi,

The GetCollStatResponseWrapper randomly returns a zero row count for some collections. For others it still works ok, so it's unclear what the reason is.

For example, here is the collection in a format compatible with LangChain:

{'collection_name': 'test',
 'auto_id': False,
 'num_shards': 1,
 'description': '',
 'fields': [{'field_id': 100,
   'name': 'id',
   'description': '',
   'type': <DataType.VARCHAR: 21>,
   'params': {'max_length': 36},
   'is_primary': True},
  {'field_id': 101,
   'name': 'text',
   'description': '',
   'type': <DataType.VARCHAR: 21>,
   'params': {'max_length': 65535}},
  {'field_id': 102,
   'name': 'metadata',
   'description': '',
   'type': <DataType.JSON: 23>,
   'params': {}},
  {'field_id': 103,
   'name': 'vector',
   'description': '',
   'type': <DataType.FLOAT_VECTOR: 101>,
   'params': {'dim': 768}}],
 'aliases': [],
 'collection_id': 451819797554279738,
 'consistency_level': 0,
 'properties': {},
 'num_partitions': 1,
 'enable_dynamic_field': True}

The real row count:

[{'count(*)': 27}]

The Java code that returns 0:

R<GetCollectionStatisticsResponse> respCollectionStatistics = milvusClient.getCollectionStatistics(
    GetCollectionStatisticsParam.newBuilder()
      .withCollectionName(name)
      .build()
    );
GetCollStatResponseWrapper wrapperCollectionStatistics = new GetCollStatResponseWrapper(respCollectionStatistics.getData());
System.out.println(wrapperCollectionStatistics.getRowCount());

0

I use SDK 2.3.4 which is tied to LangChain4J.

r0x07k • Aug 13 '24 14:08

I tried to debug it further, and now I have two identical collections of size 27 (with different names), but wrapperCollectionStatistics returns 0 for one and the correct 27 for the other.

r0x07k • Aug 13 '24 18:08

MilvusClient.getCollectionStatistics() in the Java SDK is equivalent to Collection.num_entities in the Milvus Python SDK. This API returns a raw entity count, which it reads from Etcd by summing the row counts of all sealed segments.

As we know, when users call insert() to insert entities into a collection, the insert request is passed to Pulsar and consumed asynchronously by the querynode and datanode. The datanode accumulates entities in a memory buffer; once the buffer size exceeds a threshold, the datanode flushes the buffer into a sealed segment. A segment's row count is recorded in Etcd only after the sealed segment has been persisted.

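So the number returned by MilvusClient.getCollectionStatistics() is not accurate: rows that are still in the buffer are not counted, and a small collection whose rows have not been flushed yet can report 0. To get an accurate number, use "count(*)".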
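
The buffering behaviour described above can be illustrated with a toy model. This is only a sketch and not Milvus code: the class SegmentStatsModel, its methods, and the threshold value are all hypothetical, and the real datanode logic is more involved. It shows how two collections holding the same 27 rows can report different statistics depending on whether the buffer has been sealed.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only, not Milvus code: all names and the threshold are hypothetical.
class SegmentStatsModel {
    private final int flushThreshold;   // buffered rows above which a segment is sealed
    private int bufferedRows = 0;       // rows still held in the in-memory buffer
    private final List<Integer> sealedSegmentRows = new ArrayList<>(); // stands in for the Etcd metadata

    SegmentStatsModel(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // Buffer the rows and seal a segment once the buffer exceeds the threshold.
    void insert(int rows) {
        bufferedRows += rows;
        if (bufferedRows > flushThreshold) {
            seal();
        }
    }

    // Persist the buffer as a sealed segment and record its row count.
    void seal() {
        if (bufferedRows > 0) {
            sealedSegmentRows.add(bufferedRows);
            bufferedRows = 0;
        }
    }

    // Analogous to getCollectionStatistics(): only sealed segments are counted.
    long statisticsRowCount() {
        long total = 0;
        for (int rows : sealedSegmentRows) {
            total += rows;
        }
        return total;
    }

    // Analogous to an exact count: sealed rows plus rows still in the buffer.
    long exactRowCount() {
        return statisticsRowCount() + bufferedRows;
    }

    public static void main(String[] args) {
        SegmentStatsModel collection = new SegmentStatsModel(1000);
        collection.insert(27);
        System.out.println(collection.statisticsRowCount()); // prints 0: nothing sealed yet
        System.out.println(collection.exactRowCount());      // prints 27
    }
}
```

In this model, calling seal() on the same object makes statisticsRowCount() return 27 as well, which matches the symptom of one collection reporting 0 and an otherwise identical one reporting 27.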

Here is an example that gets the row count with MilvusClientV2. It is a query request, so use the ConsistencyLevel to control data visibility: "ConsistencyLevel.STRONG" means the query waits until all data has been consumed by the querynode. Note: data that is still in Pulsar cannot be queried.

        QueryResp queryResp = client.query(QueryReq.builder()
                .collectionName(collectionName)
                .filter("")
                .outputFields(Collections.singletonList("count(*)"))
                .consistencyLevel(ConsistencyLevel.STRONG)
                .build());
        List<QueryResp.QueryResult> queryResults = queryResp.getQueryResults();
        return (long)queryResults.get(0).getEntity().get("count(*)");
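
One detail of the snippet above: in Java, casting an Object to long with (long) only succeeds when the runtime value is a Long, and throws ClassCastException otherwise. A more defensive variant could convert through Number. The sketch below assumes the entity is exposed as a Map<String, Object> (an assumption about the SDK, inferred from the snippet) and works on plain maps so that it does not depend on the SDK itself; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Milvus SDK.
class CountResult {
    // Reads "count(*)" from the first result row and converts it to long,
    // whatever numeric type the value happens to have at runtime.
    static long extractCount(List<Map<String, Object>> rows) {
        if (rows == null || rows.isEmpty()) {
            throw new IllegalStateException("count(*) query returned no rows");
        }
        Object value = rows.get(0).get("count(*)");
        if (!(value instanceof Number)) {
            throw new IllegalStateException("unexpected count(*) value: " + value);
        }
        return ((Number) value).longValue();
    }
}
```

With the result objects from the snippet, each row's getEntity() map would be collected into the list that is passed to extractCount.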

yhmo • Aug 14 '24 11:08

Thank you, @yhmo. We’ll proceed with this approach.

Could you also let me know if there are any plans to deprecate MilvusClient.getCollectionStatistics()?

r0x07k • Aug 14 '24 13:08

getCollectionStatistics() is much faster than query("count(*)"): getCollectionStatistics() simply reads the number from Etcd, whereas query() requires the collection to be loaded and iterates over all the segments to sum up the count. Sometimes users only want a raw number and don't intend to load the collection, so I think getCollectionStatistics() should not be marked as deprecated.
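
The trade-off above could be captured in a small helper that picks one of the two calls. This is only a sketch: RowCounter and its parameters are hypothetical, and the two suppliers stand in for whatever code calls getCollectionStatistics() and query("count(*)") in your application.

```java
import java.util.function.LongSupplier;

// Hypothetical helper, not part of the Milvus SDK.
class RowCounter {
    // statistics: fast raw number, needs no load, may lag behind the real count.
    // countQuery: exact number, but only available for a loaded collection.
    static long rowCount(boolean exactNeeded, boolean collectionLoaded,
                         LongSupplier statistics, LongSupplier countQuery) {
        if (exactNeeded) {
            if (!collectionLoaded) {
                throw new IllegalStateException("count(*) requires the collection to be loaded");
            }
            return countQuery.getAsLong();
        }
        return statistics.getAsLong();
    }
}
```

A caller that only needs a rough size for display would pass exactNeeded = false and never trigger a load.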

In the Python SDK, Collection.num_entities is not deprecated either: https://github.com/milvus-io/pymilvus/blob/master/pymilvus/orm/collection.py#L265

yhmo • Aug 15 '24 02:08