milvus
[Bug]: I do not get the rt result displayed in "Milvus 2.1 Benchmark Test Report"
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: 2.1.1
- Deployment mode (standalone or cluster): cluster
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus v2.1
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory: 32 cores / 756 GB
- GPU:
- Others:
Current Behavior
I have tested Milvus 2.1.0 on the SIFT1M dataset following the steps here: "https://milvus.io/docs/v2.1.x/benchmark.md", and I got a different rt and a relatively low recall.
Here are my results:
(1) recall
recall@10 ~ 94%
(2) rt
avg ~ 40 ms and p99 > 100 ms
PS: the querynode's CPU and memory are still far from exhausted.
Expected Behavior
rt(p99) < 60 ms and recall@10 > 98%
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response
@dzqoo thank you for verifying the benchmark.
- I guess the recall is expected, as we used {M: 16, efConstruction: 500} in the recall benchmark, while {M: 8, efConstruction: 200} in the search benchmark.
- About the latency: it could be impacted by network latency or different hardware. Also, how did you test the latency? There is a blog that may help you run our benchmark quickly, please take a look: https://milvus.io/blog/2022-08-16-A-Quick-Guide-to-Benchmarking-Milvus-2-1.md
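On measuring latency: the avg/p99 numbers in the report are per-request statistics, so the client should record one timing sample per search call rather than timing a whole batch. A minimal sketch of that bookkeeping (the `search` callable here is a stand-in for a real SDK call such as a pymilvus search; the helper name and parameters are hypothetical):

```python
import time

def measure_latency(search, queries, warmup=10):
    """Time each search call individually and return summary stats in ms.

    `search` is a stand-in for an SDK search call; here it can be any
    callable that takes one query batch.
    """
    for q in queries[:warmup]:          # warm up caches before measuring
        search(q)
    samples = []
    for q in queries:
        t0 = time.perf_counter()
        search(q)
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    # p99: the sample below which 99% of requests fall
    p99 = samples[min(len(samples) - 1, int(len(samples) * 0.99))]
    return {"avg": sum(samples) / len(samples), "p99": p99}
```

With per-request samples like these, the avg/p99 figures become directly comparable to the report's table.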
@dzqoo please feel free to share with us the network latency, deployment configuration, hardware info, concurrency number, and your test code (if possible) to help you reach the latency in the benchmark report.
/assign @dzqoo /unassign
So we split the 32-core machine into multiple querynodes, right?
I guess it might make more sense to use only one querynode, because for SIFT 1M there may be only one shard working.
@yanliang567 Thank you for answering.
- The network latency is OK. The client and server are on the same two machines, and the latency is taken from Milvus metrics, so I guess network latency cannot affect it;
- Milvus is deployed with the default configuration; CPU and memory are not limited;
- Hardware info is shown in the following;
- Concurrency number is 400;
- Test code:

```python
import os
import time

def performance(client, collection_name, search_param):
    index_type = client.get_index_params(collection_name)
    if index_type:
        index_type = index_type[0]['index_type']
    else:
        index_type = 'FLAT'
    search_params = get_search_params(search_param, index_type)
    if not os.path.exists(PERFORMANCE_RESULTS_PATH):
        os.mkdir(PERFORMANCE_RESULTS_PATH)
    result_filename = collection_name + '_' + str(search_param) + '_performance.csv'
    performance_file = os.path.join(PERFORMANCE_RESULTS_PATH, result_filename)

    with open(performance_file, 'w+', encoding='utf-8') as f:
        f.write("nq,topk,total_time,avg_time" + '\n')
        for nq in NQ_SCOPE:
            query_list = get_nq_vec(nq)
            LOGGER.info(f"begin to search, nq = {len(query_list)}")
            for topk in TOPK_SCOPE:
                time_start = time.time()
                client.search_vectors(collection_name, query_list, topk, search_params)
                time_cost = time.time() - time_start
                print(nq, topk, time_cost)
                line = str(nq) + ',' + str(topk) + ',' + str(round(time_cost, 4)) + ',' + str(
                    round(time_cost / nq, 4)) + '\n'
                f.write(line)
        f.write('\n')
        LOGGER.info("search_vec_list done !")
```
@dzqoo here are some things where our benchmark differs from your test code; please update and retry:
- use hnsw index with index param {M: 8, efConstruction: 200}, search param {ef:64}
- use the Go SDK to do concurrent search instead of the Python SDK
- configure one querynode with at least 12 cores instead of multiple querynodes

As I mentioned above, the blog offers some scripts that would help you run the benchmark quickly. BTW, you can update the concurrency number in go_benchmark.py (find it in the blog).
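For context on why the concurrent-search point matters: timing sequential calls measures latency at concurrency 1, while the report's qps comes from hundreds of concurrent clients. The Go tool from the blog is the recommended way, but the idea can be sketched in Python with a stubbed search call (`do_search` is hypothetical, and real Python threads are further limited by the GIL, so treat this only as an illustration):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def concurrent_qps(do_search, concurrency, total_requests):
    """Fire `total_requests` searches across `concurrency` workers
    and return the achieved requests per second.

    `do_search` is a stub standing in for a real SDK search call.
    """
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(do_search) for _ in range(total_requests)]
        for f in futures:
            f.result()  # propagate any search errors
    elapsed = time.perf_counter() - t0
    return total_requests / elapsed
```

Sweeping `concurrency` upward (as go_benchmark.py does with its concurrent_num) is how the best qps/latency trade-off point is found.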
@yanliang567 I just ran the official code and got the following result; all config is the official default.

```
[2022-08-18 19:44:48.818][ INFO] - Name # reqs # fails | Avg Min Max Median | req/s failures/s (benchmark_run.go:212:sample)
[2022-08-18 19:44:48.818][ INFO] - go search 3477 0(0.00%) | 57.469 30.851 512.164 56.580 | 173.85 0.00 (benchmark_run.go:213:sample)
[2022-08-18 19:45:08.818][ INFO] - Name # reqs # fails | Avg Min Max Median | req/s failures/s (benchmark_run.go:212:sample)
[2022-08-18 19:45:08.818][ INFO] - go search 6952 0(0.00%) | 57.518 34.053 237.674 57.297 | 173.77 0.00 (benchmark_run.go:213:sample)
[2022-08-18 19:45:28.818][ INFO] - Name # reqs # fails | Avg Min Max Median | req/s failures/s (benchmark_run.go:212:sample)
[2022-08-18 19:45:28.818][ INFO] - go search 10460 0(0.00%) | 57.000 34.097 252.618 56.458 | 175.43 0.00 (benchmark_run.go:213:sample)
[2022-08-18 19:45:28.818][ DEBUG] - go search run finished, parallel: 10(benchmark_run.go:95:benchmark)
[2022-08-18 19:45:28.818][ INFO] - Name # reqs # fails | Avg Min Max Median | req/s failures/s (benchmark_run.go:159:samplingLoop)
[2022-08-18 19:45:28.818][ INFO] - go search 10470 0(0.00%) | 57.322 30.851 512.164 56.760 | 174.40 0.00
```

which is not very good...
@yanliang567
I reduced the querynode number from 8 to 2 and got this result.
Here is my querynode CPU and memory config:
I guess it is still far behind the official result.
@dzqoo cool, you got the benchmark scripts. As I mentioned above, you can now increase the concurrent_num in go_benchmark.py step by step (the default value in go_benchmark.py is 10) and find the best qps and latency at a certain concurrent_num. BTW, according to the report, the best concurrency is about 400. Enjoy~ :)
Please keep us posted when you get the best qps and latency. Also please share the Grafana metrics of Milvus if convenient. Thanks again. @dzqoo
@yanliang567 I wonder how many querynodes are configured in the official benchmark test, and how the replicas are configured.
one querynode, one replica
This is amazing! My test has two querynodes on a 20-core host machine and one replica, but my test results really cannot reach such good performance. I don't know why.
I have another question. When I increase nq, the qps decreases significantly. Here are the results:
- when nq = 1, I got this -> qps = 124
- when nq = 10, I got this -> qps = 27.69

All other configs are kept the same. So what can I do when I increase nq in a production scenario? @yanliang567 Looking forward to your reply.
As there are only 1 million vectors, only 1 or 2 segments would be generated, so 2 querynodes may not give better performance. Try this:
- use one shard when creating the collection
- wait until index building is completed, and then load the collection, to ensure all searches run on the index (try releasing and reloading the collection to ensure this)
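The "wait until index building is completed" step can be implemented as a polling loop. A generic sketch, where `check` is assumed to wrap an SDK status query (for example, comparing indexed rows to total rows in the index-building progress; the helper name is hypothetical):

```python
import time

def wait_until(check, timeout_s=600, poll_s=2.0):
    """Poll check() until it returns True; return False on timeout.

    check() would typically wrap an SDK status call, e.g. a lambda that
    returns True once indexed_rows == total_rows for the collection.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(poll_s)
    return False
```

Calling the collection load only after such a wait returns True ensures searches never fall back to brute-force scans over unindexed segments.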
@dzqoo increasing nq causes latency to increase, and qps decreases as well; this is expected. In production, you can try multiple replicas to increase qps while keeping latency roughly stable.
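One way to read the qps numbers above: request qps drops with larger nq, but vector throughput (nq × qps) actually rises, which is why batching can still pay off when total throughput is the goal. A quick check using the figures from this thread:

```python
def vector_throughput(nq, qps):
    """Vectors searched per second: batch size times request rate."""
    return nq * qps

# Figures reported earlier in this thread:
print(vector_throughput(1, 124))     # nq=1  -> 124 vectors/s
print(vector_throughput(10, 27.69))  # nq=10 -> ~277 vectors/s
```

So batching to nq=10 more than doubles vectors searched per second even though per-request qps falls.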
BTW, I noticed that the last result you pasted has different search params, and all of these params impact performance:
- dim: 128->400
- dataset: 1m->10m
- metric_type: L2->IP
- ef: 64->256
@yanliang567 Yes, I have tested the difference on my dataset.
I noticed that my replica group number stays at 2 when I increase the replicas. I wonder whether that's normal?
Here is the output:
```python
def load_balance_in_one_group(self, collection_name):
    try:
        milvus_sys = MilvusSys()
        querynode_num = len(milvus_sys.query_nodes())
        if querynode_num < 3:
            LOGGER.warn("skip load balance for multi replicas case when querynode number less than 3")
            sys.exit(1)
        self.set_collection(collection_name=collection_name)
        res = self.collection.get_replicas()
        print(res.groups)
        group_nodes = []
        for g in res.groups:
            if len(g.group_nodes) >= 2:
                group_nodes = list(g.group_nodes)
                break
        res = utility.get_query_segment_info(collection_name=collection_name)
        print(group_nodes)
        print(res)
    except Exception as e:
        LOGGER.error("Failed to balance : {}".format(e))
        sys.exit(1)
```
@dzqoo this is expected as you only have 2 querynodes. In short, replica_number <= querynode_number
@yanliang567
I increased the querynode number from 2 to 9, but the replica group count is still 2. Here is the result.
How did you increase the replicas? Did you do it like this?
- collection.release()
- collection.load(replica_number=4)
@yanliang567 Yes, I did this. Here is the code:
```python
REPLICA_NUMBER = 5

def load_data(self, collection_name):
    # load data from disk to
    try:
        self.set_collection(collection_name)
        self.collection.load(replica_number=REPLICA_NUMBER)
    except Exception as e:
        LOGGER.error(f"Failed load data: {e}")
        sys.exit(1)
```
@dzqoo I mean you have to release and reload the collection to change the replica number. Did you release the collection?
@yanliang567 surely
Mmm... then we have to look into the logs. @dzqoo Could you please refer to this script to export the full Milvus logs for investigation?
@yanliang567 Here are all the logs exported by the official script. Please have a look. Thank you~ logs.tar.gz
/assign @jiaoew1991 /unassign
@dzqoo just curious, did you get the benchmark report's results when using the same dataset and configuration as the benchmark report?
@yanliang567 I did this also, but the result is still not that good. As for the unchanged replica groups, do you have any suggestions?
Hi @dzqoo, from querycoord's logs we can find the replica number is 5; you can check it in this file: replica.log
> @yanliang567 I increase querynode num from 2 to 9, the replicas group is still 2. Here is the result
Also, to answer your question: the [65, 56] that appears in the log is a node group, not a replica group, which means these two nodes combine to form one replica.
Also, from the log, the shard number is still 2, not 1.
@jiaoew1991 Emmm, I got it. Thank you for answering. I will change the shard number and retest.
@dzqoo actually I don't think you have to change the shard number to 1, because you have more than 1 million vectors. We set shard = 1 only when running perf on the 1 million dataset.