
[Bug]: I do not get the rt result displayed in "Milvus 2.1 Benchmark Test Report"

Open dzqoo opened this issue 2 years ago • 31 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version:2.1.1
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2):pymilvus v2.1
- OS(Ubuntu or CentOS): centos
- CPU/Memory: 32 cores/756gb
- GPU: 
- Others:

Current Behavior

I have tested Milvus 2.1.0 on the SIFT1M dataset following the steps in "https://milvus.io/docs/v2.1.x/benchmark.md", and I got a different rt and a relatively low recall. Here are my results: (1) recall@10 ~94% (screenshot omitted); (2) rt avg ~40ms and p99 > 100ms (screenshot omitted).

ps: the querynodes' CPU and memory still have plenty of headroom (screenshots omitted).

Expected Behavior

rt (p99) < 60ms and recall@10 > 98%

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

dzqoo avatar Aug 18 '22 07:08 dzqoo

@dzqoo thank you for verifying the benchmark.

  1. I guess the recall difference is expected, as we used {M: 16, efConstruction: 500} in the recall benchmark, while {M: 8, efConstruction: 200} in the search benchmark.
  2. As for the latency, it could be affected by network latency or different hardware. Also, how did you measure the latency? There is a blog post that may help you run our benchmark quickly; please take a look: https://milvus.io/blog/2022-08-16-A-Quick-Guide-to-Benchmarking-Milvus-2-1.md

yanliang567 avatar Aug 18 '22 08:08 yanliang567

@dzqoo please feel free to share the network latency, deployment configuration, hardware info, concurrency level, and your test code (if possible) so we can help you reach the latency in the benchmark report.

/assign @dzqoo /unassign

yanliang567 avatar Aug 18 '22 08:08 yanliang567

So we split the 32-core machine into multiple querynodes, right?

I guess it might make more sense to use only one querynode, because for SIFT 1M there may be only one shard doing the work.

xiaofan-luan avatar Aug 18 '22 08:08 xiaofan-luan

@yanliang567 Thank you for answering.

  • The network latency is fine. The client and the server are on the same two machines, and the latency numbers come from Milvus metrics, so I don't think network latency can explain this;

  • Milvus is deployed with the default configuration; CPU and memory are not limited;

  • hardware info is shown below (screenshot omitted);

  • concurrent number is 400;

  • test code:

```python
def performance(client, collection_name, search_param):
    index_type = client.get_index_params(collection_name)
    if index_type:
        index_type = index_type[0]['index_type']
    else:
        index_type = 'FLAT'
    search_params = get_search_params(search_param, index_type)
    if not os.path.exists(PERFORMANCE_RESULTS_PATH):
        os.mkdir(PERFORMANCE_RESULTS_PATH)
    result_filename = collection_name + '_' + str(search_param) + '_performance.csv'
    performance_file = os.path.join(PERFORMANCE_RESULTS_PATH, result_filename)

    with open(performance_file, 'w+', encoding='utf-8') as f:
        f.write("nq,topk,total_time,avg_time" + '\n')
        for nq in NQ_SCOPE:
            query_list = get_nq_vec(nq)
            LOGGER.info(f"begin to search, nq = {len(query_list)}")
            for topk in TOPK_SCOPE:
                time_start = time.time()
                client.search_vectors(collection_name, query_list, topk, search_params)
                time_cost = time.time() - time_start
                print(nq, topk, time_cost)
                line = str(nq) + ',' + str(topk) + ',' + str(round(time_cost, 4)) + ',' + str(
                    round(time_cost / nq, 4)) + '\n'
                f.write(line)
            f.write('\n')
        LOGGER.info("search_vec_list done !")
```
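[Editor's note] The loop above records only a total time per batched call, so it cannot produce a p99 figure like the one in the report. A minimal, library-independent sketch of per-request timing with avg/p99 reporting (all names here are hypothetical; `search_fn` stands in for any single search call):

```python
import statistics
import time


def time_requests(search_fn, queries, runs=100):
    """Call search_fn once per run and report avg and p99 latency in ms."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        search_fn(queries)  # one request per timing sample, not one batch
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    # p99 = value below which 99% of the samples fall (index clamped to the list)
    p99 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.99))]
    return {"avg_ms": statistics.fmean(latencies), "p99_ms": p99}
```

Percentiles need individual samples; dividing a batch total by nq only yields an average and hides tail latency.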

dzqoo avatar Aug 18 '22 11:08 dzqoo

@dzqoo here are a few things our benchmark does differently from your test code; please update and retry:

  1. use an HNSW index with index params {M: 8, efConstruction: 200} and search param {ef: 64}
  2. use the Go SDK for concurrent search instead of the Python SDK
  3. configure one querynode with at least 12 cores instead of multiple querynodes

As I mentioned above, the blog offers some scripts that will help you run the benchmark. BTW, you can change the concurrent number in go_benchmark.py (you can find it in the blog).
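[Editor's note] For reference, setting up an HNSW index with those parameters via pymilvus might look like the sketch below. The collection name, field name, and host/port are assumptions, and pymilvus is imported lazily so the parameter dicts can be inspected without a running server:

```python
# Benchmark-report parameters: HNSW with M=8, efConstruction=200, search ef=64
HNSW_INDEX_PARAMS = {
    "index_type": "HNSW",
    "metric_type": "L2",
    "params": {"M": 8, "efConstruction": 200},
}
HNSW_SEARCH_PARAMS = {"metric_type": "L2", "params": {"ef": 64}}


def build_hnsw_index(collection_name, vector_field="embedding",
                     host="localhost", port="19530"):
    # Lazy import: the param dicts above stay usable without pymilvus installed
    from pymilvus import Collection, connections

    connections.connect(host=host, port=port)  # placeholder address
    coll = Collection(collection_name)
    coll.create_index(field_name=vector_field, index_params=HNSW_INDEX_PARAMS)
    return coll
```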

yanliang567 avatar Aug 18 '22 11:08 yanliang567

@yanliang567 I just ran the official code with all of the default configuration and got the following result:

```
[2022-08-18 19:44:48.818][ INFO] - Name # reqs # fails | Avg Min Max Median | req/s failures/s (benchmark_run.go:212:sample)
[2022-08-18 19:44:48.818][ INFO] - go search 3477 0(0.00%) | 57.469 30.851 512.164 56.580 | 173.85 0.00 (benchmark_run.go:213:sample)
[2022-08-18 19:45:08.818][ INFO] - Name # reqs # fails | Avg Min Max Median | req/s failures/s (benchmark_run.go:212:sample)
[2022-08-18 19:45:08.818][ INFO] - go search 6952 0(0.00%) | 57.518 34.053 237.674 57.297 | 173.77 0.00 (benchmark_run.go:213:sample)
[2022-08-18 19:45:28.818][ INFO] - Name # reqs # fails | Avg Min Max Median | req/s failures/s (benchmark_run.go:212:sample)
[2022-08-18 19:45:28.818][ INFO] - go search 10460 0(0.00%) | 57.000 34.097 252.618 56.458 | 175.43 0.00 (benchmark_run.go:213:sample)
[2022-08-18 19:45:28.818][DEBUG] - go search run finished, parallel: 10(benchmark_run.go:95:benchmark)
[2022-08-18 19:45:28.818][ INFO] - Name # reqs # fails | Avg Min Max Median | req/s failures/s (benchmark_run.go:159:samplingLoop)
[2022-08-18 19:45:28.818][ INFO] - go search 10470 0(0.00%) | 57.322 30.851 512.164 56.760 | 174.40 0.00
```

Which is not very good...

dzqoo avatar Aug 18 '22 11:08 dzqoo

@yanliang567 I reduced the querynode number from 8 to 2 and got this result (screenshot omitted). Here is my querynode CPU and memory config (screenshot omitted). I guess it is still far behind the official result.

dzqoo avatar Aug 18 '22 12:08 dzqoo

@dzqoo cool, you got the benchmark scripts. As I mentioned above, you can now increase concurrent_num in go_benchmark.py step by step (the default value in go_benchmark.py is 10) and find the best qps and latency at a certain concurrent_num. BTW, according to the report, the best concurrency is around 400. Enjoy~ :)

yanliang567 avatar Aug 19 '22 00:08 yanliang567

Please keep us posted when you get the best qps and latency. Also please share the Grafana metrics of Milvus if convenient. Thanks again. @dzqoo

yanliang567 avatar Aug 19 '22 01:08 yanliang567

@yanliang567 I wonder how many querynodes are configured in the official benchmark test and how the replicas are configured.

dzqoo avatar Aug 19 '22 01:08 dzqoo

@yanliang567 I wonder how many querynodes are configured in the official benchmark test and how the replicas are configured.

one querynode, one replica

yanliang567 avatar Aug 19 '22 01:08 yanliang567

This is amazing! My test has two querynodes on a 20-core host machine with one replica, but my test results really can't reach such good performance. I don't know why.

dzqoo avatar Aug 19 '22 02:08 dzqoo

I have another question. When I increase nq, the qps decreases significantly. Here are the results:

  • when nq = 1, I get qps = 124 (screenshot omitted)

  • when nq = 10, I get qps = 27.69 (screenshot omitted)

All other config stays the same. So what can I do when nq increases in a production scenario? @yanliang567 Looking forward to your reply.

dzqoo avatar Aug 19 '22 02:08 dzqoo

As there are only 1 million vectors, only 1 or 2 segments would be generated, so 2 querynodes may not give better performance. Try this:

  1. use one shard when creating the collection
  2. wait until the index build completes, then load the collection, to ensure all searches run against the index (try releasing and reloading the collection to ensure this)
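[Editor's note] The two steps above might look like the pymilvus sketch below. The schema, field names, and host/port are assumptions; imports are lazy so the file is importable without pymilvus or a running server:

```python
SHARDS_NUM = 1  # single shard, per the suggestion for the 1M dataset


def create_one_shard_collection(name, dim=128, host="localhost", port="19530"):
    # Lazy import: keeps this module importable without pymilvus installed
    from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                          connections)

    connections.connect(host=host, port=port)  # placeholder address
    fields = [
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=dim),
    ]
    return Collection(name, CollectionSchema(fields), shards_num=SHARDS_NUM)


def load_after_index_built(name):
    # Block until the index build finishes, then release and reload so
    # every search runs against the index rather than raw data
    from pymilvus import Collection, utility

    utility.wait_for_index_building_complete(name)
    coll = Collection(name)
    coll.release()
    coll.load()
    return coll
```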

yanliang567 avatar Aug 19 '22 02:08 yanliang567

@dzqoo increasing nq increases latency, and qps decreases as well; this is expected. In production, you can try multiple replicas to increase qps while keeping latency roughly stable.
BTW, I notice that the last result you pasted uses different search params, and all of these params impact performance:

  1. dim: 128->400
  2. dataset: 1m->10m
  3. metric_type: L2->IP
  4. ef: 64->256

yanliang567 avatar Aug 19 '22 02:08 yanliang567

@yanliang567 Yes, I have tested the difference on my dataset. I have noticed that my replica group count stays at 2 when I increase the replicas, and I wonder whether that is normal. Here is the output (screenshot omitted):

```python
def load_balance_in_one_group(self, collection_name):
    try:
        milvus_sys = MilvusSys()
        querynode_num = len(milvus_sys.query_nodes())
        if querynode_num < 3:
            LOGGER.warn("skip load balance for multi replicas case when querynode number less than 3")
            sys.exit(1)
        self.set_collection(collection_name=collection_name)
        res = self.collection.get_replicas()
        print(res.groups)
        group_nodes = []
        for g in res.groups:
            if len(g.group_nodes) >= 2:
                group_nodes = list(g.group_nodes)
                break
        res = utility.get_query_segment_info(collection_name=collection_name)
        print(group_nodes)
        print(res)
    except Exception as e:
        LOGGER.error("Failed to balance : {}".format(e))
        sys.exit(1)
```

dzqoo avatar Aug 19 '22 03:08 dzqoo

@dzqoo this is expected as you only have 2 querynodes. In short, replica_number <= querynode_number

yanliang567 avatar Aug 19 '22 04:08 yanliang567

@yanliang567 I increased the querynode number from 2 to 9, but the replica group count is still 2. Here is the result (screenshot omitted).

dzqoo avatar Aug 19 '22 06:08 dzqoo

How did you increase the replicas? Did you do it like this?

  1. collection.release()
  2. collection.load(replica_number=4)
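[Editor's note] The release-then-reload sequence might be wrapped up as below (a sketch only; the function name and host/port are assumptions, and pymilvus is imported lazily so no server is needed just to import this):

```python
def set_replica_number(collection_name, replica_number,
                       host="localhost", port="19530"):
    # Changing the replica count requires releasing the collection
    # and then loading it again with the new replica_number
    from pymilvus import Collection, connections

    connections.connect(host=host, port=port)  # placeholder address
    coll = Collection(collection_name)
    coll.release()
    coll.load(replica_number=replica_number)
    return coll
```

Note that replica_number must not exceed the number of available querynodes, per the earlier comment.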

yanliang567 avatar Aug 19 '22 06:08 yanliang567

@yanliang567 Yes, I did that. Here is the code:

```python
REPLICA_NUMBER = 5

def load_data(self, collection_name):
    # load data from disk into memory
    try:
        self.set_collection(collection_name)
        self.collection.load(replica_number=REPLICA_NUMBER)
    except Exception as e:
        LOGGER.error(f"Failed load data: {e}")
        sys.exit(1)
```

dzqoo avatar Aug 19 '22 07:08 dzqoo

@dzqoo I mean you have to release and then reload the collection to change the replica number. Did you release the collection?

yanliang567 avatar Aug 19 '22 07:08 yanliang567

@yanliang567 surely

dzqoo avatar Aug 19 '22 07:08 dzqoo

Mmm... then we have to look into the logs. @dzqoo could you please refer to this script to export the complete Milvus logs for investigation?

yanliang567 avatar Aug 19 '22 07:08 yanliang567

@yanliang567 Here are all the logs exported by the official script. Please have a look. Thank you~ logs.tar.gz

dzqoo avatar Aug 19 '22 08:08 dzqoo

/assign @jiaoew1991 /unassign

yanliang567 avatar Aug 19 '22 10:08 yanliang567

@dzqoo just curious, do you get the benchmark report's result if you use the same dataset and configuration as the benchmark report?

yanliang567 avatar Aug 19 '22 10:08 yanliang567

@yanliang567 I did that as well, but the result is still not that good. As for the unchanged replica groups, do you have any suggestions?

dzqoo avatar Aug 22 '22 01:08 dzqoo

hi @dzqoo, from querycoord's logs we can see that the replica number is 5; you can check it in this file: replica.log

@yanliang567 I increase querynode num from 2 to 9, the replicas group is still 2. Here is the result

Also, to answer: the [65, 56] that appears in the log is a node group, not a replica group, which means these two nodes combine to form one replica.

Also, from the log, the shard number is still 2, not 1.

jiaoew1991 avatar Aug 22 '22 02:08 jiaoew1991

@jiaoew1991 Emmm, I got it. Thank you for answering. I will change the shard number and retest.

dzqoo avatar Aug 22 '22 03:08 dzqoo

@jiaoew1991 Emmm, I got it. Thank you for answering. I will change the shard number and retest.

@dzqoo actually I don't think you have to change the shard number to 1, because you have more than 1 million vectors. We set shard=1 only when running perf on the 1 million dataset.

yanliang567 avatar Aug 22 '22 03:08 yanliang567