Mooncake [Performance]: Why does the bandwidth first increase and then decrease as the batch size increases?

Describe your performance question

As shown in the table, when running ./transfer_engine_bench on two H20 servers, the bandwidth initially increases and then decreases as the batch size increases.

The experimental details are as follows:

export MC_WORKERS_PER_CTX=1

export MC_MAX_WR=2048

./transfer_engine_bench --mode=target --metadata_server=etcd://xxx --local_server_name=abcd:12345 --device_name=mlx5_bond_1 --threads=8 --block_size=8192

./transfer_engine_bench --metadata_server=etcd://xxx --mode=initiator --segment_id=abcd:12345 --device_name=mlx5_bond_1 --threads=8 --batch_size=xx --block_size=8192

Before submitting a new issue...

[x] Make sure you already searched for relevant issues and read the documentation

Nov 22 '25 03:11 hzt123123

I think this is because the time complexity of the code below is O(n^2 * m), where n is the batch size. I would like to know why it is designed this way. From my understanding, the time complexity of polling the CQ should only be O(n * m).
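To make the complexity claim concrete, here is a minimal counting sketch. It is not Mooncake's code: the function names are hypothetical, and since the thread does not define m, it is assumed here to be a per-task cost factor such as the number of slices inspected per task. The first function models a caller that queries every task while each query rescans the whole batch; the second models a single pass over the batch.

```python
def checks_when_each_query_rescans(n, m):
    """Count slice checks when every per-task query walks the whole batch.

    n: number of tasks in the batch.
    m: assumed per-task cost factor (e.g. slices per task); hypothetical.
    """
    checks = 0
    for _queried_task in range(n):      # the caller asks about each task in turn
        for _scanned_task in range(n):  # each query walks every task again
            checks += m                 # and inspects m slices per task
    return checks                       # n * n * m


def checks_with_single_pass(n, m):
    """Count slice checks when the batch is scanned exactly once."""
    checks = 0
    for _task in range(n):
        checks += m
    return checks                       # n * m


if __name__ == "__main__":
    for n in (8, 64, 512):
        quadratic = checks_when_each_query_rescans(n, m=4)
        linear = checks_with_single_pass(n, m=4)
        print(n, quadratic, linear, quadratic // linear)
```

Under these assumptions the two counts differ by exactly a factor of n, which is the gap between O(n^2 * m) and O(n * m) raised above.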

Nov 23 '25 04:11 hzt123123

@hzt123123 We provide an API getTransferStatus(batch_id, status) to aggregate the result of all tasks.

In addition, as you can see in this code, we wait for all tasks in a batch to complete before submitting the next batch of tasks (rather than pipelining). An excessively large batch may therefore make the main thread wait longer.
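A toy cost model can illustrate how a rise-then-fall bandwidth curve could arise when each batch is fully drained before the next is submitted. Everything here is an assumption for illustration: the function name, the three timing constants, and in particular the quadratic polling term, which encodes the O(n^2) suspicion raised earlier in the thread and has not been confirmed against the benchmark. Only the 8192-byte block size comes from the commands in this issue.

```python
def batch_bandwidth(n, block_size=8192,
                    fixed_latency=20e-6,   # assumed per-batch submit/completion latency (s)
                    per_task_time=0.5e-6,  # assumed wire time per task (s)
                    poll_cost=2e-9):       # assumed cost per pairwise status check (s)
    """Bytes per second for one synchronous batch of n tasks (toy model)."""
    # No pipelining: the next batch starts only after this whole time elapses.
    batch_time = fixed_latency + n * per_task_time + poll_cost * n * n
    return n * block_size / batch_time


if __name__ == "__main__":
    for n in (1, 8, 64, 128, 512, 2048):
        print(f"batch_size={n:5d}  bandwidth={batch_bandwidth(n) / 1e9:.2f} GB/s")
```

In this model small batches are dominated by the fixed latency, so bandwidth grows with n. Once the quadratic term dominates, bandwidth falls again, and the peak sits near sqrt(fixed_latency / poll_cost). Measured values from transfer_engine_bench would be needed to check whether the real curve follows this shape.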

Nov 24 '25 01:11 alogfans

Why is getBatchTransferStatus included in the API getTransferStatus?

Nov 28 '25 06:11 hzt123123

> Why is getBatchTransferStatus included in the API getTransferStatus?

TE Transfer is a batch-based API, so we need to retrieve the status of the entire batch.

Dec 01 '25 05:12 stmatengss