Mooncake [WIP]feat(rdma): add parallel memory region registration support

Summary

This PR introduces a configurable parallel memory region registration feature with significant performance improvements for pre-allocated memory scenarios, while maintaining backward compatibility.

I conducted several tests to validate performance (test code is also attached). perf.tar.gz

Test Configuration

Memory Size: 500GB (DRAM)
Test Scenarios:
- Pre-allocated memory (memory allocated and initialized)
- Non-pre-allocated memory (memory allocated but not initialized)
Configuration Options:
- With Optimization: MC_DISABLE_PARALLEL_REG_MR not set (parallel registration enabled)
- Without Optimization: MC_DISABLE_PARALLEL_REG_MR=1 (sequential registration)
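The difference between the two test scenarios can be sketched as follows. This is a minimal illustration, not the attached test code: `allocate` and its `prefault` flag are hypothetical names, and it assumes that "initialized" means every page has been written once before registration starts.

```python
import mmap


def allocate(size_bytes, prefault):
    """Return (buffer, pages_touched) for an anonymous mapping.

    prefault=True  -> "pre-allocated": one byte per page is written, so the
                      kernel has to back every page before registration.
    prefault=False -> "non-pre-allocated": the mapping exists, but no page
                      has been touched yet.
    """
    buf = mmap.mmap(-1, size_bytes)  # anonymous mapping, not backed by a file
    touched = 0
    if prefault:
        for offset in range(0, size_bytes, mmap.PAGESIZE):
            buf[offset] = 0  # a write to each page forces it to be faulted in
            touched += 1
    return buf, touched


if __name__ == "__main__":
    size = 64 * mmap.PAGESIZE  # tiny stand-in for the 500GB used in the tests
    _, touched = allocate(size, prefault=True)
    print(f"touched {touched} pages of {mmap.PAGESIZE} bytes")
```

In the non-pre-allocated case the cost of faulting pages in is deferred, which would be consistent with allocate_memory taking almost no time in that scenario and the page-fault cost showing up during registration instead.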

Performance Results

Pre-allocated Memory Scenario

Operation	With Optimization (Parallel)	Without Optimization (Sequential)	Performance Improvement
allocate_memory	3.557 seconds	3.810 seconds	N/A
register_memory	11.662 seconds	49.198 seconds	321.9% faster
unregister_memory	1.459 seconds	11.754 seconds	705.4% faster
Total Time	17.353 seconds	65.399 seconds	276.8% faster

Non-pre-allocated Memory Scenario

Operation	With Optimization (Parallel)	Without Optimization (Sequential)	Performance Improvement
allocate_memory	0.000 seconds	0.001 seconds	N/A
register_memory	461.999 seconds	86.930 seconds	431.3% slower
unregister_memory	1.822 seconds	11.498 seconds	531.0% faster
Total Time	464.469 seconds	98.945 seconds	369.4% slower

I had the AI summarize and analyze the test results. Below is the AI's output:

Key Performance Findings

Pre-allocated Memory (500GB)

Memory registration: 4.2x faster (49.2s → 11.7s)
Memory unregistration: 8.1x faster (11.8s → 1.5s)
Total operation time: 3.8x faster (65.4s → 17.4s)

Non-pre-allocated Memory (500GB)

Memory registration: 5.3x slower (87s → 462s)
Memory unregistration: 6.3x faster (11.5s → 1.8s)
Total operation time: 4.7x slower (99s → 464s)
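The "x faster" multipliers and the "% faster" percentages are two views of the same ratio: a run that is N times faster is (N - 1) x 100 percent faster. The sketch below recomputes both from the timings in the tables; the function names are illustrative and not part of the PR.

```python
def speedup(sequential_s, parallel_s):
    """How many times faster the parallel run is (values below 1 mean slower)."""
    return sequential_s / parallel_s


def percent_faster(sequential_s, parallel_s):
    """The same ratio expressed as a percentage improvement."""
    return (speedup(sequential_s, parallel_s) - 1.0) * 100.0


# (operation, parallel seconds, sequential seconds), copied from the
# pre-allocated results table.
PRE_ALLOCATED = [
    ("register_memory", 11.662, 49.198),
    ("unregister_memory", 1.459, 11.754),
    ("Total Time", 17.353, 65.399),
]

if __name__ == "__main__":
    for name, par, seq in PRE_ALLOCATED:
        print(f"{name}: {speedup(seq, par):.1f}x, {percent_faster(seq, par):.1f}% faster")
```

For a slowdown, swapping the arguments gives the "x slower" figure, for example `speedup(461.999, 86.930)` for non-pre-allocated registration.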

Analysis

Pre-allocated memory benefits from parallel registration because:

Pages are already faulted in and resident in physical memory, so registration only has to pin them
Multiple RDMA contexts can be utilized simultaneously
Better CPU core utilization
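The idea of registering one region on several RDMA contexts at once can be sketched as below. This is a toy model under stated assumptions, not the PR's implementation: `register_on_context` is a hypothetical stand-in for a per-device call such as `ibv_reg_mr`, and `parallel_enabled` assumes that only the value `1` disables the optimization, since the thread does not say how other values are handled.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def register_on_context(context_id, region):
    """Hypothetical stand-in for registering `region` on one RDMA context."""
    addr, length = region
    return {"context": context_id, "addr": addr, "length": length}


def parallel_enabled(env=None):
    # Assumption: only MC_DISABLE_PARALLEL_REG_MR=1 turns parallel mode off,
    # matching the two cases listed in the test configuration.
    env = os.environ if env is None else env
    return env.get("MC_DISABLE_PARALLEL_REG_MR") != "1"


def register_region(region, context_ids, parallel):
    """Register one region on every context, one after another or concurrently."""
    if not parallel or len(context_ids) < 2:
        return [register_on_context(c, region) for c in context_ids]
    with ThreadPoolExecutor(max_workers=len(context_ids)) as pool:
        # pool.map preserves input order, so results line up with context_ids.
        return list(pool.map(lambda c: register_on_context(c, region), context_ids))


if __name__ == "__main__":
    region = (0x1000, 4096)  # (address, length) of a made-up buffer
    print(register_region(region, [0, 1], parallel_enabled()))
```

Both modes return the same handles in the same order; only the scheduling differs. Whether the concurrent path is actually faster depends on where the per-context call spends its time, which is what the measurements in this thread are probing.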

Non-pre-allocated memory performs better with sequential registration because:

It reduces memory paging overhead and I/O contention
It avoids kernel-level resource conflicts during large allocations

Conclusion

The parallel memory registration optimization provides significant performance benefits for pre-allocated memory scenarios, with up to 8x improvement in unregistration performance. However, for large non-pre-allocated memory allocations, sequential registration performs better due to reduced resource contention and kernel overhead.

The MC_PARALLEL_REG_MR configuration option provides the flexibility to choose the optimal strategy based on the specific use case and memory allocation patterns of the application.

Sep 17 '25 13:09 staryxchen

Related issue: #848

Sep 17 '25 13:09 staryxchen

The zip file you provided seems to be empty?

Sep 18 '25 02:09 xiaguan

Sorry, I've updated the file. Please try again.

Sep 18 '25 02:09 staryxchen

Hi @xiaguan, since this patch can regress performance when the registered memory is not pre-allocated, I have disabled this optimization by default. BTW, my tests show that pre-allocating memory via touch-read does not eliminate the regression caused by parallel MR registration. The bottleneck appears to stem from pinning memory. Could you further verify the effect of this patch when combined with the Mooncake Store?

Sep 19 '25 07:09 staryxchen

Sure, I'll give it a try. I'll share the results later.

Sep 22 '25 02:09 xiaguan

In the simple dual-NIC test setup, registration speed doesn't seem to improve with or without pre-allocation—there's actually a bit of regression.

Not sure how it performs on 8 NICs yet. I'll test it once I get access to such a machine. In the meantime, feel free to keep this PR open.

Sep 22 '25 08:09 xiaguan

What is the size of the registered memory in your test?

Sep 22 '25 08:09 staryxchen

(40GB, 4GB)

Sep 22 '25 08:09 xiaguan

I think the size is not enough to show the improvements of this patch. Could we try a larger capacity, like 400GB?

Sep 22 '25 08:09 staryxchen

8 NICs, 200GB.

Without pre-allocation, default:

I0922 08:58:12.550942 32380 rdma_transport.cpp:143] Memory registration took 71332.3 ms

Without pre-allocation, with this PR:

I0922 08:56:06.719657 30195 rdma_transport.cpp:143] Memory registration took 420163 ms

With pre-allocation, default:

I0922 09:01:26.652885 33920 rdma_transport.cpp:143] Memory registration took 29864.5 ms

With pre-allocation, with this PR:

I0922 09:04:40.057971 34793 rdma_transport.cpp:143] Memory registration took 81101.3 ms
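To put the four log lines above on the same footing as the earlier tables, the durations can be pulled out and compared directly. This is a small helper sketch; the regular expression only assumes the "Memory registration took ... ms" wording visible in these logs.

```python
import re

LOG_PATTERN = re.compile(r"Memory registration took ([0-9.]+) ms")


def registration_ms(log_line):
    """Extract the registration time in milliseconds from one log line."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        raise ValueError("no registration time found in log line")
    return float(match.group(1))


def slowdown(default_line, patched_line):
    """Ratio of the patched run's registration time to the default run's."""
    return registration_ms(patched_line) / registration_ms(default_line)


# Log lines copied from the comment above.
NO_PREALLOC_DEFAULT = "I0922 08:58:12.550942 32380 rdma_transport.cpp:143] Memory registration took 71332.3 ms"
NO_PREALLOC_PR = "I0922 08:56:06.719657 30195 rdma_transport.cpp:143] Memory registration took 420163 ms"
PREALLOC_DEFAULT = "I0922 09:01:26.652885 33920 rdma_transport.cpp:143] Memory registration took 29864.5 ms"
PREALLOC_PR = "I0922 09:04:40.057971 34793 rdma_transport.cpp:143] Memory registration took 81101.3 ms"

if __name__ == "__main__":
    print(f"without pre-alloc: {slowdown(NO_PREALLOC_DEFAULT, NO_PREALLOC_PR):.1f}x")
    print(f"with pre-alloc:    {slowdown(PREALLOC_DEFAULT, PREALLOC_PR):.1f}x")
```

On these numbers, registration with the PR takes roughly 5.9x as long without pre-allocation and roughly 2.7x as long with it, so this 8-NIC, 200GB run shows a regression in both cases.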

Sep 22 '25 09:09 xiaguan

@staryxchen You need to fix the merge conflicts. In addition, maybe you can refactor registerLocalMemoryBatch as well, because it also performs batched registration.

Nov 11 '25 02:11 alogfans

This PR can stay on hold for now, as it may conflict with other optimization approaches (such as pre-allocating memory).

Nov 11 '25 02:11 staryxchen

@staryxchen JFYI, concurrent memory registration is an important optimization. CC @alogfans

Nov 30 '25 16:11 stmatengss