[WIP]feat(rdma): add parallel memory region registration support
Summary
This PR introduces a configurable parallel memory region registration feature with significant performance improvements for pre-allocated memory scenarios, while maintaining backward compatibility.
I conducted several tests to validate performance (test code is also attached). perf.tar.gz
Test Configuration
- Memory Size: 500GB (DRAM)
- Test Scenarios:
  - Pre-allocated memory (memory allocated and initialized)
  - Non-pre-allocated memory (memory allocated but not initialized)
- Configuration Options:
  - With Optimization: `MC_DISABLE_PARALLEL_REG_MR` not set (parallel registration enabled)
  - Without Optimization: `MC_DISABLE_PARALLEL_REG_MR=1` (sequential registration)
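The two test scenarios differ only in whether the buffer's pages have been touched before registration. A minimal sketch of that pre-touch step is shown here; `touch_pages` is a hypothetical helper name, the 4096-byte fallback is an assumption, and the attached test code may initialize memory differently (for example with a full `memset`).

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>

// Hypothetical helper (not from the patch): write one byte per page so the
// kernel backs the whole buffer with physical pages before registration.
// Returns the number of pages touched.
std::size_t touch_pages(char* buf, std::size_t len) {
    assert(buf != nullptr || len == 0);
    const long ps = sysconf(_SC_PAGESIZE);
    // Assumed fallback if the page size cannot be queried.
    const std::size_t page = ps > 0 ? static_cast<std::size_t>(ps) : 4096;
    volatile char* p = buf;  // volatile keeps the stores from being elided
    std::size_t touched = 0;
    for (std::size_t off = 0; off < len; off += page) {
        p[off] = 0;  // the write fault forces allocation of this page
        ++touched;
    }
    return touched;
}
```

Touching one byte per page is enough to fault the page in, which keeps the pre-allocation cost proportional to the page count rather than the byte count.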
Performance Results
Pre-allocated Memory Scenario
| Operation | With Optimization (Parallel) | Without Optimization (Sequential) | Performance Improvement |
|---|---|---|---|
| allocate_memory | 3.557 seconds | 3.810 seconds | N/A |
| register_memory | 11.662 seconds | 49.198 seconds | 321.9% faster |
| unregister_memory | 1.459 seconds | 11.754 seconds | 705.4% faster |
| Total Time | 17.353 seconds | 65.399 seconds | 276.8% faster |
Non-pre-allocated Memory Scenario
| Operation | With Optimization (Parallel) | Without Optimization (Sequential) | Performance Improvement |
|---|---|---|---|
| allocate_memory | 0.000 seconds | 0.001 seconds | N/A |
| register_memory | 461.999 seconds | 86.930 seconds | 431.3% slower |
| unregister_memory | 1.822 seconds | 11.498 seconds | 531.0% faster |
| Total Time | 464.469 seconds | 98.945 seconds | 369.4% slower |
I had the AI summarize and analyze the test results. Below is the AI's output:
Key Performance Findings
Pre-allocated Memory (500GB)
- Memory registration: 4.2x faster (49.2s → 11.7s)
- Memory unregistration: 8.1x faster (11.8s → 1.5s)
- Total operation time: 3.8x faster (65.4s → 17.4s)
Non-pre-allocated Memory (500GB)
- Memory registration: 5.3x slower (87s → 462s)
- Memory unregistration: 6.3x faster (11.5s → 1.8s)
- Total operation time: 4.7x slower (99s → 464s)
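For reference, the multipliers in this summary and the percentages in the tables are two views of the same ratio. The sketch below shows one way to derive both from a pair of timings; the function names are illustrative and not part of the patch.

```cpp
#include <cassert>

// Ratio of sequential to parallel time; a value above 1 means the
// parallel path is faster.
double speedup(double sequential_s, double parallel_s) {
    assert(sequential_s >= 0.0 && parallel_s > 0.0);
    return sequential_s / parallel_s;
}

// "N% faster" in the tables' sense: time saved relative to the parallel run.
double percent_faster(double sequential_s, double parallel_s) {
    return (speedup(sequential_s, parallel_s) - 1.0) * 100.0;
}
```

With the pre-allocated `register_memory` timings, `speedup(49.198, 11.662)` is about 4.2 and `percent_faster(49.198, 11.662)` is about 321.9.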
Analysis
Pre-allocated memory benefits from parallel registration because:
- Pages are already faulted in and resident in physical memory, so registration only has to pin them
- Multiple RDMA contexts can be utilized simultaneously
- Better CPU core utilization
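The idea of using several RDMA contexts at once can be sketched as one worker thread per context. This is an illustration under assumptions, not the patch's implementation: `RegisterFn` and `register_on_all_contexts` are hypothetical names, and the callable stands in for whatever per-context registration call the transport makes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for a per-context registration call (hypothetical signature).
// Returns true on success.
using RegisterFn =
    std::function<bool(std::size_t ctx_index, void* addr, std::size_t len)>;

// Register the same buffer on every context, one thread per context.
// Returns true only if every registration succeeded.
bool register_on_all_contexts(std::size_t num_contexts, void* addr,
                              std::size_t len, const RegisterFn& reg) {
    assert(static_cast<bool>(reg));
    std::vector<char> ok(num_contexts, 0);  // one result slot per thread
    std::vector<std::thread> workers;
    workers.reserve(num_contexts);
    for (std::size_t i = 0; i < num_contexts; ++i) {
        workers.emplace_back([&ok, &reg, addr, len, i] {
            ok[i] = reg(i, addr, len) ? 1 : 0;
        });
    }
    for (auto& t : workers) t.join();  // wait for all registrations
    for (char v : ok) {
        if (!v) return false;
    }
    return true;
}
```

Each thread writes only its own slot in `ok`, so no locking is needed; the sequential path is the same loop with the callable invoked inline.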
Non-pre-allocated memory performs better with sequential registration because:
- Reduces memory paging overhead and I/O contention
- Avoids kernel-level resource conflicts during large allocations
Conclusion
The parallel memory registration optimization provides significant performance benefits for pre-allocated memory scenarios, with up to 8x improvement in unregistration performance. However, for large non-pre-allocated memory allocations, sequential registration performs better due to reduced resource contention and kernel overhead.
The `MC_DISABLE_PARALLEL_REG_MR` configuration option provides the flexibility to choose the optimal strategy based on the specific use case and memory allocation patterns of the application.
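A toggle like this is typically read once from the environment. The sketch below uses the `MC_DISABLE_PARALLEL_REG_MR` name from the test configuration; the helper names and the rule that an empty value or `0` counts as "not set" are assumptions, not the patch's actual parsing.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: true if the variable is set to anything other than
// the empty string or "0" (assumed parsing rule).
bool env_flag_set(const char* name) {
    assert(name != nullptr);
    const char* v = std::getenv(name);
    return v != nullptr && v[0] != '\0' && std::strcmp(v, "0") != 0;
}

// Parallel registration stays on unless the disable flag is set, matching
// the semantics listed under Test Configuration.
bool parallel_reg_enabled() {
    return !env_flag_set("MC_DISABLE_PARALLEL_REG_MR");
}
```

Running with `MC_DISABLE_PARALLEL_REG_MR=1` would then select the sequential path, and leaving the variable unset would select the parallel one.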
Related issue: #848
The zip file you provided seems to be empty?
Sry, I've updated the file. Please try again.
Hi @xiaguan, since this patch can cause a regression when the registered memory is not pre-allocated, I have disabled this optimization by default. BTW, my tests show that pre-allocating memory via touch-read does not eliminate the regression caused by parallel MR registration. The bottleneck appears to come from pinning memory. Could you further verify the effect of this patch when combined with the Mooncake Store?
Sure, I'll give it a try. I'll share the results later.
In the simple dual-NIC test setup, registration speed doesn't seem to improve with or without pre-allocation—there's actually a bit of regression.
Not sure how it performs on 8 NICs yet. I'll test it once I get access to such a machine. In the meantime, feel free to keep this PR open.
What is the size of the registered memory in your test?
(40GB, 4GB)
I think the size is not enough to show the improvements of this patch. Could we try a larger capacity, like 400GB?
8 NICs, 200GB
Without pre-allocation, default:
I0922 08:58:12.550942 32380 rdma_transport.cpp:143] Memory registration took 71332.3 ms
Without pre-allocation, with this PR:
I0922 08:56:06.719657 30195 rdma_transport.cpp:143] Memory registration took 420163 ms
Pre-allocated, default:
I0922 09:01:26.652885 33920 rdma_transport.cpp:143] Memory registration took 29864.5 ms
Pre-allocated, with this PR:
I0922 09:04:40.057971 34793 rdma_transport.cpp:143] Memory registration took 81101.3 ms
@staryxchen You need to fix the merge conflicts. In addition, maybe you can refactor `registerLocalMemoryBatch` as well, because it also performs batched registration.
This PR can stay pending for now, as it may conflict with other optimization approaches (such as pre-allocating memory).
@staryxchen JFYI, concurrent memory registration is an important optimization. CC @alogfans