[WIP]feat(rdma): add parallel memory region registration support

Open staryxchen opened this issue 3 months ago • 13 comments

Summary

This PR introduces a configurable parallel memory region registration feature with significant performance improvements for pre-allocated memory scenarios, while maintaining backward compatibility.

I conducted several tests to validate performance (test code is also attached). perf.tar.gz

Test Configuration

  • Memory Size: 500GB (DRAM)
  • Test Scenarios:
    • Pre-allocated memory (memory allocated and initialized)
    • Non-pre-allocated memory (memory allocated but not initialized)
  • Configuration Options:
    • With Optimization: MC_DISABLE_PARALLEL_REG_MR not set (parallel registration enabled)
    • Without Optimization: MC_DISABLE_PARALLEL_REG_MR=1 (sequential registration)
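
For illustration, the difference between the two test scenarios can be sketched roughly as below. This is not the attached test code: `allocateBuffer` and `freeBuffer` are hypothetical names, and writing one byte per page is only one possible way to make the kernel back the buffer with physical pages before registration.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>

// Allocate `bytes` of anonymous memory (Linux/POSIX). With `prefault == true`,
// write one byte per page so every page is backed by physical memory before
// any registration happens (the "pre-allocated" scenario). With
// `prefault == false` the pages are only reserved and get faulted in later
// (the "non-pre-allocated" scenario).
inline void* allocateBuffer(std::size_t bytes, bool prefault) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    if (prefault) {
        const std::size_t page =
            static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
        volatile char* c = static_cast<volatile char*>(p);
        for (std::size_t off = 0; off < bytes; off += page) c[off] = 0;
    }
    return p;
}

// Release a buffer obtained from allocateBuffer().
inline void freeBuffer(void* p, std::size_t bytes) { munmap(p, bytes); }
```

In the pre-allocated case the page-fault cost is paid in `allocate_memory`; in the other case it is deferred until something (here, registration) first touches the pages.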

Performance Results

Pre-allocated Memory Scenario

| Operation | With Optimization (Parallel) | Without Optimization (Sequential) | Performance Improvement |
| --- | --- | --- | --- |
| allocate_memory | 3.557 seconds | 3.810 seconds | N/A |
| register_memory | 11.662 seconds | 49.198 seconds | 321.9% faster |
| unregister_memory | 1.459 seconds | 11.754 seconds | 705.4% faster |
| Total Time | 17.353 seconds | 65.399 seconds | 276.8% faster |

Non-pre-allocated Memory Scenario

| Operation | With Optimization (Parallel) | Without Optimization (Sequential) | Performance Improvement |
| --- | --- | --- | --- |
| allocate_memory | 0.000 seconds | 0.001 seconds | N/A |
| register_memory | 461.999 seconds | 86.930 seconds | 431.3% slower |
| unregister_memory | 1.822 seconds | 11.498 seconds | 531.0% faster |
| Total Time | 464.469 seconds | 98.945 seconds | 369.4% slower |
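
The percentages in the two tables appear to follow the usual convention of comparing the sequential time against the parallel time. A minimal sketch of that arithmetic, with hypothetical helper names:

```cpp
#include <cassert>

// Speedup factor of the parallel path relative to the sequential path.
// A value above 1 means parallel is faster, below 1 means it is slower.
inline double speedup(double sequential_s, double parallel_s) {
    return sequential_s / parallel_s;
}

// "N% faster" as a percentage: (speedup - 1) * 100.
// A negative result means the parallel path is slower.
inline double percentFaster(double sequential_s, double parallel_s) {
    return (speedup(sequential_s, parallel_s) - 1.0) * 100.0;
}
```

With the pre-allocated `register_memory` timings, `speedup(49.198, 11.662)` is about 4.22 and `percentFaster` is about 321.9. With the non-pre-allocated timings, `speedup(86.930, 461.999)` is about 0.19, i.e. the parallel path is roughly 5.3x slower.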

I had the AI summarize and analyze the test results. Below is the AI's output:

Key Performance Findings

Pre-allocated Memory (500GB)

  • Memory registration: 4.2x faster (49.2s → 11.7s)
  • Memory unregistration: 8.1x faster (11.8s → 1.5s)
  • Total operation time: 3.8x faster (65.4s → 17.4s)

Non-pre-allocated Memory (500GB)

  • Memory registration: 5.3x slower (87s → 462s)
  • Memory unregistration: 6.3x faster (11.5s → 1.8s)
  • Total operation time: 4.7x slower (99s → 464s)

Analysis

Pre-allocated memory benefits from parallel registration because:

  • Pages are already backed by physical memory, so registration does not have to fault them in
  • Multiple RDMA contexts can be utilized simultaneously
  • Better CPU core utilization
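
One plausible reading of "multiple RDMA contexts utilized simultaneously" is that the same buffer is registered with each device context from its own task instead of one after another. The sketch below shows that shape only. `RegisterFn` and `registerOnAllContexts` are hypothetical names, the callback stands in for a real per-device call such as a wrapper around `ibv_reg_mr`, and the actual patch may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Hypothetical stand-in for a per-device registration call (for example a
// wrapper around ibv_reg_mr on one protection domain). Returns 0 on success.
using RegisterFn =
    std::function<int(int context_id, void* addr, std::size_t len)>;

// Register the same buffer with every context. When `parallel` is true each
// context gets its own asynchronous task; otherwise contexts are visited one
// by one. Returns the number of contexts whose registration failed.
inline int registerOnAllContexts(int num_contexts, void* addr, std::size_t len,
                                 const RegisterFn& reg, bool parallel) {
    int failures = 0;
    if (!parallel) {
        for (int i = 0; i < num_contexts; ++i)
            if (reg(i, addr, len) != 0) ++failures;
        return failures;
    }
    std::vector<std::future<int>> tasks;
    tasks.reserve(static_cast<std::size_t>(num_contexts));
    for (int i = 0; i < num_contexts; ++i)
        tasks.push_back(std::async(std::launch::async, reg, i, addr, len));
    // Wait for every task so that no registration is left running on return.
    for (auto& t : tasks)
        if (t.get() != 0) ++failures;
    return failures;
}
```

Keeping the sequential branch in the same function makes it easy to switch strategies at runtime, which matters here because the measurements show the parallel path is not always the faster one.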

Non-pre-allocated memory performs better with sequential registration because:

  • Reduces memory paging overhead and I/O contention
  • Avoids kernel-level resource conflicts during large allocations

Conclusion

The parallel memory registration optimization provides significant performance benefits for pre-allocated memory scenarios, with up to 8x improvement in unregistration performance. However, for large non-pre-allocated memory allocations, sequential registration performs better due to reduced resource contention and kernel overhead.

The MC_PARALLEL_REG_MR configuration option provides the flexibility to choose the optimal strategy based on the specific use case and memory allocation patterns of the application.
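
A minimal sketch of how such an environment switch could be read at startup. The description spells the variable two ways (`MC_DISABLE_PARALLEL_REG_MR` in the test configuration, `MC_PARALLEL_REG_MR` here); the sketch uses the former because its semantics are stated there. The helper names and the rule for what counts as "set" are assumptions, not taken from the patch.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// True when the environment variable `name` is set to a non-empty value other
// than "0". This truthiness rule is an assumption for illustration.
inline bool envFlagSet(const char* name) {
    const char* v = std::getenv(name);
    return v != nullptr && v[0] != '\0' && std::strcmp(v, "0") != 0;
}

// Parallel registration is used unless the disable flag is set, mirroring the
// "Configuration Options" list in the test setup.
inline bool useParallelRegistration() {
    return !envFlagSet("MC_DISABLE_PARALLEL_REG_MR");
}
```

Usage would be along the lines of `MC_DISABLE_PARALLEL_REG_MR=1 ./your_benchmark` to force the sequential path, where `./your_benchmark` is a placeholder binary name.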

staryxchen, Sep 17 '25 13:09

Related issue: #848

staryxchen, Sep 17 '25 13:09

The zip file you provided seems to be empty?

xiaguan, Sep 18 '25 02:09

Sorry, I've updated the file. Please try again.

staryxchen, Sep 18 '25 02:09

Hi @xiaguan, since this patch can hurt performance when the registered memory is not pre-allocated, I have disabled the optimization by default. BTW, my tests show that pre-allocating memory via touch-read does not eliminate the regression caused by parallel MR registration. The bottleneck appears to be pinning the memory. Could you further verify the effect of this patch when combined with the Mooncake Store?

staryxchen, Sep 19 '25 07:09

Sure, I'll give it a try. I'll share the results later.

xiaguan, Sep 22 '25 02:09

In the simple dual-NIC test setup, registration speed doesn't seem to improve with or without pre-allocation—there's actually a bit of regression.

Not sure how it performs on 8 NICs yet. I'll test it once I get access to such a machine. In the meantime, feel free to keep this PR open.

xiaguan, Sep 22 '25 08:09

What is the size of the registered memory in your test?

staryxchen, Sep 22 '25 08:09

(40GB, 4GB)

xiaguan, Sep 22 '25 08:09

I think that size is too small to show the improvement from this patch. Could we try a larger capacity, like 400GB?

staryxchen, Sep 22 '25 08:09

8 NICs, 200GB

Without pre-allocation, default:

I0922 08:58:12.550942 32380 rdma_transport.cpp:143] Memory registration took 71332.3 ms

Without pre-allocation, with this PR:

I0922 08:56:06.719657 30195 rdma_transport.cpp:143] Memory registration took 420163 ms

With pre-allocation, default:

I0922 09:01:26.652885 33920 rdma_transport.cpp:143] Memory registration took 29864.5 ms

With pre-allocation, with this PR:

I0922 09:04:40.057971 34793 rdma_transport.cpp:143] Memory registration took 81101.3 ms

xiaguan, Sep 22 '25 09:09

@staryxchen You need to fix the merge conflicts. In addition, maybe you can refactor registerLocalMemoryBatch as well, because it also performs batched registration.

alogfans, Nov 11 '25 02:11

This PR can stay pending for now, as it may conflict with other optimization approaches (such as pre-allocating memory).

staryxchen, Nov 11 '25 02:11

@staryxchen JFYI, concurrent memory registration is an important optimization. CC @alogfans

stmatengss, Nov 30 '25 16:11