
No RDMA traffic count when `-use-unified-memory` is enabled

ch1y0q opened this issue 3 months ago • 3 comments

To Reproduce

MGPUSim version: v4.1.4, commit https://github.com/sarchlab/mgpusim/commit/4277061dd690f72c633d5e7fc392bb7690e8ede0

Command that recreates the problem

./bitonicsort -timing -use-unified-memory --report-rdma-transaction-count -unified-gpus 1,2,3,4 -trace-mem

where bitonicsort can be replaced with other test cases under amd/samples.

Current behavior

The RDMA outgoing and incoming counts on all GPUs are 0.

Expected behavior

Since the command simulates a multi-GPU run with unified memory access enabled, we expect to observe non-zero RDMA traffic counts.

Screenshots

(screenshot: RDMA transaction counts are all zero)

Additional context

  1. RDMA-based remote access typically operates at cache-line granularity, while page migration uses page-sized granularity. The current simulation does not explicitly report page-migration counts or provide an option to switch between these modes. Is this functionality implemented but undocumented, or is it absent?
  2. Additionally, it seems that CPU-GPU unified-memory page faults may not be fully modeled in MGPUSim. Did I overlook some configuration, or is this intentionally unimplemented?

ch1y0q avatar Sep 21 '25 05:09 ch1y0q

@ch1y0q You are right about point 1 in the additional context. RDMA-based access is at cache-line granularity, so it does not involve page migration. Turning on -use-unified-memory currently means (in the MGPUSim context) using page migration and not using GPU-GPU cache-line-level access. That is why you see 0 RDMA transactions. What if you turn off -use-unified-memory? The default method relies entirely on cache-line access. For now, I do not recommend using the unified-memory feature of MGPUSim, as it is very buggy. We are in the process of reimplementing the feature.

For CPU-GPU unified-memory page faults, you are right; MGPUSim does not support them. This is mainly because MGPUSim does not have a CPU model. However, we used to apply a trick: allocate all the memory initially on GPU 1 and perform computation only on GPUs 2, 3, 4, and 5, which should yield a similar simulation result.

Please let me know whether you can see RDMA transactions once you turn off the -use-unified-memory option.
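For reference, the rerun suggested above would drop only that one flag from the original reproduction command. This is a sketch based on the command in the report; whether the remaining flags (e.g. -unified-gpus) are still meaningful without unified memory is an assumption, not something confirmed here.

```shell
# Same command as in the report, with -use-unified-memory removed,
# so the simulator uses the default cache-line-granularity remote access.
./bitonicsort -timing --report-rdma-transaction-count -unified-gpus 1,2,3,4 -trace-mem
```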

syifan avatar Sep 22 '25 13:09 syifan

Thanks for your prompt reply and guidance, @syifan.

After removing the -use-unified-memory option as suggested, I can now observe non-zero RDMA transaction counts, which aligns with the expected behavior of cache-line-granularity remote access in multi-GPU simulations. This confirms that the RDMA transaction monitoring works correctly when the unified memory mode is not used.

(screenshot: non-zero RDMA transaction counts)

Regarding the page-migration functionality, I understand from your response that the current implementation is unstable and under redevelopment. For my research on page-migration mechanisms in multi-GPU systems, it would be very helpful to know whether there is a prior, more stable version or branch of MGPUSim that included a functional page-migration engine. I briefly checked the initial release (v3.0.0) available on GitHub but did not find the relevant calls (e.g., NewPageMigrationRspToDriver, NewPageMigrationReqToCP). Any pointers to a specific commit, branch, or alternative approach would be greatly appreciated.

ch1y0q avatar Sep 23 '25 07:09 ch1y0q

@ch1y0q v3.0.0 should be a reasonable version that has a relatively stable unified-memory implementation.

Option 2 is to go back to the original version of MGPUSim on GitLab https://gitlab.com/akita/mgpusim.

Option 3: If you do not care about the number of benchmarks you can run, try the most recent release, run each workload, and see which ones work. A good number of benchmarks should still run.

syifan avatar Sep 23 '25 15:09 syifan