No RDMA traffic count when `-use-unified-memory` is enabled
To Reproduce
MGPUSim version: v4.1.4, commit 4277061dd690f72c633d5e7fc392bb7690e8ede0 (https://github.com/sarchlab/mgpusim/commit/4277061dd690f72c633d5e7fc392bb7690e8ede0)
Command that recreates the problem
./bitonicsort -timing -use-unified-memory -report-rdma-transaction-count -unified-gpus 1,2,3,4 -trace-mem
where bitonicsort can be replaced with other test cases under amd/samples.
Current behavior
The RDMA outgoing and incoming counts on all GPUs are 0.
Expected behavior
Since the command simulates multiple GPUs with unified memory access enabled, we expect to observe non-zero RDMA traffic counts.
Additional context
- RDMA-based remote access typically operates at cache-line granularity, while page migration uses page-sized granularity. The current simulation does not explicitly report page migration counts or provide options to switch between these modes. Is this functionality implemented but undocumented, or is it absent?
- Additionally, it seems that CPU-GPU unified-memory page faults may not be fully modeled in MGPUSim. Did I overlook some configuration, or is this intentionally not implemented?
@ch1y0q You are right on the first point in the additional context. RDMA-based access is at cache-line granularity, so it does not involve page migration. Since you turned on -use-unified-memory, which (currently, in the MGPUSim context) means using page migration rather than GPU-GPU cache-line-level access, you see 0 RDMA transactions. What if you turn off -use-unified-memory? The default method relies fully on cache-line access. For now, I do not recommend using the unified memory feature of MGPUSim, as it is very buggy. We are in the process of reimplementing the feature.
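To put the granularity difference in numbers (using common default sizes of a 4 KiB page and a 64 B cache line, which are illustrative assumptions rather than confirmed MGPUSim constants):

```shell
# Illustrative granularity comparison; sizes are assumed common defaults.
PAGE_SIZE=4096    # 4 KiB page, the unit moved by page migration
CACHE_LINE=64     # 64 B cache line, the unit moved by an RDMA transfer
echo $((PAGE_SIZE / CACHE_LINE))   # prints 64: cache lines per page
```

So one page migration moves as much data as 64 cache-line RDMA transfers, which is why the two mechanisms are modeled and counted separately.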
For CPU-GPU unified-memory page faults, you are right; MGPUSim does not support them, mainly because MGPUSim does not have a CPU model. However, we used to use a trick: allocate all the memory initially on GPU 1 and perform computing only on GPUs 2, 3, 4, and 5, which should yield a similar simulation result.
Please let me know whether you can see RDMA transactions once you turn off the -use-unified-memory option.
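For reference, the two runs to compare would look like the following (the flags mirror the repro command above; the `echo` makes this a dry run that only prints the command lines, so drop it to actually execute):

```shell
# Dry run: echo prints the command lines instead of launching the simulator.
# The flag set mirrors the original repro command.
COMMON="-timing -report-rdma-transaction-count -unified-gpus 1,2,3,4 -trace-mem"

# With unified memory (page migration path; RDMA counts come out as 0):
echo ./bitonicsort -use-unified-memory $COMMON

# Without unified memory (default cache-line remote access; expect non-zero RDMA counts):
echo ./bitonicsort $COMMON
```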
Thanks for your prompt reply and guidance, @syifan.
After removing the -use-unified-memory option as suggested, I can now observe non-zero RDMA transaction counts, which aligns with the expected cache-line-granularity remote access in multi-GPU simulations. This confirms that RDMA transaction monitoring works correctly when unified memory mode is disabled.
Regarding the page migration functionality, I understand from your response that the current implementation is unstable and under redevelopment. For my research on page migration mechanisms in multi-GPU systems, it would be very helpful to know whether there is a prior, more stable version or branch of MGPUSim that includes a functional page migration engine. I briefly checked the initial release (v3.0.0) available on GitHub but did not find the relevant calls (e.g., NewPageMigrationRspToDriver, NewPageMigrationReqToCP). Any pointers to a specific commit, branch, or alternative approach would be greatly appreciated.
@ch1y0q Option 1: v3.0.0 should be a reasonable version with a relatively stable unified-memory implementation.
Option 2: go back to the original version of MGPUSim on GitLab: https://gitlab.com/akita/mgpusim.
Option 3: if you do not care about the number of benchmarks to run, try the most recent release, run each workload, and see which ones work. A good number of benchmarks should still run.
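Option 3 can be scripted. A rough sketch, where `SAMPLES` and `RUN` are assumptions (the amd/samples layout is taken from the repro instructions, and `go run . -timing` is just one plausible way to launch a sample; adjust both to your checkout):

```shell
#!/bin/sh
# Sketch: try every sample workload and record which ones run to completion.
# SAMPLES and RUN are assumptions; point them at your checkout and flags.
SAMPLES=${SAMPLES:-amd/samples}
RUN=${RUN:-"go run . -timing"}   # command to execute one sample from its directory

for dir in "$SAMPLES"/*/; do
    name=$(basename "$dir")
    # A non-zero exit status marks the workload as broken in this release.
    if (cd "$dir" && $RUN >/dev/null 2>&1); then
        echo "OK   $name"
    else
        echo "FAIL $name"
    fi
done
```

The OK/FAIL list then tells you which benchmarks are still usable on the release you picked.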