Enable 1GB Hugepage Support
Description
:memo: Please include a summary of the change
Add support for allocating 1 GB hugepages when the requested size exceeds 1 GB. If the 1 GB hugepage allocation fails, the allocator automatically falls back to the default 2 MB hugepages.
The threshold that triggers a 1 GB hugepage allocation can be lowered for experimentation, which reduces latency (see the test results below).
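As a rough illustration of the fallback path (a simplified sketch, not the exact code in this PR; the function name and the rounding are illustrative), the allocation boils down to an `mmap` with `MAP_HUGETLB`, requesting 1 GB pages via the kernel's `MAP_HUGE_SHIFT` encoding and retrying with the default hugepage size if that fails:

```cpp
#include <sys/mman.h>
#include <cstddef>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26   // kernel encoding of the hugepage order in mmap flags
#endif

// Try a 1 GB hugepage mapping first; if the kernel rejects it (e.g. no 1 GB
// pages reserved at boot), fall back to the default 2 MB hugepages.
static void* allocHugeBuffer(size_t size) {
    const int base = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB;

    if (size >= (size_t{1} << 30)) {
        // Round up to a 1 GB multiple and request 1 GB pages explicitly.
        size_t sz = (size + (size_t{1} << 30) - 1) & ~((size_t{1} << 30) - 1);
        void* mem = mmap(nullptr, sz, PROT_READ | PROT_WRITE,
                         base | (30 << MAP_HUGE_SHIFT), -1, 0);
        if (mem != MAP_FAILED)
            return mem;
    }

    // Fallback: default 2 MB hugepages, rounded up to a 2 MB multiple.
    size_t sz = (size + (size_t{1} << 21) - 1) & ~((size_t{1} << 21) - 1);
    void* mem = mmap(nullptr, sz, PROT_READ | PROT_WRITE, base, -1, 0);
    return mem == MAP_FAILED ? nullptr : mem;
}
```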
Type of change
- [ ] Bug fix
- [ ] New feature
- [ ] Documentation update
- [ ] A new research paper code implementation
- [x] Other
Tests & Results
:memo: Please describe the tests that you ran to verify your changes. Also include any numerical results (throughput/latency etc.) relevant to the change.
Latency benchmark on example 1:
With 1 GB hugepages:
-- CLI PARAMETERS:
Enable hugepages: 1 Enable mapped pages: 1 Data stream: HOST Number of test runs: 50 Starting transfer size: 64 Ending transfer size: 4194304
-- PERF LOCAL
| Transfer size (B) | Average throughput (MB/s) | Average latency (us) |
|---:|---:|---:|
| 64 | 98.2016 | 3.15146 |
| 128 | 195.822 | 3.1738 |
| 256 | 390.318 | 3.161 |
| 512 | 770.096 | 3.2823 |
| 1024 | 1490.58 | 3.408 |
| 2048 | 2883.6 | 3.48878 |
| 4096 | 5334.62 | 3.86836 |
| 8192 | 7829.08 | 4.22588 |
| 16384 | 10160.2 | 4.76938 |
| 32768 | 11526.1 | 6.06056 |
| 65536 | 11839.4 | 8.63796 |
| 131072 | 12013.5 | 13.7449 |
| 262144 | 12102.4 | 24.0373 |
| 524288 | 12153 | 44.5359 |
| 1048576 | 12173.3 | 85.4818 |
| 2097152 | 12162.3 | 167.375 |
| 4194304 | 12196.9 | 331.187 |
Without 1 GB hugepages (original):
-- CLI PARAMETERS:
Enable hugepages: 1 Enable mapped pages: 1 Data stream: HOST Number of test runs: 50 Starting transfer size: 64 Ending transfer size: 4194304
-- PERF LOCAL
| Transfer size (B) | Average throughput (MB/s) | Average latency (us) |
|---:|---:|---:|
| 64 | 85.5534 | 3.6574 |
| 128 | 170.844 | 3.71552 |
| 256 | 337.174 | 3.7207 |
| 512 | 643.109 | 3.8669 |
| 1024 | 1221.19 | 3.9219 |
| 2048 | 2151.89 | 4.0968 |
| 4096 | 3919.63 | 4.32658 |
| 8192 | 5471.77 | 5.0043 |
| 16384 | 7409.5 | 5.4727 |
| 32768 | 8667.55 | 6.81816 |
| 65536 | 9308.38 | 9.80982 |
| 131072 | 9781.01 | 15.8872 |
| 262144 | 9968.8 | 28.0092 |
| 524288 | 10087.6 | 52.1635 |
| 1048576 | 10196.1 | 100.466 |
| 2097152 | 10033.8 | 197.301 |
| 4194304 | 10293.4 | 391.782 |
Checklist
- [ ] I have commented my code and made corresponding changes to the documentation.
- [x] I have added tests/results that prove my fix is effective or that my feature works.
- [ ] My changes generate no new warnings or errors & all tests successfully pass.
Hi @HongshiTan - thank you for this contribution! We've also been using huge-pages in Coyote, but mostly as a work-around and never supported them out-of-the-box.
A couple of questions/suggestions on my side:
- With 2 MB huge-pages, do you know why the throughput saturates at ~10 GBps? When we run the same example, we get around ~12 GBps - are there some NUMA effects?
- Rather than trying to allocate 1 GB pages and falling back if the allocation fails, we should check what the hardware was built for; e.g., even if we allocate 1 GB huge-pages but the MMU/TLB in the vFPGA is built with 2 MB support, there will not be much improvement in terms of TLB misses. There is a parameter called TLBL_BITS in the CMake configuration (https://github.com/fpgasystems/Coyote/blob/0fc0bf9725a01898bc28fd0560add23967c6fcc6/cmake/FindCoyoteHW.cmake#L119). Would it make sense to ask the driver (via an IOCTL call) what the large TLB size is (it's stored in a variable in the vFPGA char device) and, depending on the driver response, allocate huge pages accordingly (e.g., if it returns 1 GB, then we allocate 1 GB pages)?
That makes sense. I double-checked the performance, and the difference comes from cross-NUMA access. I will add a script to make NUMA-bound execution easier, as well as huge-page allocation based on TLBL_BITS.
I mainly updated the logic for determining Coyote's huge-page size. Instead of relying on the existing CMake configuration, I read a register that the shell already exposes with this information and pass the value to user space through the existing IOCTL_READ_SHELL_CONFIG.
If this does not work from your side, please let me know and I can switch back to using the CMake configuration. I prefer reading the value from hardware because it guarantees alignment with the actual bitstream configuration and provides some adaptability to hardware changes.
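On the user-space side, the flow is roughly the sketch below; the config-buffer layout and the index of the hugepage field are placeholders rather than the final code (the real definitions come from the Coyote user-space headers), only `IOCTL_READ_SHELL_CONFIG` itself is the existing call:

```cpp
#include <sys/ioctl.h>
#include <cstdint>
#include <cstddef>

#include "cDefs.hpp"   // assumed location of IOCTL_READ_SHELL_CONFIG in the Coyote user-space headers

// Query the shell configuration from the vFPGA char device and derive the
// hugepage size from the large-page order reported by the hardware register.
size_t queryShellLargePageSize(int fpga_fd) {
    uint64_t cfg[16] = {};                      // placeholder config-buffer layout
    if (ioctl(fpga_fd, IOCTL_READ_SHELL_CONFIG, cfg) != 0)
        return size_t{1} << 21;                 // conservative default: 2 MB

    // Placeholder index: assume one config word carries the large-page order
    // (21 for 2 MB, 30 for 1 GB) read from the shell register.
    uint64_t order = cfg[0];
    return size_t{1} << order;
}
```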
Hey Hongshi, thanks for this change! I've reviewed the code and it looks mostly good to me. Also thank you for the scripts and the documentation for them.
I am just thinking a bit about fail-safe mechanisms in this code - intuitively, it would make the most sense to match the huge-page allocation on the CPU to whatever the large TLB in Coyote supports and fail otherwise. That is, if Coyote is built with 1 GB hugepages, we allocate a 1 GB hugepage on the host, and if that fails, we throw an error to the user. This would avoid any inconsistencies between the host and hardware (which would probably lead to weird corner cases). Similarly, I don't think we should have a compile-time flag (simply because people then often forget it exists) - this should be the natural behaviour at run-time.
So, if possible, the minor change would be:
- When allocating a huge page, query how many bits the lTLB is built with
- Try to allocate with that many bits on the host (if it's 1 GB then 1 GB, if it's 2 MB then 2 MB, and if it's some obscure number, we try with that and let the OS fail if it fails)
- If it fails, we throw the error back to the user (a rough sketch of this flow follows the list).
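Something along these lines (a sketch only; the names and error handling are purely illustrative), where the host allocation is always derived from the order reported by the driver via the kernel's `MAP_HUGE_SHIFT` encoding and a failure is surfaced to the user instead of silently switching page sizes:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <stdexcept>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26   // kernel encoding of the hugepage order in mmap flags
#endif

// Allocate host hugepages of exactly the order the lTLB was built with
// (e.g. 21 -> 2 MB, 30 -> 1 GB) and fail loudly instead of falling back.
void* allocMatchingHugePages(size_t size, unsigned page_order) {
    size_t page = size_t{1} << page_order;
    size_t sz   = (size + page - 1) & ~(page - 1);   // round up to a page multiple

    void* mem = mmap(nullptr, sz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                         (page_order << MAP_HUGE_SHIFT),
                     -1, 0);
    if (mem == MAP_FAILED)
        throw std::runtime_error("hugepage allocation failed for the page size the shell was built with");
    return mem;
}
```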
What do you think about this? Does it make sense?
@JonasDann do we need to make any changes in the simulation environment to ensure consistency? Or does the simulation always assume 2 MB hugepages?
@bo3z I don't think this has any implications on the simulation. For simplicity, the simulation is not aware of pages anyway.
Hey @HongshiTan, will you have time to make the changes proposed above? Or should I? It's not urgent, we can also look into it after the break.
Hi Ben, yes, I’ll update it this week, and we can discuss it after the break.
By the way, I’m also working on enabling P2P access to NVIDIA GPUs, which requires some driver modifications. Once the functionality is tested and ready from my side, I would like to discuss with you how we can integrate it into the current driver.