Add `MADV_COLLAPSE` when committing a range?
Linux 6.1 introduces a new `madvise` flag, `MADV_COLLAPSE`. I wonder whether it would also be helpful in snmalloc. From the kernel commit message:
Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a synchronous collapse of memory at their own expense.
The benefits of this approach are:
- CPU is charged to the process that wants to spend the cycles for the THP
- Avoid unpredictable timing of khugepaged collapse
An immediate user of this new functionality are malloc() implementations that manage memory in hugepage-sized chunks, but sometimes subrelease memory back to the system in native-sized chunks via MADV_DONTNEED; zapping the pmd. Later, when the memory is hot, the implementation could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain hugepage coverage and dTLB performance. TCMalloc is such an implementation that could benefit from this[2].
Only privately-mapped anon memory is supported for now, but it is expected that file and shmem support will be added later to support the use-case of backing executable text by THPs. Current support provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which might impair services from serving at their full rated load after (re)starting. Tricks like mremap(2)'ing text onto anonymous memory to immediately realize iTLB performance prevents page sharing and demand paging, both of which increase steady state memory footprint. With MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance and lower RAM footprints.
This call respects THP eligibility as determined by the system-wide /sys/kernel/mm/transparent_hugepage/enabled sysfs settings and the VMA flags for the memory range being collapsed.
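To make the flow described above concrete, here is a minimal sketch in C of what an allocator might do: touch a hugepage-sized chunk, subrelease one native page with `MADV_DONTNEED`, touch it again, then request a synchronous collapse. This is not snmalloc code. The helpers `try_collapse` and `demo_collapse` are hypothetical names, the 2 MiB hugepage size is an assumption (the x86-64 default), and the fallback value 25 for `MADV_COLLAPSE` is taken from the Linux 6.1 uapi headers for the case where the libc headers do not define it yet. On kernels older than 6.1 the call is expected to fail with `EINVAL`, so the result is treated as best effort.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
/* Assumption: value from the Linux 6.1 uapi headers, for older libc headers. */
#define MADV_COLLAPSE 25
#endif

/* Assumption: 2 MiB PMD-sized huge pages (x86-64 default). */
#define HUGE_PAGE_SIZE ((size_t)2 << 20)

/* Hypothetical helper: ask the kernel to collapse [addr, addr + len) now.
 * Returns 0 on success, otherwise the errno reported by madvise(). */
static int try_collapse(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_COLLAPSE) == 0)
        return 0;
    return errno;
}

/* Hypothetical demo: returns 0 if the collapse succeeded, a positive errno
 * if the kernel declined or does not support it, or -1 if mmap() failed. */
static int demo_collapse(void)
{
    /* Over-allocate so a hugepage-aligned chunk fits inside the mapping. */
    size_t len = 2 * HUGE_PAGE_SIZE;
    char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return -1;

    uintptr_t aligned = ((uintptr_t)raw + HUGE_PAGE_SIZE - 1) &
                        ~(uintptr_t)(HUGE_PAGE_SIZE - 1);
    char *chunk = (char *)aligned;
    assert(chunk + HUGE_PAGE_SIZE <= raw + len);

    /* Commit the chunk by touching every page. */
    memset(chunk, 1, HUGE_PAGE_SIZE);

    /* Subrelease one native page; this is what can split a huge mapping. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    (void)madvise(chunk, page, MADV_DONTNEED);

    /* The memory becomes hot again: touch it, then request the collapse. */
    memset(chunk, 2, page);
    int rc = try_collapse(chunk, HUGE_PAGE_SIZE);

    munmap(raw, len);
    return rc;
}
```

A caller would treat any non-zero result from `try_collapse` as "no huge page this time" and carry on with native pages, since the collapse is only an optimisation.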
Thanks for highlighting this. It would be interesting to integrate this into the backend, but I am not sure how yet.