criu
mem: use MAP_NORESERVE for mmap() that have content restored
Using MAP_NORESERVE for mmap() bypasses the kernel memory overcommit logic. This is useful when streaming an image to CRIU that is pre-loaded in memory.
We should always use MAP_NORESERVE for memory regions that are to be populated by reading the image.
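As a rough illustration of the idea (not the actual patch), the sketch below maps an anonymous region with MAP_NORESERVE and then fills it by reading from an image file descriptor. The helper name `map_and_fill` and its parameters are hypothetical and do not come from the CRIU source.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not CRIU code: map `len` bytes without reserving
 * commit for them, then populate the region from `img_fd`.
 * Returns the mapping, or NULL on failure or short input.
 */
static void *map_and_fill(int img_fd, size_t len)
{
	char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (addr == MAP_FAILED)
		return NULL;

	size_t done = 0;
	while (done < len) {
		ssize_t n = read(img_fd, addr + done, len - done);
		if (n < 0 && errno == EINTR)
			continue;		/* interrupted, retry */
		if (n <= 0) {			/* error or premature EOF */
			munmap(addr, len);
			return NULL;
		}
		done += (size_t)n;
	}
	return addr;
}
```

Because the pages are only touched as the image is read, physical memory is consumed incrementally rather than being accounted for up front at mmap() time.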
Ah, yes, true. Any suggestions to deal with the issue?
We can try adding an madvise() flag to the kernel that toggles this behaviour, but chances are that it won't get accepted...
What is the issue this patch was going to solve?
Why do you think it won't get accepted in the kernel?
The issue is the following: When restoring an image from remote storage (e.g., S3, GCS), there are two ways:
- Download the image to local storage, and then invoke CRIU to restore from local storage. This is quite slow unfortunately.
- Download the entire image into memory, and then serve it to CRIU. It's easier to buffer the image in memory because CRIU asks for image files in a different order than the checkpointed file order. Once the full image is in memory, it can potentially occupy the entirety of the host memory. When CRIU restores the application, it restores its vmas, which can be big, so it needs to over-commit to almost 200% of host memory. Total memory utilization is not impacted during restore, as data pages move from the in-memory image to the application, without keeping extra copies.
@nviennot You might want to take a look at the --auto-dedup feature (though it likely does not handle all the cases). The idea is to punch holes in the images almost immediately after the vmas are restored, which implicitly frees memory if the image lies in memory (e.g. on tmpfs). This way memory usage should stay around 100%.
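For readers unfamiliar with hole punching, here is a minimal sketch of the underlying Linux call. It is not the actual --auto-dedup implementation; the helper name `drop_restored_range` is made up for illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not CRIU code: release the backing storage for
 * [off, off + len) of an image file once those pages have been restored.
 * FALLOC_FL_KEEP_SIZE keeps the file size unchanged; the punched range
 * reads back as zeros. Returns 0 on success, -1 with errno set on failure.
 */
static int drop_restored_range(int img_fd, off_t off, off_t len)
{
	return fallocate(img_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 off, len);
}
```

On a memory-backed file (tmpfs or a memfd), punching a hole gives the pages back to the system, which is why usage can stay near 100% instead of doubling.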
I see. However, using tmpfs doesn't allow over-committing memory: I tried to make a large mmap() when most of the host memory is used by tmpfs, and it fails. Passing MAP_NORESERVE to mmap() makes it succeed, even under tmpfs memory pressure.
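The experiment described above can be approximated with a small probe like the following. This is only a sketch; `try_anon_map` is a hypothetical name, and the outcome depends on the host's overcommit settings and current memory pressure.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical probe: attempt an anonymous private mapping of `len` bytes
 * with optional extra flags (e.g. MAP_NORESERVE), then unmap it again.
 * Returns 0 if the mapping succeeded, otherwise the errno from mmap().
 */
static int try_anon_map(size_t len, int extra_flags)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
	if (p == MAP_FAILED)
		return errno;
	munmap(p, len);
	return 0;
}
```

Calling `try_anon_map(len, 0)` and `try_anon_map(len, MAP_NORESERVE)` with a `len` close to the host memory size, while tmpfs holds most of that memory, should reproduce the difference reported here: the first call is expected to fail with ENOMEM and the second to succeed.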
On Thu, Apr 9, 2020 at 8:44 PM Nicolas Viennot wrote:
> Why do you think it won't get accepted in the kernel?
I don't know that it won't be accepted; I just know that mm people usually don't like adding new madvise() or similar calls. That does not mean we cannot and should not try :)
> The issue is the following: When restoring an image from remote storage (e.g., S3, GCS), there are two ways:
> - Download the image on local storage, and then invoke CRIU to restore from local storage. This is quite slow unfortunately.
> - Download the entire image in memory, and then serve it to CRIU. It's easier to buffer the image in memory because CRIU asks for image files in a different order than the checkpointed file order. Once the full image is in memory, it can potentially occupy the entirety of the host memory. When CRIU restores the application, it restores its vmas, which can be big, so it needs to over-commit almost at 200%. Total memory utilization is not impacted during restore, as data pages are going from the in-memory image to the application, without keeping extra copies.
If the restore runs as root you may temporarily set /proc/sys/vm/overcommit_memory to 1 (OVERCOMMIT_ALWAYS). It's a bit hacky, but should do the trick.
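A minimal sketch of that workaround, assuming the caller has the privileges needed to write the sysctl. The helper names are hypothetical; only the /proc path and the value 1 come from the comment above.

```c
#include <stdio.h>

#define OVERCOMMIT_PATH "/proc/sys/vm/overcommit_memory"

/* Returns the current overcommit mode (0, 1 or 2), or -1 if unreadable. */
static int read_overcommit_mode(void)
{
	int mode = -1;
	FILE *f = fopen(OVERCOMMIT_PATH, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &mode) != 1)
		mode = -1;
	fclose(f);
	return mode;
}

/* Writes a new mode; returns 0 on success, -1 on failure (e.g. not root). */
static int write_overcommit_mode(int mode)
{
	FILE *f = fopen(OVERCOMMIT_PATH, "w");
	if (!f)
		return -1;
	int ok = fprintf(f, "%d\n", mode) > 0;
	/* Buffered output may only reach the kernel, and fail, at fclose(). */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A restore wrapper could save the value returned by `read_overcommit_mode()`, call `write_overcommit_mode(1)` before restoring, and write the saved value back afterwards. Note that the setting is system-wide, so other processes are affected while it is changed.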
I can try.
Overall, I see no reason the following mmap flags couldn't be adjusted by mprotect/madvise:
- MAP_POPULATE
- MAP_NONBLOCK
- MAP_UNINITIALIZED
- MAP_NORESERVE
- MAP_SYNC
Regarding /proc/sys/vm/overcommit_memory, I unfortunately don't have any sort of privileged access when doing checkpoint/restore.
A friendly reminder that this PR had no activity for 30 days.