criu icon indicating copy to clipboard operation
criu copied to clipboard

criu: Add `on-demand` option for task restore

Open Yangfisher1 opened this issue 3 years ago • 44 comments

These patches include:

  • An extension to support restoring processes with sys_mmap instead of sys_preadv to optimize cases when the restored process does not need all the checkpointed pages. Note that the restoration of anonymous memory areas is left unchanged. The resulting performance improvement can be 30.65%(1.808s vs. 1.254s) for the test zdtm/static/mntns_ghost01. Also, it accelerates other tests, e.g., zdtm/static/file_locks00(1.807s vs. 1.432s) and zdtm/static/unlink_regular(1.936s vs. 1.321s) on my machine.
  • This extension can be optionally enabled (i.e., by default disabled) with criu restore -D dir --on-demand.
  • Proper helper messages are added so that usage of the extension can be viewed via criu -h.
  • The extension has passed all the tests, i.e., ./zdtm.py run -a.

Signed-off-by: Yuhan Yang [email protected]

Yangfisher1 avatar Dec 24 '21 06:12 Yangfisher1

Post an issue#1712

Yangfisher1 avatar Dec 24 '21 06:12 Yangfisher1

Not sure why you opened #1712. We can have all discussions here.

For the missing syscalls: You just have to add them to the syscall tables.

Overall: Have a look at all the CI failures and try to fix them. It still fails at many different places.

adrianreber avatar Dec 24 '21 09:12 adrianreber

Most of them failed because when compiling, the file criu/pie/restore-blob.h would have a line #define restorer_sym__export_restore_task.cold 0x475b. I don't know why this happened cause the file was auto-generated. What I've done is changing some codes in criu/pie/restorer.c, function __export_restore_task . When I test the codes, using gcc 5.4.0 wouldn't produce this problem.

Yangfisher1 avatar Dec 24 '21 10:12 Yangfisher1

It is a strange error. Never seen the that before. gcc 5.4.0 sounds really old (although our CentOS 7 test uses something even older). Maybe try it with a newer gcc and try to reproduce the error locally. I am not familiar with that part of the code, so that I cannot really help you. Hopefully someone else can help you figure out why the same symbol is generated with and without .cold.

The only idea that comes to my mind, is maybe it is related to something with the stack, because you add a couple of new variables. But that is just a wild uneducated guess.

adrianreber avatar Dec 24 '21 11:12 adrianreber

The only idea that comes to my mind, is maybe it is related to something with the stack, because you add a couple of new variables. But that is just a wild uneducated guess.

Yes. Compiling with the original version won't produce the problem even with gcc9, so it might be something with the stack I guess.

Yangfisher1 avatar Dec 24 '21 11:12 Yangfisher1

By the way, how can I add a sys_pread syscall in arm? I know that the syscall table is in compel/arch/arm/plugins/std/syscalls/syscall.def, and it has a syscall whose name is sys_pread64, but I don't know what's the meaning of code64 and code32. Is it enough to just add a new line pread with same code32 and code64?

Yangfisher1 avatar Dec 24 '21 14:12 Yangfisher1

Adding a flag -fno-reorder-blocks-and-partition in criu/Makefile seems to be able to solve the cold symbol problem. But I'm concerned that it might have some performance issues.

Yangfisher1 avatar Dec 27 '21 10:12 Yangfisher1

ci/circleci: test-local-gcc failed at test zdtm/static/pthread_timers in ns. However, I passed the test on my own machine, so I checked out the log and found out that

=== Run 136/423 =====----------- zdtm/static/pthread_timers
userns is supported
timens isn't supported on 5.4.0-1021-gcp
===================== Run zdtm/static/pthread_timers in ns =====================
Start test
./pthread_timers --pidfile=pthread_timers.pid --outfile=pthread_timers.out
Run criu dump
=[log]=> dump/zdtm/static/pthread_timers/232/1/dump.log
------------------------ grep Error ------------------------
b'(00.001254) seccomp: Collected tid_real 236 mode 0'
b"(00.001285) \tSeizing 236's 237 thread"
b'(00.001381) seccomp: Collected tid_real 237 mode 0'
b"(00.001389) \tSeizing 236's 238 thread"
b'(00.001398) Warn  (compel/src/lib/infect.c:126): Unable to interrupt task: 238 (Operation not permitted)'
b"(00.001424) \tSeizing 236's 238 thread"
b'(00.001432) Warn  (compel/src/lib/infect.c:126): Unable to interrupt task: 238 (Operation not permitted)'
b"(00.001448) \tSeizing 236's 238 thread"
b'(00.001454) Warn  (compel/src/lib/infect.c:126): Unable to interrupt task: 238 (Operation not permitted)'
b"(00.001470) \tSeizing 236's 238 thread"
b'(00.001476) Warn  (compel/src/lib/infect.c:126): Unable to interrupt task: 238 (Operation not permitted)'
b"(00.001490) \tSeizing 236's 238 thread"
b'(00.001496) Warn  (compel/src/lib/infect.c:126): Unable to interrupt task: 238 (Operation not permitted)'
b"(00.001514) \tSeizing 236's 238 thread"
b'(00.001521) Warn  (compel/src/lib/infect.c:126): Unable to interrupt task: 238 (Operation not permitted)'
b'(00.001542) Unlock network'
b'(00.001548) Unfreezing tasks into 1'
b'(00.001552) \tUnseizing 232 into 1'
b'(00.001575) \tUnseizing 236 into 1'
b'(00.001603) Error (criu/cr-dump.c:1788): Dumping FAILED.'
------------------------ ERROR OVER ------------------------
############## Test zdtm/static/pthread_timers FAIL at CRIU dump ###############
Send the 9 signal to  232
Wait for zdtm/static/pthread_timers(232) to die for 0.100000

Considering that I use a Ubuntu 20.04 with a kernel 5.11.0-43-generic, is that timens isn't supported on 5.4.0-1021-gcp caused the test failed?

By the way, I also wonder why ci/circleci: test-local-clang failed. The error info is

make -r -R -f /home/circleci/criu/scripts/nmk/scripts/main.mk makefile=Makefile obj=criu/arch/x86 all
criu/Makefile.packages:36: Can not find some of the required libraries
criu/Makefile.packages:37: Make sure the following packages are installed
criu/Makefile.packages:38: RPM based distros: protobuf protobuf-c protobuf-c-devel protobuf-compiler protobuf-devel protobuf-python libnl3-devel libcap-devel python3-future
criu/Makefile.packages:39: DEB based distros: libprotobuf-dev libprotobuf-c-dev protobuf-c-compiler protobuf-compiler python3-protobuf python3-future libnl-3-dev libcap-dev
criu/Makefile.packages:40: To run tests the following packages are needed
criu/Makefile.packages:41: RPM based distros: libaio-devel python3-PyYAML
criu/Makefile.packages:42: DEB based distros: python3-yaml libaio-dev libaio-dev
criu/Makefile.packages:43: *** Compilation aborted.  Stop.

Yangfisher1 avatar Dec 28 '21 01:12 Yangfisher1

Please do not include merge commits in your pull request. Use rebase to apply your patches on the latest criu-dev branch in your pull request. Also, please no fix up commits like the format fix. Just do the fix in the actual commit changing the code.

The zdtm/static/pthread_timers test failure is a flake and happens sometimes. We have not figured out why it happens. Rerunning CI will fix this CI failure.

The other error criu/Makefile.packages:36: Can not find some of the required libraries is not about packages, but is related to one of your changes. Clang is not happy with your changes: error: unknown argument: '-fno-reorder-blocks-and-partition'

Overall please make sure the Author of each commit matches the Signed-off-by exactly. You are using sometimes multiple Signed-off-by, sometimes they are different. Sometimes the author uses a gmail.com address, sometimes not. Try to be consitent.

adrianreber avatar Dec 28 '21 10:12 adrianreber

Please do not include merge commits in your pull request. Use rebase to apply your patches on the latest criu-dev branch in your pull request. Also, please no fix up commits like the format fix. Just do the fix in the actual commit changing the code.

The zdtm/static/pthread_timers test failure is a flake and happens sometimes. We have not figured out why it happens. Rerunning CI will fix this CI failure.

The other error criu/Makefile.packages:36: Can not find some of the required libraries is not about packages, but is related to one of your changes. Clang is not happy with your changes: error: unknown argument: '-fno-reorder-blocks-and-partition'

Overall please make sure the Author of each commit matches the Signed-off-by exactly. You are using sometimes multiple Signed-off-by, sometimes they are different. Sometimes the author uses a gmail.com address, sometimes not. Try to be consitent.

I'll try to fix these after I fix the .cold symbol problem, using flag -fno-reorder-blocks-and-partition works well for gcc, however clang doesn't have this flag. That's a pity, maybe I need more time to think about it. Thank you for your reply!

Yangfisher1 avatar Dec 28 '21 10:12 Yangfisher1

Codecov Report

Merging #1711 (ac5c87f) into criu-dev (8bddd88) will increase coverage by 0.00%. The diff coverage is 60.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##           criu-dev    #1711   +/-   ##
=========================================
  Coverage     70.74%   70.74%           
=========================================
  Files           125      125           
  Lines         30952    30959    +7     
=========================================
+ Hits          21897    21902    +5     
- Misses         9055     9057    +2     
Impacted Files Coverage Δ
criu/config.c 82.52% <50.00%> (-0.13%) :arrow_down:
criu/cr-restore.c 67.28% <66.66%> (+0.19%) :arrow_up:
criu/uffd.c 79.36% <0.00%> (-0.48%) :arrow_down:
criu/page-xfer.c 67.35% <0.00%> (+0.10%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8bddd88...ac5c87f. Read the comment docs.

codecov-commenter avatar Dec 29 '21 13:12 codecov-commenter

Hmm, it's quite strange. Using clang to compile seemed fine at Compat Tests/ build(CLANG), however at X86_64 CLANG Test / build it showed that using the flag that clang doesn’t have again. But I did add a ifeq in Makefile

ifeq ($(HOSTCC), gcc)
	CFLAGS += -fno-reorder-blocks-and-partition
endif

So, is there anything special for X86_64 CLANG Test / build?

Yangfisher1 avatar Dec 29 '21 13:12 Yangfisher1

Hmm, it's quite strange. Using clang to compile seemed fine at Compat Tests/ build(CLANG), however at X86_64 CLANG Test / build it showed that using the flag that clang doesn’t have again. But I did add a ifeq in Makefile

ifeq ($(HOSTCC), gcc)
	CFLAGS += -fno-reorder-blocks-and-partition
endif

So, is there anything special for X86_64 CLANG Test / build?

I guess the reason might be when building the container, in Dockerfile.x86_64, The ARG CC is gcc. So the flag is just added into the CFLAGS. However, at some point, the CC is replaced by CLANG, but I'm not very sure about that.

Yangfisher1 avatar Dec 30 '21 02:12 Yangfisher1

Hmm, it's quite strange. Using clang to compile seemed fine at Compat Tests/ build(CLANG), however at X86_64 CLANG Test / build it showed that using the flag that clang doesn’t have again. But I did add a ifeq in Makefile

ifeq ($(HOSTCC), gcc)
	CFLAGS += -fno-reorder-blocks-and-partition
endif

So, is there anything special for X86_64 CLANG Test / build?

I guess the reason might be when building the container, in Dockerfile.x86_64, The ARG CC is gcc. So the flag is just added into the CFLAGS. However, at some point, the CC is replaced by CLANG, but I'm not very sure about that.

The reason is that I used HOSTCC to determine instead of CC.

Yangfisher1 avatar Dec 30 '21 06:12 Yangfisher1

@Yangfisher1 you can use git rebase to edit your patches and combine multiple patches into a single one. When you have multiple patches, the automated tests must pass after each patch, not just after the last one. This allows us to use tools such as git bisect.

@rst0git Does this mean that I have to merge all the commits into one, or does it mean that the commits corresponding to each patch are merged into one?In this case, I add a new option on-demand of criu. So should I just combine all commits into one?

Yangfisher1 avatar Jan 06 '22 01:01 Yangfisher1

Does this mean that I have to merge all the commits into one, or does it mean that the commits corresponding to each patch are merged into one? So should I just combine all commits into one?

Not necessarily because large commits are difficult to review. You should have each logical change into a separate commit. However, when dividing your changes into a series of commits, you need to ensure that CRIU builds and runs properly after each commit. This helps when other developers are using git bisect to track down a problem.

You can use as an example some of the previous pull request / commits that have introduced new options to CRIU:

  • --network-lock: https://github.com/checkpoint-restore/criu/pull/1539
  • --stream: https://github.com/checkpoint-restore/criu/commit/7d79a58f4daa9ddda015fc0aaca2a60d156d12cc
  • --pre-dump-mode: https://github.com/checkpoint-restore/criu/commit/20d4920a8bf74d1eceebc076bcc00889ba40e9f7

There are guidelines on how to divide your changes into a series of commits available in: https://www.kernel.org/doc/html/latest/process/submitting-patches.html#separate-your-changes

rst0git avatar Jan 06 '22 11:01 rst0git

Thanks! I'll refine my commits with these examples later.

Yangfisher1 avatar Jan 06 '22 12:01 Yangfisher1

@rst0git Done.

Yangfisher1 avatar Jan 07 '22 08:01 Yangfisher1

@avagin I've already fixed the type conversion problem and others. It still need some type conversion to compare the address.

Yangfisher1 avatar Jan 09 '22 12:01 Yangfisher1

Hmm, it's quite strange because I tried to use make indent and it showed me format like this. However, it still failed the code linter. I don't know why.

Yangfisher1 avatar Jan 19 '22 09:01 Yangfisher1

Hmm, it's quite strange because I tried to use make indent and it showed me format like this. However, it still failed the code linter. I don't know why.

Make sure to use the same version of clang-format (13) as our CI.

adrianreber avatar Jan 19 '22 09:01 adrianreber

Hmm, it's quite strange because I tried to use make indent and it showed me format like this. However, it still failed the code linter. I don't know why.

Make sure to use the same version of clang-format (13) as our CI.

My bad. I'm using a clang whose version is 11.0.0.

Yangfisher1 avatar Jan 19 '22 11:01 Yangfisher1

How do you test your changes? Could you add the on-demand option in zdtm.py and execute tests in CI?

avagin avatar Jan 19 '22 16:01 avagin

How do you test your changes? Could you add the on-demand option in zdtm.py and execute tests in CI?

I just simply replace line 1616, criu/pie/restorer.c

			if (args->on_demand_restore) {

into

			if (!args->on_demand_restore) {

and then test these changes. I don't know how to add this on-demand option in zdtm.py. Could you please tell me how to do?

Yangfisher1 avatar Jan 20 '22 02:01 Yangfisher1

I don't know how to add this on-demand option in zdtm.py. Could you please tell me how to do?

@Yangfisher1 perhaps looking at the git history could help. This commit adds the --pre-dump-mode option to zdtm.py https://github.com/checkpoint-restore/criu/commit/20d4920a8bf74d1eceebc076bcc00889ba40e9f7#diff-c9838f32b66301675f1aaba9ee24003ac88febf1f77cd993feba0db4b75d7ea6

rst0git avatar Jan 20 '22 09:01 rst0git

@rst0git Thank you! I'll try it later.

Yangfisher1 avatar Jan 21 '22 03:01 Yangfisher1

@avagin Now I add this option into zdtm test. Could you please tell me whether it's enough for CI because I followed the commit 20d4920 ? Do I need to add some extra test for this new feature? I'm not sure how to add new tests.

Yangfisher1 avatar Jan 22 '22 04:01 Yangfisher1

@Yangfisher1 you might need to update scripts/ci/run-ci-tests.sh to run the CI tests with --on-demand.

rst0git avatar Jan 22 '22 11:01 rst0git

@rst0git Get it. I'll update this file.

Yangfisher1 avatar Jan 23 '22 04:01 Yangfisher1

It seems like the test aarch64 Fedora Rawhide would fail when compiling, quite strange.

Yangfisher1 avatar Jan 23 '22 06:01 Yangfisher1