[GSoC: 2020] Optimize logging engine
This PR is a follow-up to PR #1145.
It changes the existing logging system and adds an additional binary logger. The user now has the option to log either in binary or via the existing printer. zdtm.py logs entirely in binary, and the decoder is used to convert these logs to a human-readable format before printing them.
The changes that are still needed in this PR are:
- Fixing the _Generic macro for CentOS: using P99_GENERIC gives a GCC internal compiler error.
- The default case for _Generic has been made 0: it needs to be replaced with a way to determine whether the input passed is a struct pointer or a function pointer.
- The decoder fails to build on arm64: the va_list is passed as (void *), which is accepted by all other architectures but is not acceptable on arm64.
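For readers unfamiliar with the approach, here is a minimal, self-contained sketch of what a binary log record of this kind could look like. The struct, field layout and helpers are hypothetical illustrations, not the PR's actual format: instead of formatting text at log time, the writer stores the format string and raw argument values, and the decoder formats them later.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record layout (not the PR's on-disk format). */
struct blog_rec {
	uint32_t size;    /* total record size, lets a decoder skip ahead */
	uint32_t nargs;   /* number of 64-bit argument slots used         */
	char     fmt[64]; /* format string (a real format would store an
	                   * offset into a string table instead)          */
	uint64_t args[4]; /* arguments widened to 64 bit                  */
};

/* "Logging": just store the format and the raw values. */
static void blog_encode(struct blog_rec *r, const char *fmt,
			uint64_t a0, uint64_t a1)
{
	memset(r, 0, sizeof(*r));
	r->size = sizeof(*r);
	r->nargs = 2;
	strncpy(r->fmt, fmt, sizeof(r->fmt) - 1);
	r->args[0] = a0;
	r->args[1] = a1;
}

/* "Decoding": expand the stored record into human-readable text.
 * A real decoder would interpret the conversions in r->fmt generically;
 * here we assume two integer arguments for illustration. */
static void blog_decode(const struct blog_rec *r)
{
	printf(r->fmt, (unsigned long)r->args[0], (unsigned long)r->args[1]);
}

int main(void)
{
	struct blog_rec r;

	blog_encode(&r, "Dumping task %lu (session %lu)\n", 48, 1);
	blog_decode(&r);
	return 0;
}
```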
Awesome :) it's a pity that aarch64 is the blocker. Is it possible to teach zdtm.py not to use binary logging on this arch? If yes, we could merge the PR into dev.
@xemul Since this error arises while compiling the decoder on the aarch64 and arm architectures, I disabled binary logging for these two architectures. All the others use the flog printer, but these two use the regular printer for logging. The decoder called by zdtm.py then just copies the contents of one file to another, and the contents are printed from that file.
@prakritigoyal19 , the build in the CentOS environment fails: https://travis-ci.org/github/checkpoint-restore/criu/jobs/721650710
Yes sir, the _Generic macro is only supported from C11 onwards, and the compiler on CentOS doesn't support this yet, hence the error. I had asked about this issue in the Gitter community group and was then advised to try the P99_GENERIC macro described here: https://gustedt.wordpress.com/2012/01/02/emulating-c11-compiler-features-with-gcc-_generic/ As mentioned in the PR message, when I try to use it in the code it leads to an internal compiler error, and I am having trouble understanding why this occurs. I have been looking for other alternatives but couldn't find a workable solution. Could you please suggest any other alternatives for me to try out?
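For context, this is the C11 feature in question; the macro below is a generic illustration, not the PR's actual code. GCC only gained _Generic in 4.9, and CentOS 7 ships GCC 4.8.x, which is why the build there needs either a fallback macro or, as suggested below, simply not compiling the flog support in.

```c
#include <stdio.h>

/* Illustration of C11 type-generic selection: _Generic picks one of the
 * associations based on the static type of its controlling expression. */
#define type_name(x) _Generic((x),	\
	int:          "int",		\
	long:         "long",		\
	const char *: "const char *",	\
	void *:       "void *",		\
	default:      "other")

int main(void)
{
	int n = 0;
	const char *s = "hi";

	printf("%s\n", type_name(n)); /* prints "int" */
	printf("%s\n", type_name(s)); /* prints "const char *" */
	return 0;
}
```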
Then we should just not compile in the flog support in environments that do not support _Generic.
flog/include/compiler.h → criu/include/compiler.h
Do you need this?
I don't see anything new there, judging from git diff flog/include/compiler.h include/common/compiler.h.
Looking at the end result, I have some questions:
- You don't integrate it with parasite/restorer as far as I can see, right? Is it planned? How do they work right now: do they write text data into the binary log file?
- I don't see any synchronization in the buffer code. I doubt we always run in single-threaded mode, even leaving the restorer aside: at least swrk mode and the userns-helpers come to mind, which fork() out and will share the same log buffer. I'm probably missing something, but won't they overwrite each other?
- As I understand it, the binary logging is for performance. Could you also gather some numbers to see what the speed-up is?
- The testing. Presumably there is some plan for testing it? What's on my mind: can we run C/R twice and compare the log output for the text/binary forms? I think the numbers may differ (PIDs/FDs/etc), so we can cut the numbers from the logs and check that all the remaining text is the same. Probably, when PIE support gets added, messages may appear in a different sequence, but that is also solvable by sorting the results. This would verify that both logging engines keep working and provide the same result. Practically, such testing could even be run for all zdtm tests in a separate Travis job.
Thanks for working on it, Dmitry
Looking at the end result, I have some questions:
- You don't integrate it with parasite/restorer as far as I can see, right? Is it planned? How do they work right now: do they write text data into the binary log file?
Currently, the logs from compel are just printed as text data into the binary log file. The decoder is built in such a way that whenever it encounters such messages, it simply copies them to the output file as they are. Earlier, when the decoder wasn't able to parse these messages, there was a plan to use register_logging for the libraries. Are we continuing with that plan, or are we introducing binary logging in compel too?
- I don't see any synchronization in the buffer code. I doubt we always run in single-threaded mode, even leaving the restorer aside: at least swrk mode and the userns-helpers come to mind, which fork() out and will share the same log buffer. I'm probably missing something, but won't they overwrite each other?
When I ran this code on my system, I could see multiple forks during restore, and when I compared the result with the logs produced by the regular printer, the output of both was similar. I am looking at the documentation on how to run criu in swrk mode, but I am not completely clear on how to do that.
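I can't tell from the thread how the PR's buffer is actually set up, but for the fork() concern raised above, one common pattern is to keep the log buffer in a MAP_SHARED mapping and reserve space with an atomic fetch-add, so every writer (including fork()ed helpers) gets a disjoint slice. Below is a minimal sketch under those assumptions; all names are hypothetical and this is not the PR's code.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define LOG_BUF_SIZE (1 << 20)

struct shared_log {
	_Atomic uint64_t tail;           /* next free offset */
	char             data[LOG_BUF_SIZE];
};

static struct shared_log *log_map(void)
{
	/* MAP_SHARED keeps both the data and the tail counter visible to
	 * children created by fork(). */
	return mmap(NULL, sizeof(struct shared_log), PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
}

static char *log_reserve(struct shared_log *l, size_t len)
{
	/* Atomically claim [off, off + len); concurrent writers get
	 * non-overlapping slices and cannot overwrite each other. */
	uint64_t off = atomic_fetch_add(&l->tail, len);

	if (off + len > LOG_BUF_SIZE)
		return NULL;             /* full; a real logger would flush */
	return l->data + off;
}

static void log_write(struct shared_log *l, const char *msg)
{
	size_t len = strlen(msg);
	char *slot = log_reserve(l, len);

	if (slot)
		memcpy(slot, msg, len);
}

int main(void)
{
	struct shared_log *l = log_map();

	if (l == MAP_FAILED)
		return 1;
	log_write(l, "parent or forked helper writes here\n");
	return 0;
}
```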
- As I understand it, the binary logging is for performance. Could you also gather some numbers to see what the speed-up is?
I compared the time taken to dump and restore a program running a simple loop. Averaged over a few trials, these are the values I get:
- Dump without binary logging: 53s
- Restore without binary logging: 20s
- Dump with binary logging: 42s
- Restore with binary logging: 18s
That is roughly a 20% speed-up for the dump and 10% for the restore. Are there any other programs you would like me to run to collect these numbers?
- The testing. Presumably there is some plan for testing it? What's on my mind: can we run C/R twice and compare the log output for the text/binary forms? I think the numbers may differ (PIDs/FDs/etc), so we can cut the numbers from the logs and check that all the remaining text is the same. Probably, when PIE support gets added, messages may appear in a different sequence, but that is also solvable by sorting the results. This would verify that both logging engines keep working and provide the same result. Practically, such testing could even be run for all zdtm tests in a separate Travis job.
I did the above-mentioned testing manually and compared the output produced by the normal logger and the binary logger; they were similar. Would you like me to try to write a script to automate this checking process?
In the commit “flog: add option to fallback to default textual logging” I added a check to see whether the messages are being printed to the terminal. If the fd points at the terminal, they are printed using the regular printer. I added this commit as an extra check. Should I still make the change?
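For illustration, the usual way to perform the terminal check described here is isatty(); the sketch below is hypothetical and not the PR's code:

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: decide whether to emit binary records or to fall
 * back to the regular text printer. */
static bool log_use_binary(int fd)
{
	/* isatty() returns 1 when fd refers to a terminal; in that case a
	 * human is reading the output, so plain text is the safer choice. */
	return !isatty(fd);
}

int main(void)
{
	printf("use binary logging on stderr: %s\n",
	       log_use_binary(STDERR_FILENO) ? "yes" : "no");
	return 0;
}
```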
If I run simple env00 with this series applied:
./test/zdtm.py run -t zdtm/static/env00 -f h -k always
"decode" hangs on decoding for far too long at 100% CPU load:
crit/decoder test/dump/zdtm/static/env00/48/1/dump.log /proc/self/fd/0
Version: 3.14 (gitid v3.14-130-g6aa7d492e)
Running on snorch Linux 5.8.7-200.fc32.x86_64 #1 SMP Mon Sep 7 15:26:10 UTC 2020 x86_64
Loaded kdat cache from /run/criu.kdat
========================================
Dumping processes (pid: 48)
========================================
rlimit: RLIMIT_NOFILE unlimited for self
Running pre-dump scripts
irmap: Searching irmap cache in work dir
No irmap-cache image
irmap: Searching irmap cache in parent
irmap: No irmap cache
...
Spinning endlessly in decode_all while loop on same mbuf:
(gdb) p *mbuf@200
$25 = "(00.0001\000\000\000\000\000\000\000\000\030\000\000\000\000\000\000\000irmap: No irmap cache\n\000cache in parent\n\000\n\000iu.kdat\000 %s %s %s %s %s\n\000snorch\000Linux\000\065.8.7-200.fc32.x86_64\000#1 SMP Mon Sep 7 15:26:10 UTC 2020\000x86_64", '\000' <repeats 32 times>
Decoding requires good testing.
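The thread doesn't show what actually causes the spin, but a frequent reason for a decoder looping on the same buffer position is a record that decodes to zero consumed bytes, so the read offset never advances. Here is a defensive loop shape under that assumption; the names and the stub below are hypothetical, not the PR's decoder:

```c
#include <stddef.h>
#include <stdio.h>

/* Stub stand-in for the real per-record decoder: here it just copies one
 * '\n'-terminated text line to out and returns the bytes it consumed. */
static size_t decode_one(const char *buf, size_t left, FILE *out)
{
	size_t i;

	for (i = 0; i < left; i++) {
		fputc(buf[i], out);
		if (buf[i] == '\n')
			return i + 1;
	}
	return i; /* consumed the remainder (no trailing newline) */
}

/* The guard: if a record decodes to zero consumed bytes (corrupted or
 * unrecognized input), bail out instead of re-reading the same position
 * forever. */
static int decode_all(const char *buf, size_t size, FILE *out)
{
	size_t off = 0;

	while (off < size) {
		size_t used = decode_one(buf + off, size - off, out);

		if (used == 0) {
			fprintf(stderr, "corrupted record at offset %zu\n", off);
			return -1;
		}
		off += used;
	}
	return 0;
}

int main(void)
{
	const char msg[] = "irmap: No irmap cache\n";

	return decode_all(msg, sizeof(msg) - 1, stdout) ? 1 : 0;
}
```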
In the commit “flog: add option to fallback to default textual logging” I added a check to see whether the messages are being printed to the terminal. If the fd points at the terminal, they are printed using the regular printer. I added this commit as an extra check. Should I still make the change?
Yes, you should. A "tty file" and "no file at all" are different things, in my view.
Okay sir, I will make this change.
Do you have any suggestions on how I should go about testing it?
Probably you can run your script manually on all the logs zdtm can generate with ./test/zdtm.py run -a -k always
to check that at least we would still be able to read the zdtm logs after your change.
I also think you need to prepare a fake test log which covers binary log messages mixed with printf log messages, binary log messages for different types of variables, and all the other cases you cover (you should know them better than me =)). Then add a bash script to check it, so that it can be run in jenkins/travis/github-checks.
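As a rough illustration of such a fake test log, a generator that mixes plain text lines with one binary record might look like the sketch below; the real record layout is defined by the PR, so the magic value and fields here are invented:

```c
#include <stdint.h>
#include <stdio.h>

#define FAKE_MAGIC 0x464c4f47u /* "FLOG" - made up for this sketch */

/* Invented record layout; a real test would use the PR's format. */
struct fake_rec {
	uint32_t magic;
	uint32_t nargs;
	char     fmt[32];
	uint64_t args[2];
};

int main(void)
{
	FILE *f = fopen("fake-dump.log", "wb");
	struct fake_rec r = {
		.magic = FAKE_MAGIC,
		.nargs = 2,
		.fmt   = "Dumping task %lu of %lu\n",
		.args  = { 48, 1 },
	};

	if (!f)
		return 1;
	/* Plain printf-style lines surrounding a binary record - the mix
	 * a decoder test would have to cope with. */
	fputs("plain text line before the record\n", f);
	fwrite(&r, sizeof(r), 1, f);
	fputs("plain text line after the record\n", f);
	fclose(f);
	return 0;
}
```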
A friendly reminder that this PR had no activity for 30 days.