Restore failed on Oracle Database Server 12c R2
I am trying to verify the checkpoint/restore feature on Oracle Database Server 12c R2.
There was a similar issue (https://github.com/checkpoint-restore/criu/issues/255) last year, when I tried a non-production docker-1.10.0-dev build.
This time I tried the latest version, since checkpoint/restore is now available as an experimental feature in the regular Docker release.
Steps followed are:
Enabled the experimental flag on "Docker version 17.06.2-ol, build d02b7ab"
bash-4.2$ docker run -d --env-file db_env.dat -p :1521 -p :5500 --name tc --security-opt seccomp:unconfined store/oracle/database-enterprise:12.2.0.1
b1ed6b3ff854241230e357432e779238e4b0a14a32ea9b0661f87697161ac51c
Created checkpoint once the db came up,
bash-4.2$ docker checkpoint create tc tc_ck1
tc_ck1
bash-4.2$ docker checkpoint ls tc
CHECKPOINT NAME
tc_ck1
Trying to start the container again using checkpoint,
bash-4.2$ docker start --checkpoint tc_ck1 tc
Error response from daemon: oci runtime error: criu failed: type NOTIFY errno 0
log file: /var/lib/docker/containers/b1ed6b3ff854241230e357432e779238e4b0a14a32ea9b0661f87697161ac51c/checkpoints/tc_ck1/criu.work/restore-2017-11-17T02:06:14.324615919-08:00/restore.log
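For anyone who wants to repeat the steps above, here is a minimal sketch that replays the same docker commands from Python. The helper names (`checkpoint_restore_commands`, `run_steps`) are made up for illustration and are not part of Docker or CRIU. It assumes the container is already running and that the experimental flag is enabled on the daemon. By default it only prints the commands.

```python
import subprocess


def checkpoint_restore_commands(container, checkpoint):
    """Return the docker commands used in this report, in order."""
    return [
        ["docker", "checkpoint", "create", container, checkpoint],
        ["docker", "checkpoint", "ls", container],
        ["docker", "start", "--checkpoint", checkpoint, container],
    ]


def run_steps(container, checkpoint, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in checkpoint_restore_commands(container, checkpoint):
        print("$", " ".join(command))
        if not dry_run:
            # check=True stops at the first failing step (e.g. the restore error).
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # "tc" and "tc_ck1" are the names used in this report.
    run_steps("tc", "tc_ck1", dry_run=True)
```

With `dry_run=False` the restore step would raise `subprocess.CalledProcessError` when the daemon returns the error shown above.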
Ugh, there's some problem with AIO ring. Cc @tkhai and @avagin
1) @dineshputchala, which kernel version do you use? 2) Is this easy to reproduce in your environment?
Judging by the shell version, the glibc in use, and other information in restore.log, this could be CentOS or RHEL. Which CRIU version are you using?
Strange that the CRIU version is not visible in the restore.log. We should also put the kernel version in the dump and restore log.
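Until the logs carry this information themselves, a reporter could collect it by hand. The following is only a sketch of that idea, not something CRIU does: the function names are hypothetical, and it assumes `criu --version` is available when the binary is in PATH.

```python
import platform
import re
import shutil
import subprocess


def parse_kernel_release(release):
    """Return the leading numeric version of a `uname -r` string as a tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level (e.g. "5.6") is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def criu_version_line():
    """Return the first line of `criu --version`, or a placeholder if CRIU is absent."""
    if shutil.which("criu") is None:
        return "criu: not found in PATH"
    result = subprocess.run(
        ["criu", "--version"], capture_output=True, text=True, check=False
    )
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "criu: no version output"


def environment_header():
    """Build a one-line summary suitable for the top of a bug report."""
    return f"kernel={platform.release()} | {criu_version_line()}"


if __name__ == "__main__":
    print(environment_header())
```

The tuple form makes it easy to compare a host against a minimum kernel version, for example `parse_kernel_release(platform.release()) >= (4, 1, 0)`.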
CRIU on CentOS/RHEL needs an extra patch if built from source: https://git.centos.org/blob/rpms!criu.git/c7/SOURCES!aio-fix.patch
My tests migrating the Oracle database have always failed, probably due to problems with monotonic time.
Or, more precisely: migration works, but the database shuts down afterwards.
The extra AIO patch is needed on CentOS/RHEL because the formula in the kernel has changed. I submitted patches to fix that a year ago: https://marc.info/?l=openvz-criu&m=146366354304999&w=2
We do not support old kernels as Pasha said: https://marc.info/?l=openvz-criu&m=146373758226363&w=2
So, if the kernel is really old, we do not support it.
https://medium.com/@kolyshkin/oracle-in-a-docker-container-checkpoint-restore-debug-fun-dda98b7302ed
@tkhai @adrianreber @avagin
Docker host details:
bash-4.2$ docker -v
Docker version 17.06.2-ol, build d02b7ab
bash-4.2$ uname -a
Linux slc12moz 4.1.12-61.1.27.el7uek.x86_64 #2 SMP Fri Feb 3 12:31:56 PST 2017 x86_64 x86_64 x86_64 GNU/Linux
bash-4.2$ cat /etc/oracle-release
Oracle Linux Server release 7.3
OS is Oracle Linux 7.3
CRIU version on docker host: criu-2.12-2.el7.x86_64
It's easy to reproduce: just start the db container and checkpoint it. Then try to restore from the checkpoint; it fails every time.
@dineshputchala could you try the same with criu 3.6?
@avagin The link you shared mentions bug #296, which is still open. That bug seems to be a request for a new CRIU feature to address the Oracle db restore issue. Are you saying this bug is the same issue?
In this bug you hit a different issue, but #296 will very probably be the next one you run into.
So this is interesting. If Oracle Linux uses the RHEL criu package, with the special AIO patch I added for RHEL, on a newer kernel, it will not work. @dineshputchala you need to talk to your vendor and tell them that their criu package is wrong.
I installed CRIU 3.6 on my machine by building it from source, as the package was not available in my repos. It took some time to resolve the many build dependencies, and it was not easy or straightforward!
Hurray, I finally got CRIU 3.6 installed!
@avagin I attempted the checkpoint/restore experiment again on Oracle Database Server 12c R2 with the latest CRIU version (3.6).
This time it's a different story.
The checkpoint worked and the restore did not throw any error, but the db inside the container did not come up successfully.
bash-4.2$ docker checkpoint create cont_criu3 cont_criu3_chk
cont_criu3_chk
bash-4.2$ docker checkpoint ls cont_criu3
CHECKPOINT NAME
cont_criu3_chk
bash-4.2$ docker start --checkpoint cont_criu3_chk cont_criu3
bash-4.2$
I checked the alert logs and saw the following errors and warnings:
Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process
Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process
Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process
Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process
Warning: VKTM detected a forward time drift.
Warning: 52 processes are still attach to shmid 98307:
Warning: 51 processes are still attach to shmid 98307:
... ...
This seems to be the same issue as observed in https://github.com/checkpoint-restore/criu/issues/296. I have attached the alert log as well; please check!
@dineshputchala nice, now you need to ask Oracle to support migration. The Oracle database seems to have problems if the time changes. This is expected, as time keeps running while your container is stopped. It is even worse for migration, as the kernel timers on the destination system will be completely different. So this is unrelated to CRIU and needs to be changed in the database.
A time namespace in the kernel could be a solution to handle this, but it would first need to be implemented in the kernel.
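To make the clock behaviour described here concrete, the sketch below shows how a timekeeping process could notice that the clock advanced far more than it expected between two samples. This is purely illustrative and is not how Oracle's VKTM is implemented: the class name, the tolerance value, and the fake clock readings are all assumptions.

```python
import time


class ForwardJumpDetector:
    """Flag samples where the clock advanced far more than the expected interval."""

    def __init__(self, clock=time.monotonic, tolerance=1.0):
        self._clock = clock
        self._tolerance = tolerance  # seconds of slack before a gap counts as a jump
        self._last = clock()

    def check(self, expected_interval):
        """Return the unexpected extra seconds since the last sample, or 0.0."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        excess = elapsed - expected_interval
        return excess if excess > self._tolerance else 0.0


def simulate_checkpoint_gap():
    # Fake clock: two ticks 10 ms apart, then a 300 s gap standing in for the
    # time a container spent checkpointed while the host clock kept running.
    readings = iter([100.0, 100.01, 400.01])
    detector = ForwardJumpDetector(clock=lambda: next(readings), tolerance=1.0)
    normal = detector.check(expected_interval=0.01)
    after_restore = detector.check(expected_interval=0.01)
    return normal, after_restore


if __name__ == "__main__":
    print(simulate_checkpoint_gap())
```

From the restored process's point of view nothing happened between the last two samples, yet the clock reports roughly 300 extra seconds, which is the kind of forward drift the alert log complains about.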
A similar issue is observed in bug https://github.com/checkpoint-restore/criu/issues/296, which requires changes in the kernel and then in CRIU.
From the alert log:
Warning: VKTM detected a forward time drift.
Time drifts can result in unexpected behavior such as time-outs.
Please see the VKTM trace file for more details:
/u01/app/oracle/diag/rdbms/orclcdb/ORCLCDB/trace/ORCLCDB_vktm_64.trc
2017-11-27T06:43:17.786116+00:00
PMON (ospid: 58): terminating the instance due to error 472
The same issue is observed in bug #296, which requires changes in the kernel and then in CRIU.
Is there any update on the time namespace implementation in the kernel and CRIU?
Andrey can speak better to the CRIU status, since he is digging into this at the moment. But I want to raise another direction: @dineshputchala, have you tried asking Oracle to work around this issue in the meantime, until there is a solution in the kernel and CRIU?
@adrianreber @avagin Any update on the implementation of the time namespace?
@dineshputchala we are going to send an RFC next week: https://github.com/0x7f454c46/linux/tree/wip/time-ns
Any update on the time namespace implementation in the kernel?
Any update on the CRIU changes to support it?
@dineshputchala We sent the RFC version: https://lkml.org/lkml/2018/9/19/950
Then we discussed it at LPC: https://www.youtube.com/watch?v=sjRUiqJVzOA&t=93s
And now we are working on the second version of these patches. We are going to post them this month.
@avagin After the kernel changes, CRIU will also need changes to use this feature, right?
@dineshputchala yes, we will need to add some code in CRIU to support time namespaces. But this should not be hard.
@avagin Which kernel version supports time namespaces? Is CRIU support for time namespaces done?
https://lkml.org/lkml/2019/7/29/1699
@rst0git Where can I check which kernel version has picked up these changes? @avagin Is CRIU support for time namespaces also done?
@dineshputchala the patch series for time namespace is not merged upstream yet. The link above is to the latest version of this patch series.