Restore failed on Oracle Database Server 12c R2

Open dineshputchala opened this issue 7 years ago • 38 comments

Trying to verify the checkpoint/restore feature on Oracle Database Server 12c R2

There was a similar issue (https://github.com/checkpoint-restore/criu/issues/255) last year when I tried this on a non-production docker-1.10.0-dev version.

This time I tried the latest version, since checkpoint/restore is now available as an experimental feature in the regular Docker release.

Steps followed are:

Enabled the experimental flag on "Docker version 17.06.2-ol, build d02b7ab"

bash-4.2$ docker run -d --env-file db_env.dat -p :1521 -p :5500 --name tc --security-opt seccomp:unconfined store/oracle/database-enterprise:12.2.0.1
b1ed6b3ff854241230e357432e779238e4b0a14a32ea9b0661f87697161ac51c

Created a checkpoint once the DB came up:

bash-4.2$ docker checkpoint create tc tc_ck1
tc_ck1

bash-4.2$ docker checkpoint ls tc
CHECKPOINT NAME
tc_ck1

Trying to start the container again from the checkpoint:

bash-4.2$ docker start --checkpoint tc_ck1 tc
Error response from daemon: oci runtime error: criu failed: type NOTIFY errno 0
log file: /var/lib/docker/containers/b1ed6b3ff854241230e357432e779238e4b0a14a32ea9b0661f87697161ac51c/checkpoints/tc_ck1/criu.work/restore-2017-11-17T02:06:14.324615919-08:00/restore.log
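
The daemon error above embeds the path of the CRIU restore log. A minimal sketch of pulling that path out so the log can be inspected or attached, assuming the message keeps the `log file: <path>` shape shown here and the path contains no whitespace. The function name `extract_criu_log_path` is made up for illustration and is not part of Docker or CRIU:

```python
import re
from typing import Optional

# Matches the "log file: <path>" fragment seen in the daemon error above.
# Assumption: the path itself contains no whitespace, as in this report.
_LOG_PATH_RE = re.compile(r"log file:\s*(\S+)")


def extract_criu_log_path(error_message: str) -> Optional[str]:
    """Return the CRIU log path embedded in a daemon error, or None."""
    match = _LOG_PATH_RE.search(error_message)
    return match.group(1) if match else None


if __name__ == "__main__":
    message = (
        "Error response from daemon: oci runtime error: criu failed: "
        "type NOTIFY errno 0 log file: /var/lib/docker/containers/"
        "b1ed6b3ff854241230e357432e779238e4b0a14a32ea9b0661f87697161ac51c/"
        "checkpoints/tc_ck1/criu.work/"
        "restore-2017-11-17T02:06:14.324615919-08:00/restore.log"
    )
    print(extract_criu_log_path(message))
```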

dineshputchala avatar Nov 20 '17 04:11 dineshputchala

Attached the restore.log

restore.log

dineshputchala avatar Nov 20 '17 04:11 dineshputchala

Ugh, there's some problem with AIO ring. Cc @tkhai and @avagin

xemul avatar Nov 21 '17 10:11 xemul

1) @dineshputchala, which kernel version do you use? 2) Is this easy to reproduce in your environment?

tkhai avatar Nov 21 '17 11:11 tkhai

From the shell version and the used glibc and other information in the restore.log this could be CentOS or RHEL. Which CRIU version are you using?

Strange that the CRIU version is not visible in the restore.log. We should also put the kernel version in the dump and restore log.

CRIU on CentOS/RHEL needs an extra patch if built from source: https://git.centos.org/blob/rpms!criu.git/c7/SOURCES!aio-fix.patch

adrianreber avatar Nov 21 '17 12:11 adrianreber

My tests with migrating the Oracle database have always failed, probably due to problems with monotonic time.

adrianreber avatar Nov 21 '17 12:11 adrianreber

Or better: Migration works, but the database shuts down after migration.

adrianreber avatar Nov 21 '17 12:11 adrianreber

CRIU on CentOS/RHEL needs an extra patch if built from source: https://git.centos.org/blob/rpms!criu.git/c7/SOURCES!aio-fix.patch

This is because the formula in the kernel has changed. I submitted patches to fix that a year ago: https://marc.info/?l=openvz-criu&m=146366354304999&w=2

We do not support old kernels as Pasha said: https://marc.info/?l=openvz-criu&m=146373758226363&w=2

So, if the kernel is really old, we do not support it.

tkhai avatar Nov 21 '17 12:11 tkhai

https://medium.com/@kolyshkin/oracle-in-a-docker-container-checkpoint-restore-debug-fun-dda98b7302ed

avagin avatar Nov 21 '17 23:11 avagin

@tkhai @adrianreber @avagin

Docker host details:

bash-4.2$ docker -v
Docker version 17.06.2-ol, build d02b7ab

bash-4.2$ uname -a
Linux slc12moz 4.1.12-61.1.27.el7uek.x86_64 #2 SMP Fri Feb 3 12:31:56 PST 2017 x86_64 x86_64 x86_64 GNU/Linux

bash-4.2$ cat /etc/oracle-release
Oracle Linux Server release 7.3

OS is Oracle Linux 7.3

CRIU version on docker host: criu-2.12-2.el7.x86_64

It's easy to reproduce: we just need to start the DB container and checkpoint it. Trying to restore from the checkpoint reproduces it every time.

dineshputchala avatar Nov 22 '17 04:11 dineshputchala

@dineshputchala could you try the same with criu 3.6?

avagin avatar Nov 22 '17 04:11 avagin

@avagin In the link you shared, bug #296 is mentioned, and it is still open. That bug seems to be about a new feature to be added to CRIU for the Oracle DB restore issue. Are you saying this bug is the same issue?

dineshputchala avatar Nov 22 '17 05:11 dineshputchala

In this bug you met another issue, but it is very probable that #296 will be the next one.

avagin avatar Nov 22 '17 06:11 avagin

So this is interesting. If Oracle Linux uses the RHEL criu package on a newer kernel, with the special AIO patch I added for RHEL, it will not work. @dineshputchala you need to talk to your vendor and tell them that their criu package is wrong.

adrianreber avatar Nov 22 '17 10:11 adrianreber

Installed CRIU 3.6 on my machine by building it, as this package was not available in my repos. It took some time to resolve a lot of dependencies while building the CRIU code, and it was not easy or straightforward!

Hurray... I could finally install CRIU 3.6!

dineshputchala avatar Nov 27 '17 10:11 dineshputchala

@avagin Attempted the checkpoint/restore experiment again on Oracle Database Server 12c R2 with the latest CRIU version (3.6)!

This time, it's a different story...

I was able to checkpoint, and the restore did not throw any error, but the DB inside the container did not come up successfully.

bash-4.2$ docker checkpoint create cont_criu3 cont_criu3_chk
cont_criu3_chk

bash-4.2$ docker checkpoint ls cont_criu3
CHECKPOINT NAME
cont_criu3_chk

bash-4.2$ docker start --checkpoint cont_criu3_chk cont_criu3
bash-4.2$

I checked the alert log and could see the errors and warnings below:

Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process
Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process
Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process
Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process

Warning: VKTM detected a forward time drift.
Warning: 52 processes are still attach to shmid 98307:
Warning: 51 processes are still attach to shmid 98307:
...
...

This seems to be the same issue as observed in https://github.com/checkpoint-restore/criu/issues/296. I have attached the alert log as well; please check!

alert.log

dineshputchala avatar Nov 27 '17 10:11 dineshputchala

@dineshputchala nice, now you need to talk to Oracle and ask them to support migration. The Oracle database seems to have problems if the time changes. This is expected, as the time will keep on running as long as your container is stopped. It is even worse for migration, as the kernel timers on the destination system will be completely different. So this is unrelated to CRIU and needs to be changed in the database.
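
To illustrate why a stopped container looks like a time jump from the inside, here is a rough sketch of a timekeeper that samples a clock at a fixed interval and flags gaps that are much longer than expected. This is not how Oracle's VKTM works internally (the thread does not describe that); the function name, the sampling interval and the tolerance are all assumptions made for the example:

```python
from typing import List, Tuple


def find_forward_jumps(
    samples: List[float], interval: float, tolerance: float
) -> List[Tuple[int, float]]:
    """Return (sample index, excess seconds) for each forward jump.

    `samples` are clock readings in seconds, taken by a process that
    sleeps `interval` seconds between readings. A gap that exceeds the
    interval by more than `tolerance` is reported as a forward jump.
    """
    jumps = []
    for index in range(1, len(samples)):
        elapsed = samples[index] - samples[index - 1]
        excess = elapsed - interval
        if excess > tolerance:
            jumps.append((index, excess))
    return jumps


if __name__ == "__main__":
    # The process was frozen between the third and fourth reading while
    # the clock kept running for about ten minutes.
    readings = [100.0, 101.0, 102.0, 702.0, 703.0]
    print(find_forward_jumps(readings, interval=1.0, tolerance=0.5))
```

From the process's point of view, the freeze between checkpoint and restore is indistinguishable from the clock leaping forward, which matches the "forward time drift" warning in the alert log.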

A time namespace in the kernel could be a solution to handle this but this needs to be implemented in the kernel.
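
To make the idea concrete, here is a sketch of the arithmetic a per-container clock offset would involve. It is only an illustration, under the assumption that such a namespace applies a fixed offset to the host's monotonic clock; the function names are hypothetical and do not correspond to any kernel or CRIU interface:

```python
def monotonic_offset(
    checkpoint_monotonic: float, host_monotonic_at_restore: float
) -> float:
    """Offset that makes the container clock resume where it stopped.

    Both arguments are monotonic clock readings in seconds: the value the
    container last saw at checkpoint time, and the value of the
    (possibly different) host clock at the moment of restore.
    """
    return checkpoint_monotonic - host_monotonic_at_restore


def container_monotonic(host_monotonic: float, offset: float) -> float:
    """Monotonic time as seen inside the container."""
    return host_monotonic + offset


if __name__ == "__main__":
    # Checkpointed at 5000 s; the destination host has only been up 120 s.
    offset = monotonic_offset(5000.0, 120.0)
    print(offset)
    # Ten seconds after restore, the container sees ten seconds of progress.
    print(container_monotonic(130.0, offset))
```

With an offset like this, the restored processes would see neither the downtime nor the difference between source and destination hosts.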

adrianreber avatar Nov 27 '17 11:11 adrianreber

A similar issue is observed in bug https://github.com/checkpoint-restore/criu/issues/296, which requires changes in the kernel and then in CRIU.

dineshputchala avatar Nov 28 '17 07:11 dineshputchala

From the alert log:

Warning: VKTM detected a forward time drift.
Time drifts can result in unexpected behavior such as time-outs.
Please see the VKTM trace file for more details:
/u01/app/oracle/diag/rdbms/orclcdb/ORCLCDB/trace/ORCLCDB_vktm_64.trc
2017-11-27T06:43:17.786116+00:00
PMON (ospid: 58): terminating the instance due to error 472
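
For anyone checking their own alert log for the same symptoms, a small sketch that scans log lines for the two messages quoted above. The marker strings are copied from this excerpt; the function name and the returned structure are assumptions for illustration, and other Oracle versions may word these messages differently:

```python
from typing import Dict, Iterable, List

# Substrings copied from the alert log excerpt in this thread.
MARKERS = {
    "time_drift": "VKTM detected a forward time drift",
    "instance_terminated": "terminating the instance due to error",
}


def scan_alert_log(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group alert log lines by which known marker they contain."""
    hits: Dict[str, List[str]] = {name: [] for name in MARKERS}
    for line in lines:
        for name, needle in MARKERS.items():
            if needle in line:
                hits[name].append(line.strip())
    return hits


if __name__ == "__main__":
    excerpt = [
        "Warning: VKTM detected a forward time drift.",
        "Time drifts can result in unexpected behavior such as time-outs.",
        "2017-11-27T06:43:17.786116+00:00",
        "PMON (ospid: 58): terminating the instance due to error 472",
    ]
    for name, matched in scan_alert_log(excerpt).items():
        print(name, len(matched))
```

To scan a real file, pass an open file object, for example `scan_alert_log(open("alert.log"))`.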

The same issue is observed in bug #296, which requires changes in the kernel and then in CRIU.

dineshputchala avatar Dec 01 '17 10:12 dineshputchala

Any update on the time namespace implementation in the kernel and CRIU?

dineshputchala avatar Sep 03 '18 06:09 dineshputchala

Andrey can speak better to the CRIU status, since he is diving into this at the moment. But I want to touch on another direction. @dineshputchala, have you tried asking Oracle to work around this issue in the meantime, until we have a solution in the kernel and CRIU?

tkhai avatar Sep 03 '18 08:09 tkhai

@adrianreber @avagin Any update on the implementation of the time namespace?

dineshputchala avatar Sep 03 '18 08:09 dineshputchala

@dineshputchala we are going to send an RFC next week: https://github.com/0x7f454c46/linux/tree/wip/time-ns

avagin avatar Sep 05 '18 01:09 avagin

Any update on the time namespace implementation in the kernel?

Any update on the CRIU changes to support it?

dineshputchala avatar Jan 11 '19 05:01 dineshputchala

@dineshputchala We sent the RFC version: https://lkml.org/lkml/2018/9/19/950

Then we discussed it at LPC: https://www.youtube.com/watch?v=sjRUiqJVzOA&t=93s

And now we are working on the second version of these patches. We are going to post them this month.

avagin avatar Jan 11 '19 07:01 avagin

@avagin After the kernel changes, CRIU also needs changes to use this feature, right?

dineshputchala avatar Feb 18 '19 05:02 dineshputchala

@dineshputchala yes, we will need to add some code in CRIU to support time namespaces. But this should not be hard.

avagin avatar Feb 21 '19 19:02 avagin

@avagin Which kernel version supports time namespaces? Is CRIU support for time namespaces done?

dineshputchala avatar Aug 01 '19 03:08 dineshputchala

https://lkml.org/lkml/2019/7/29/1699

rst0git avatar Aug 01 '19 07:08 rst0git

@rst0git Where do I check which kernel version has picked up these changes? @avagin Is CRIU support for time namespaces also done?

dineshputchala avatar Aug 12 '19 06:08 dineshputchala

@dineshputchala the patch series for time namespaces is not merged upstream yet. The link above is to the latest version of this patch series.

rst0git avatar Aug 12 '19 08:08 rst0git