oci-systemd-hook icon indicating copy to clipboard operation
oci-systemd-hook copied to clipboard

running prestart hook 0 caused error

Open pschiffe opened this issue 8 years ago • 19 comments

Hello,

oci-systemd-hook on latest RHEL atomic host doesn't work with runc.

# atomic host status
State: idle
Deployments:
● rhel-atomic-host:rhel-atomic-host/7/x86_64/standard
             Version: 7.3.6 (2017-06-23 16:20:45)
              Commit: e073a47baa605a99632904e4e05692064302afd8769a15290d8ebe8dbfd3c81b
# rpm -q oci-systemd-hook runc
oci-systemd-hook-0.1.7-4.gite533efa.el7.x86_64
runc-1.0.0-6.gite800860.el7.x86_64

This is the error I'm getting:

runc[15209]: container_linux.go:259: starting container process caused "process_linux.go:345: container init caused \"process_linux.go:328: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: \\\"\""

Config.json template is here: https://github.com/pschiffe/gce-system-container/blob/master/image/config.json.template

Can you help? How to debug?

pschiffe avatar Jul 19 '17 19:07 pschiffe

It might be a mismatch between the version of runc and the version of oci-systemd-hook? Did those two versions ship together? How did you call out to oci-systemd-hook?

rhatdan avatar Jul 19 '17 19:07 rhatdan

Those two packages should align correctly as both are from the latest version of rhel atomic host (7.3.6):

# rpm -q oci-systemd-hook runc
oci-systemd-hook-0.1.7-4.gite533efa.el7.x86_64
runc-1.0.0-6.gite800860.el7.x86_64

This is how I'm calling hooks in the config.json.template:

    "hooks": {
        "prestart": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook"
            },
            {
                "path": "/usr/libexec/oci/hooks.d/oci-register-machine"
            }
        ],
        "poststop": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook"
            },
            {
                "path": "/usr/libexec/oci/hooks.d/oci-register-machine"
            }
        ]
},

pschiffe avatar Jul 19 '17 21:07 pschiffe

The hooks expect the first option to the hook to be the prestart/poststart. Add those arguments and it should work.

Hooks can not tell which phase they are running unless you pass in the argv[1]

rhatdan avatar Jul 19 '17 23:07 rhatdan

Unfortunately, even with the args I see the same error:

runc[16416]: container_linux.go:259: starting container process caused "process_linux.go:345: container init caused \"process_linux.go:328: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: \\\"\""

I've updated the config.json hooks to look like this:

    "hooks": {
        "prestart": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook",
                "args": [ "prestart" ]
            },
            {
                "path": "/usr/libexec/oci/hooks.d/oci-register-machine",
                "args": [ "prestart" ]
            }
        ],
        "poststop": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook",
                "args": [ "poststop" ]
            },
            {
                "path": "/usr/libexec/oci/hooks.d/oci-register-machine",
                "args": [ "poststop" ]
            }
        ]
    },

BTW, my previous configuration (without the args) worked fine on rhel atomic host 7.3.2.

pschiffe avatar Jul 20 '17 12:07 pschiffe

I've tried the same image on centos atomic host continuous:

# atomic host status
State: idle
Deployments:
● centos-atomic-continuous:centos-atomic-host/7/x86_64/devel/continuous
                   Version: 7.2017.490 (2017-07-19 16:05:22)
                    Commit: a948637e77018755831659c791a8cf8595f6d267d4fddba24a6b42f0ec6f1bd7

# rpm -q oci-systemd-hook runc
oci-systemd-hook-0.1.7-4.gite533efa.el7.x86_64
runc-1.0.0-9.git6394544.el7.x86_64

But I see the same error:

runc[12188]: container_linux.go:265: starting container process caused "process_linux.go:339: container init caused \"process_linux.go:322: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: \\\"\""

pschiffe avatar Jul 20 '17 12:07 pschiffe

@mrunalp WDYT?

@pschiffe Are you seeing anything in the journal? Are you seeing the same thing in Fedora?

rhatdan avatar Jul 20 '17 12:07 rhatdan

Hmm, I see something more in journal:

Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Starting Linux Guest Environment for Google Compute Engine...
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Scope libcontainer-12188-systemd-test-default-dependencies.scope has no PIDs. Refusing.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Scope libcontainer-12188-systemd-test-default-dependencies.scope has no PIDs. Refusing.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Created slice libcontainer_12188_systemd_test_default.slice.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Starting libcontainer_12188_systemd_test_default.slice.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Removed slice libcontainer_12188_systemd_test_default.slice.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Stopping libcontainer_12188_systemd_test_default.slice.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Started libcontainer container gce-agents.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Starting libcontainer container gce-agents.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal prestart[12201]: systemdhook <error>: root not found in state: Success
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Stopped libcontainer container gce-agents.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Stopping libcontainer container gce-agents.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal poststop[12204]: systemdhook <error>: root not found in state: Success
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal runc[12188]: container_linux.go:265: starting container process caused "process_linux.go:339: container init caused \"process_linux.go:322: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: \\\"\""
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: gce-agents.service: main process exited, code=exited, status=1/FAILURE
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: Unit gce-agents.service entered failed state.
Jul 20 12:18:03 centosah-1.c.ose-refarch.internal systemd[1]: gce-agents.service failed.

pschiffe avatar Jul 20 '17 12:07 pschiffe

@pschiffe you should specify args as:

        "prestart": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook",
                "args": [ "oci-systemd-hook", "prestart" ]
            },

as they are passed as they are to exec. args[0] can really be anything as it is not used.

It would be nice if it still works without any arg as it used to work before, I am taking a look right now.

giuseppe avatar Jul 20 '17 12:07 giuseppe

Didn't help, still the same error:

Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Starting Linux Guest Environment for Google Compute Engine...
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Scope libcontainer-12731-systemd-test-default-dependencies.scope has no PIDs. Refusing.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Scope libcontainer-12731-systemd-test-default-dependencies.scope has no PIDs. Refusing.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Created slice libcontainer_12731_systemd_test_default.slice.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Starting libcontainer_12731_systemd_test_default.slice.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Removed slice libcontainer_12731_systemd_test_default.slice.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Stopping libcontainer_12731_systemd_test_default.slice.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Started libcontainer container gce-agents.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Starting libcontainer container gce-agents.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal oci-systemd-hook[12744]: systemdhook <error>: root not found in state: Success
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Stopped libcontainer container gce-agents.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Stopping libcontainer container gce-agents.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal oci-systemd-hook[12747]: systemdhook <error>: root not found in state: Success
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal runc[12731]: container_linux.go:265: starting container process caused "process_linux.go:339: container init caused \"process_linux.go:322: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: \\\"\""
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: gce-agents.service: main process exited, code=exited, status=1/FAILURE
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Unit gce-agents.service entered failed state.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: gce-agents.service failed.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: gce-agents.service holdoff time over, scheduling restart.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: start request repeated too quickly for gce-agents.service
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Failed to start Linux Guest Environment for Google Compute Engine.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: Unit gce-agents.service entered failed state.
Jul 20 12:50:12 centosah-1.c.ose-refarch.internal systemd[1]: gce-agents.service failed.
    "hooks": {
        "prestart": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook",
                "args": [ "oci-systemd-hook", "prestart" ]
            },
            {
                "path": "/usr/libexec/oci/hooks.d/oci-register-machine",
                "args": [ "oci-register-machine", "prestart" ]
            }
        ],
        "poststop": [
            {
                "path": "/usr/libexec/oci/hooks.d/oci-systemd-hook",
                "args": [ "oci-systemd-hook", "poststop" ]
            },
            {
                "path": "/usr/libexec/oci/hooks.d/oci-register-machine",
                "args": [ "oci-register-machine", "poststop" ]
            }
        ]
    },

pschiffe avatar Jul 20 '17 12:07 pschiffe

This means that the config file you generated does not indicate with the "root" is, oci-systemd-hook and oci-umount need to go to the root of the container to mount or umount content, can you add this to your config?

rhatdan avatar Jul 20 '17 12:07 rhatdan

@giuseppe we can not rely on the fact that he pid=0 or not, since runc specifies three ways to run a hook. prestart poststart poststop We can't tell the difference betwee a prestart and a poststart call. Since we are building this feature into cri-o, we don't want to ignore the specification.

rhatdan avatar Jul 20 '17 12:07 rhatdan

@rhatdan do you have an example how to specify the "root"? I can't find what you mean. I have in the config the following:

    "root": {
        "path": "rootfs",
        "readonly": true
    },

and in the process section, there is "cwd": "/",

pschiffe avatar Jul 20 '17 13:07 pschiffe

We can decide if be backward compatible and handle only prestart and poststop when the type is not specified. On the other hand there are not probably many users out there so we can just enforce it and not worry in the future of supporting this additional case.

Anyway the current development version support the case where the hook is not specified. The issue reported here depend on the hook version that is too old. The issue is fixed upstream with:

commit 69858facb2597d4bc0df3ed405876d6c167b943f Author: Daniel J Walsh [email protected] Date: Fri Jun 23 19:31:05 2017 +0000

Needs this change to work directly with runc and cri-o

The path bundle path passed to a container is called bundle, not bundlepath.
Also root is not in the state file, but can be retrieved from the config.json.

Signed-off-by: Daniel J Walsh <[email protected]>

Also this other commit is required, otherwise oci-systemd-hook will just segfault:

commit 40ab578cc3550b25c52e64e43d82a65d23e3365e Author: Jason Wessel [email protected] Date: Wed Jul 12 09:21:44 2017 -0700

Allow container definitions where rootfs is not an absolute path

In earlier versions of the runc frame work the rootfs path was passed
as a key with the initial json that was passed on the stdin and it was
automatically computed to be an absolute path.

This translation to an absolute path must be done in the hook based on
the bundlePath.  This allows the config.json to be relocated by the
container hosting system storage without modifying the config.json.

Signed-off-by: Jason Wessel <[email protected]>

giuseppe avatar Jul 20 '17 13:07 giuseppe

So we need to get this updated in RHEL7 package. @lsm5 Can you create a new oci-systemd-hook package for RHEL?

rhatdan avatar Jul 20 '17 13:07 rhatdan

rebased to 1e84754 in rhel 7.4 branch

fkluknav avatar Jul 20 '17 16:07 fkluknav

Would it be possible to also get it to the rhel atomic host 7.3?

pschiffe avatar Jul 21 '17 12:07 pschiffe

There is only on stream of extras, so if you install oci-umount after it gets shipped, it wil be placed into RHEL7.3

rhatdan avatar Jul 21 '17 12:07 rhatdan

Some of the discussion here is about how the hook decides which stage its running in. The most portable approach to that is to use status in stdin's state JSON (kubernetes-incubator/cri-o#1360). I have an open PR for that against the very-similar oci-umount: projectatomic/oci-umount#35. If/when that lands I'll cherry-pick it over here.

wking avatar Feb 22 '18 23:02 wking

something maybe useful: https://github.com/intel/cc-oci-runtime/issues/270

runc is passing root rather than bundlePath to the hooks. oci-systemd-hook is compounding the problem by requiring the erroneous root value.

caoruidong avatar Mar 02 '18 02:03 caoruidong