ERROR: LXC container name not set!

Open iglov opened this issue 2 years ago • 23 comments

OS: Debian 10/11/12. Kernel: 5.10.0-15-amd64 to 6.1.0-18-amd64. Environment (depends on Debian version):

  • resource-agents 1:4.7.0-1~bpo10+1, pacemaker 2.0.5-2, corosync 3.1.2-2, lxc 1:4.0.6-2
  • resource-agents 1:4.12.0-2, pacemaker 2.1.5-1+deb12u1, corosync 3.1.7-1, lxc 1:5.0.2-1+deb12u2

Just trying to add a new resource:

lxc-start -n front-2.fr
pcs resource create front-2.fr ocf:heartbeat:lxc config=/mnt/cluster_volumes/lxc2/front-2.fr/config container=front-2.fr

After ~5 minutes I want to remove it with pcs resource remove front-2.fr --force, get an error, and the cluster starts to migrate: Mar 29 23:28:51 cse2.fr lxc(front-2.fr)[2103391]: ERROR: LXC container name not set!

As far as I can see in /usr/lib/ocf/resource.d/heartbeat/lxc, the error is raised when the agent can't get the OCF_RESKEY_container variable. This bug only shows up on clusters that have been running without a reboot for a long time. For example, after fencing I can add/remove LXC resources and everything works fine for a while.
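For reference, the guard in that agent looks roughly like this (a paraphrased sketch of LXC_validate() in heartbeat/lxc.in, not the verbatim code; the exact message wording and exit code may differ):

LXC_validate() {
    # the agent bails out here when the 'container' parameter never reaches it
    if [ -z "$OCF_RESKEY_container" ]; then
        ocf_exit_reason "LXC container name not set!"
        exit $OCF_ERR_CONFIGURED
    fi
    # ... further checks (config file, lxc binaries, ...) follow
}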

The question is: why? And how to debug it?

iglov avatar Mar 30 '23 09:03 iglov

This might be due to the probe-action.

You can try changing https://github.com/ClusterLabs/resource-agents/blob/fe1a2f88ac32dfaba86baf995094e2b4fa0d8def/heartbeat/lxc.in#L343 to ocf_is_probe || LXC_validate.
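For illustration, the change amounts to this (a minimal sketch; only the single line at lxc.in#L343 is shown):

# current line at heartbeat/lxc.in#L343: validation runs for every action
LXC_validate

# suggested change: skip validation for probe operations
ocf_is_probe || LXC_validate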

oalbrigt avatar Mar 30 '23 10:03 oalbrigt

Seems like the agent already takes care of probe-actions, so I'll have to investigate further what might cause it.

oalbrigt avatar Mar 30 '23 10:03 oalbrigt

Hey @oalbrigt, thanks for the reply!

to ocf_is_probe || LXC_validate.

Yep, of course I can try, but what's the point if, as we can see, the OCF_RESKEY_container variable doesn't exist or the agent simply doesn't know anything about it? So even if I try it, the agent won't stop the container here, for the same reason: https://github.com/ClusterLabs/resource-agents/blob/fe1a2f88ac32dfaba86baf995094e2b4fa0d8def/heartbeat/lxc.in#L184

iglov avatar Mar 30 '23 10:03 iglov

@kgaillot Do you know what might cause OCF_RESKEY_ variables not being set when doing pcs resource remove --force?

oalbrigt avatar Mar 31 '23 07:03 oalbrigt

@kgaillot Do you know what might cause OCF_RESKEY_ variables not being set when doing pcs resource remove --force?

No, that's odd. Was the command tried without --force first? It shouldn't normally be necessary, so if it was, that might point to an issue.

kgaillot avatar Apr 03 '23 18:04 kgaillot

Hey @kgaillot, thanks for the reply! Nope, without --force the result is the same.

iglov avatar Apr 03 '23 18:04 iglov

@iglov @oalbrigt , can one of you try dumping the environment to a file from within the stop command? Are no OCF variables set, or is just that one missing?

kgaillot avatar Apr 03 '23 18:04 kgaillot

Well, I can try if you tell me how to do that and if I find a cluster in the same state.

iglov avatar Apr 03 '23 18:04 iglov

Something like env > /run/lxc.env in the agent's stop action

kgaillot avatar Apr 03 '23 19:04 kgaillot

Oh, you mean I should place env > /run/lxc.env somewhere in /usr/lib/ocf/resource.d/heartbeat/lxc inside LXC_stop() { ... }? But that won't work because: 1. it dies before LXC_stop(), in LXC_validate(); 2. after fencing the node reboots and /run gets unmounted. So I think it would be better to put env > /root/lxc.env in LXC_validate(). If that's correct, I'll try it when I find a cluster with this bug.
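i.e. something like this (a sketch; the rest of the function stays untouched):

LXC_validate() {
    # temporary debug: dump the environment the agent is invoked with;
    # /root instead of /run so the file survives a fencing reboot
    env > /root/lxc.env

    # ... existing validation checks continue below ...
}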

iglov avatar Apr 03 '23 19:04 iglov

That sounds right

kgaillot avatar Apr 03 '23 21:04 kgaillot

Hey guys! I got it. Tried to stop container nsa-1.ny with pcs resource remove nsa-1.ny --force and got some debug output:

OCF_ROOT=/usr/lib/ocf
OCF_RESKEY_crm_feature_set=3.1.0
HA_LOGFACILITY=daemon
PCMK_debug=0
HA_debug=0
PWD=/var/lib/pacemaker/cores
HA_logfacility=daemon
OCF_EXIT_REASON_PREFIX=ocf-exit-reason:
OCF_RESOURCE_PROVIDER=heartbeat
PCMK_service=pacemaker-execd
PCMK_mcp=true
OCF_RA_VERSION_MAJOR=1
VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all
HA_cluster_type=corosync
INVOCATION_ID=5d3831d43d924a08a3dad6f49613e661
OCF_RESOURCE_INSTANCE=nsa-1.ny
HA_quorum_type=corosync
OCF_RA_VERSION_MINOR=0
HA_mcp=true
PCMK_quorum_type=corosync
SHLVL=1
OCF_RESKEY_CRM_meta_on_node=mfs4.ny.local
PCMK_watchdog=false
OCF_RESKEY_CRM_meta_timeout=20000
OCF_RESOURCE_TYPE=lxc
PCMK_logfacility=daemon
LC_ALL=C
JOURNAL_STREAM=9:36160
OCF_RESKEY_CRM_meta_on_node_uuid=2
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb
PCMK_cluster_type=corosync
_=/usr/bin/env

And this is how it should look:

OCF_ROOT=/usr/lib/ocf
OCF_RESKEY_crm_feature_set=3.1.0
HA_LOGFACILITY=daemon
PCMK_debug=0
HA_debug=0
PWD=/var/lib/pacemaker/cores
HA_logfacility=daemon
OCF_EXIT_REASON_PREFIX=ocf-exit-reason:
OCF_RESOURCE_PROVIDER=heartbeat
PCMK_service=pacemaker-execd
PCMK_mcp=true
OCF_RA_VERSION_MAJOR=1
VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all
HA_cluster_type=corosync
INVOCATION_ID=b062591edd5142bd952b5ecc4f86b493
OCF_RESKEY_CRM_meta_interval=30000
OCF_RESOURCE_INSTANCE=nsa-1.ny
HA_quorum_type=corosync
OCF_RA_VERSION_MINOR=0
HA_mcp=true
OCF_RESKEY_config=/mnt/cluster_volumes/lxc2/nsa-1.ny/config
PCMK_quorum_type=corosync
OCF_RESKEY_CRM_meta_name=monitor
SHLVL=1
OCF_RESKEY_container=nsa-1.ny
OCF_RESKEY_CRM_meta_on_node=mfs4.ny.local
PCMK_watchdog=false
OCF_RESKEY_CRM_meta_timeout=20000
OCF_RESOURCE_TYPE=lxc
PCMK_logfacility=daemon
LC_ALL=C
JOURNAL_STREAM=9:44603
OCF_RESKEY_CRM_meta_on_node_uuid=2
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb
PCMK_cluster_type=corosync
_=/usr/bin/env

As you can see, some variables like OCF_RESKEY_container or OCF_RESKEY_config are missing.
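For reference, a quick way to see the gap at a glance (the two file names are just examples for the broken and the healthy dump):

sort lxc.env.broken > /tmp/broken.sorted
sort lxc.env.ok > /tmp/ok.sorted
# show only the OCF_RESKEY_* lines that differ between the two dumps
diff /tmp/broken.sorted /tmp/ok.sorted | grep OCF_RESKEY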

Any ideas? ^_^

iglov avatar Feb 06 '24 10:02 iglov

That's strange. Did you create it without specifying container=<container name> and using -f to force it? What does your pcs resource config output say?

oalbrigt avatar Feb 06 '24 12:02 oalbrigt

Yes, it's very, VERY strange. I create resources with pcs resource create test ocf:heartbeat:lxc container=test config=/mnt/cluster_volumes/lxc1/test/config (you can see it in the issue description), BUT it does not matter, because as I said earlier:

This bug only shows up on clusters that have been running without a reboot for a long time. For example, after fencing I can add/remove LXC resources and everything works fine for a while.

As you can see, almost a year passed before the bug appeared. This means I can create the resource with ANY method and it WILL work correctly until... something goes wrong. With pcs resource config everything looks good:

  Resource: nsa-1.ny (class=ocf provider=heartbeat type=lxc)
   Attributes: config=/mnt/cluster_volumes/lxc2/nsa-1.ny/config container=nsa-1.ny
   Operations: monitor interval=30s timeout=20s (nsa-1.ny-monitor-interval-30s)
               start interval=0s timeout=60s (nsa-1.ny-start-interval-0s)
               stop interval=0s timeout=60s (nsa-1.ny-stop-interval-0s)

So-o-o-o, I have no idea how to debug it further :(

iglov avatar Feb 06 '24 12:02 iglov

Can you add the output from rpm -qa | grep pacemaker? So I can have our Pacemaker devs see if this is a known issue.

oalbrigt avatar Feb 06 '24 12:02 oalbrigt

Yep, sure, but I have it on Debian:

# dpkg -l | grep pacemaker
ii  pacemaker                            2.0.1-5                      amd64        cluster resource manager
ii  pacemaker-cli-utils                  2.0.1-5                      amd64        cluster resource manager command line utilities
ii  pacemaker-common                     2.0.1-5                      all          cluster resource manager common files
ii  pacemaker-resource-agents            2.0.1-5                      all          cluster resource manager general resource agents

# dpkg -l | grep corosync
ii  corosync                             3.0.1-2+deb10u1              amd64        cluster engine daemon and utilities
ii  corosync-qdevice                     3.0.0-4+deb10u1              amd64        cluster engine quorum device daemon
ii  libcorosync-common4:amd64            3.0.1-2+deb10u1              amd64        cluster engine common library

# dpkg -l | grep resource-agents
ii  pacemaker-resource-agents            2.0.1-5                      all          cluster resource manager general resource agents
ii  resource-agents                      1:4.7.0-1~bpo10+1            amd64        Cluster Resource Agents

# dpkg -l | grep lxc
ii  liblxc1                              1:3.1.0+really3.0.3-8        amd64        Linux Containers userspace tools (library)
ii  lxc                                  1:3.1.0+really3.0.3-8        amd64        Linux Containers userspace tools
ii  lxc-templates                        3.0.4-0+deb10u1              amd64        Linux Containers userspace tools (templates)
ii  lxcfs                                3.0.3-2                      amd64        FUSE based filesystem for LXC

iglov avatar Feb 06 '24 12:02 iglov

@iglov That is extremely odd. If you still have the logs from when that occurred, can you open a bug at bugs.clusterlabs.org and attach the output of crm_report -S --from="YYYY-M-D H:M:S" --to="YYYY-M-D H:M:S" from each node, covering the half hour or so around when the failed stop happened?

kgaillot avatar Feb 06 '24 15:02 kgaillot

I would like to, but I can't, because there is a lot of business-sensitive information like hostnames, common logs, process lists, even DRBD passwords :(

iglov avatar Feb 06 '24 20:02 iglov

I would like to, but I can't, because there is a lot of business-sensitive information like hostnames, common logs, process lists, even DRBD passwords :(

It would be helpful to at least get the scheduler input that led to the problem. At the time the problem occurred, one of the nodes was the designated controller (DC). It will have a log message like "Calculated transition ... saving inputs in ...". The last message before the problem occurred is the interesting one, and the file name is the input. You can uncompress it and edit out any sensitive information, then email it to [email protected].
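For example, on the node that was DC at the time, something like this should point at the right file (the log path is an assumption; depending on the setup the message may be in syslog instead):

grep 'Calculated transition' /var/log/pacemaker/pacemaker.log | tail -n 5
# the 'saving inputs in /var/lib/pacemaker/pengine/pe-input-NNN.bz2' part of the
# last match before the failed stop names the input file to look at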

kgaillot avatar Feb 07 '24 17:02 kgaillot

Alternatively you can investigate the file yourself. I'd start with checking the resource configuration and make sure the resource parameters are set correctly there. If they're not, someone or something likely modified the configuration. If they are, the next thing I'd try is crm_simulate -Sx $FILENAME -G graph.xml. The command output should show a stop scheduled on the old node and a start scheduled on the new node (if not, you probably have the wrong input). The graph.xml file should have <rsc_op> entries for the stop and start with all the parameters that will be passed to the agent.
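A concrete walk-through of that, with placeholder file names, could look like:

# copy and uncompress the scheduler input (default pengine directory; the number is a placeholder)
cp /var/lib/pacemaker/pengine/pe-input-250.bz2 /tmp/
bunzip2 /tmp/pe-input-250.bz2

# re-run the scheduler on it and save the transition graph
crm_simulate -Sx /tmp/pe-input-250 -G /tmp/graph.xml

# the stop/start <rsc_op> entries in the graph should carry the resource parameters
grep -A3 'operation="stop"' /tmp/graph.xml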

kgaillot avatar Feb 07 '24 17:02 kgaillot

Hey @kgaillot! Thanks for the explanations and your time! Well, I have something like this there:

<!-- synapses 0-5 are about stonith -->

<synapse id="6">
  <action_set>
    <rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </rsc_op>
  </action_set>
  <inputs/>
</synapse>
<synapse id="7">
  <action_set>
    <rsc_op id="33" operation="delete" operation_key="nsa-1.ny_delete_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </rsc_op>
  </action_set>
  <inputs>
    <trigger>
      <rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2"/>
    </trigger>
  </inputs>
</synapse>
<synapse id="8">
  <action_set>
    <rsc_op id="31" operation="delete" operation_key="nsa-1.ny_delete_0" on_node="mfs3.ny.local.priv" on_node_uuid="1">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs3.ny.local.priv" CRM_meta_on_node_uuid="1" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </rsc_op>
  </action_set>
  <inputs>
    <trigger>
      <rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2"/>
    </trigger>
  </inputs>
</synapse>
<synapse id="9">
  <action_set>
    <crm_event id="26" operation="clear_failcount" operation_key="nsa-1.ny_clear_failcount_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_op_no_wait="true" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </crm_event>
  </action_set>
  <inputs/>
</synapse>

Looks good, doesn't it? I don't see anything wrong here. But if you still want, I can try to send you these pe-input files.

iglov avatar Feb 07 '24 21:02 iglov

No, something's wrong. The resource parameters should be listed in <attributes> after the meta-attributes (like config="/mnt/cluster_volumes/lxc2/nsa-1.ny/config" container="nsa-1.ny"). Check the corresponding pe-input to see if those are properly listed under the relevant <primitive>.
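For example, a quick check on the uncompressed pe-input (file name is a placeholder; in the CIB the parameters are stored as nvpair entries under the primitive's instance_attributes):

grep -A10 '<primitive id="nsa-1.ny"' /tmp/pe-input-250 | grep -E 'name="(config|container)"'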

kgaillot avatar Feb 08 '24 15:02 kgaillot

Yep, sorry, you're right, my bad. I tried to find resource nsa-1.ny in pe-input-250 (the last one before the failure) and that primitive isn't there at all. But it is in pe-input-249. Poof, it just disappeared...

iglov avatar Feb 08 '24 16:02 iglov