ERROR: LXC container name not set!
OS: Debian 10/11/12, kernel 5.10.0-15-amd64 to 6.1.0-18-amd64. Environment (depends on the Debian version):
- resource-agents 1:4.7.0-1~bpo10+1, pacemaker 2.0.5-2, corosync 3.1.2-2, lxc 1:4.0.6-2
- resource-agents 1:4.12.0-2, pacemaker 2.1.5-1+deb12u1, corosync 3.1.7-1, lxc 1:5.0.2-1+deb12u2
I'm just trying to add a new resource:
lxc-start -n front-2.fr
pcs resource create front-2.fr ocf:heartbeat:lxc config=/mnt/cluster_volumes/lxc2/front-2.fr/config container=front-2.fr
After ~5 minutes I want to remove it:
pcs resource remove front-2.fr --force
I get an error and the cluster starts to migrate:
Mar 29 23:28:51 cse2.fr lxc(front-2.fr)[2103391]: ERROR: LXC container name not set!
As far as I can see in /usr/lib/ocf/resource.d/heartbeat/lxc, the error is raised when the agent cannot get the OCF_RESKEY_container variable.
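For reference, the guard that produces this message looks roughly like the following (paraphrased, not a verbatim copy of the agent's code):

# paraphrased sketch of the check in LXC_validate() in /usr/lib/ocf/resource.d/heartbeat/lxc
if [ -z "$OCF_RESKEY_container" ]; then
    ocf_log err "LXC container name not set!"
    exit $OCF_ERR_CONF   # an errored stop like this is what makes the cluster fence the node
fi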
This bug only shows up on clusters that have been running for a long time without a reboot. For example, after fencing I can add/remove LXC resources and everything works fine for a while.
The question is: why? And how can I debug it?
This might be due to the probe-action.
You can try changing https://github.com/ClusterLabs/resource-agents/blob/fe1a2f88ac32dfaba86baf995094e2b4fa0d8def/heartbeat/lxc.in#L343
to ocf_is_probe || LXC_validate.
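I.e. something along these lines (a sketch of the proposed one-line change, context paraphrased):

# current: validation runs for every action, including probes
LXC_validate

# proposed: skip validation when the action is a probe
ocf_is_probe || LXC_validate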
Seems like the agent already takes care of probe-actions, so I'll have to investigate further what might cause it.
Hey @oalbrigt, thanks for the reply!
Yep, of course I can try, but what's the point? As we can see, the OCF_RESKEY_container variable doesn't exist, or the agent simply doesn't know anything about it. So even if I try it, it won't stop the container, for the same reason: https://github.com/ClusterLabs/resource-agents/blob/fe1a2f88ac32dfaba86baf995094e2b4fa0d8def/heartbeat/lxc.in#L184
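Paraphrasing what that stop path boils down to (not the exact agent code), it depends on the very same variable:

# with OCF_RESKEY_container empty there is no container name to pass here
lxc-stop -n "$OCF_RESKEY_container"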
@kgaillot Do you know what might cause OCF_RESKEY_ variables not being set when doing pcs resource remove --force?
No, that's odd. Was the command tried without --force first? It shouldn't normally be necessary, so if it was, that might point to an issue.
Hey @kgaillot, thanks for the reply!
Nope, without --force the result is the same.
@iglov @oalbrigt, can one of you try dumping the environment to a file from within the stop command? Are no OCF variables set, or is it just that one missing?
Well, I can try, if you tell me how to do that and if I find a cluster in the same state.
Something like env > /run/lxc.env in the agent's stop action
Oh, you mean I should place env > /run/lxc.env somewhere in /usr/lib/ocf/resource.d/heartbeat/lxc, inside LXC_stop() { ... }? But that won't work, because: 1. it dies before LXC_stop(), in LXC_validate(); 2. after fencing the node reboots and /run gets unmounted. So I think it would be better to put env > /root/lxc.env in LXC_validate().
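A minimal sketch of what I mean (the dump location is just my choice):

LXC_validate() {
    # temporary debugging: dump the environment the agent was invoked with;
    # /root survives a reboot, unlike the tmpfs /run
    env > /root/lxc.env

    # ... the rest of the original validation stays unchanged ...
}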
If that's all correct, I'll try it the next time I catch a cluster in this state.
That sounds right
Hey guys! I got it. I tried to stop the container nsa-1.ny with pcs resource remove nsa-1.ny --force and got some debug output:
OCF_ROOT=/usr/lib/ocf
OCF_RESKEY_crm_feature_set=3.1.0
HA_LOGFACILITY=daemon
PCMK_debug=0
HA_debug=0
PWD=/var/lib/pacemaker/cores
HA_logfacility=daemon
OCF_EXIT_REASON_PREFIX=ocf-exit-reason:
OCF_RESOURCE_PROVIDER=heartbeat
PCMK_service=pacemaker-execd
PCMK_mcp=true
OCF_RA_VERSION_MAJOR=1
VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all
HA_cluster_type=corosync
INVOCATION_ID=5d3831d43d924a08a3dad6f49613e661
OCF_RESOURCE_INSTANCE=nsa-1.ny
HA_quorum_type=corosync
OCF_RA_VERSION_MINOR=0
HA_mcp=true
PCMK_quorum_type=corosync
SHLVL=1
OCF_RESKEY_CRM_meta_on_node=mfs4.ny.local
PCMK_watchdog=false
OCF_RESKEY_CRM_meta_timeout=20000
OCF_RESOURCE_TYPE=lxc
PCMK_logfacility=daemon
LC_ALL=C
JOURNAL_STREAM=9:36160
OCF_RESKEY_CRM_meta_on_node_uuid=2
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb
PCMK_cluster_type=corosync
_=/usr/bin/env
And this is how it should look:
OCF_ROOT=/usr/lib/ocf
OCF_RESKEY_crm_feature_set=3.1.0
HA_LOGFACILITY=daemon
PCMK_debug=0
HA_debug=0
PWD=/var/lib/pacemaker/cores
HA_logfacility=daemon
OCF_EXIT_REASON_PREFIX=ocf-exit-reason:
OCF_RESOURCE_PROVIDER=heartbeat
PCMK_service=pacemaker-execd
PCMK_mcp=true
OCF_RA_VERSION_MAJOR=1
VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all
HA_cluster_type=corosync
INVOCATION_ID=b062591edd5142bd952b5ecc4f86b493
OCF_RESKEY_CRM_meta_interval=30000
OCF_RESOURCE_INSTANCE=nsa-1.ny
HA_quorum_type=corosync
OCF_RA_VERSION_MINOR=0
HA_mcp=true
OCF_RESKEY_config=/mnt/cluster_volumes/lxc2/nsa-1.ny/config
PCMK_quorum_type=corosync
OCF_RESKEY_CRM_meta_name=monitor
SHLVL=1
OCF_RESKEY_container=nsa-1.ny
OCF_RESKEY_CRM_meta_on_node=mfs4.ny.local
PCMK_watchdog=false
OCF_RESKEY_CRM_meta_timeout=20000
OCF_RESOURCE_TYPE=lxc
PCMK_logfacility=daemon
LC_ALL=C
JOURNAL_STREAM=9:44603
OCF_RESKEY_CRM_meta_on_node_uuid=2
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb
PCMK_cluster_type=corosync
_=/usr/bin/env
As you can see, some variables such as OCF_RESKEY_container and OCF_RESKEY_config are missing.
Any ideas? ^_^
That's strange. Did you create it without specifying container=<container name> and use -f to force it? What does your pcs resource config output say?
Yes, it's very, VERY strange. I create resources with pcs resource create test ocf:heartbeat:lxc container=test config=/mnt/cluster_volumes/lxc1/test/config (as shown at the top of this issue), BUT it doesn't matter, because as I said earlier:
This bug only shows up on clusters that have been running for a long time without a reboot. For example, after fencing I can add/remove LXC resources and everything works fine for a while.
As you can see, almost a year passed before the bug appeared. That means I can create the resource with ANY method and it WILL work correctly until... something goes wrong.
With pcs resource config everything looks fine:
Resource: nsa-1.ny (class=ocf provider=heartbeat type=lxc)
Attributes: config=/mnt/cluster_volumes/lxc2/nsa-1.ny/config container=nsa-1.ny
Operations: monitor interval=30s timeout=20s (nsa-1.ny-monitor-interval-30s)
start interval=0s timeout=60s (nsa-1.ny-start-interval-0s)
stop interval=0s timeout=60s (nsa-1.ny-stop-interval-0s)
Soo-o-o-o, I have no idea how to debug it further :(
Can you add the output from rpm -qa | grep pacemaker, so I can have our Pacemaker devs see if this is a known issue?
Yep, sure, but I have it on Debian:
# dpkg -l | grep pacemaker
ii pacemaker 2.0.1-5 amd64 cluster resource manager
ii pacemaker-cli-utils 2.0.1-5 amd64 cluster resource manager command line utilities
ii pacemaker-common 2.0.1-5 all cluster resource manager common files
ii pacemaker-resource-agents 2.0.1-5 all cluster resource manager general resource agents
# dpkg -l | grep corosync
ii corosync 3.0.1-2+deb10u1 amd64 cluster engine daemon and utilities
ii corosync-qdevice 3.0.0-4+deb10u1 amd64 cluster engine quorum device daemon
ii libcorosync-common4:amd64 3.0.1-2+deb10u1 amd64 cluster engine common library
# dpkg -l | grep resource-agents
ii pacemaker-resource-agents 2.0.1-5 all cluster resource manager general resource agents
ii resource-agents 1:4.7.0-1~bpo10+1 amd64 Cluster Resource Agents
# dpkg -l | grep lxc
ii liblxc1 1:3.1.0+really3.0.3-8 amd64 Linux Containers userspace tools (library)
ii lxc 1:3.1.0+really3.0.3-8 amd64 Linux Containers userspace tools
ii lxc-templates 3.0.4-0+deb10u1 amd64 Linux Containers userspace tools (templates)
ii lxcfs 3.0.3-2 amd64 FUSE based filesystem for LXC
@iglov That is extremely odd. If you still have the logs from when that occurred, can you open a bug at bugs.clusterlabs.org and attach the output of crm_report -S --from="YYYY-M-D H:M:S" --to="YYYY-M-D H:M:S" from each node, covering the half hour or so around when the failed stop happened?
I would like to, but I can't, because there is a lot of business-sensitive information in there: hostnames, general logs, process lists, even DRBD passwords :(
It would be helpful to at least get the scheduler input that led to the problem. At the time the problem occurred, one of the nodes was the designated controller (DC). It will have a log message like "Calculated transition ... saving inputs in ...". The last message before the problem occurred is the interesting one, and the file name is the input. You can uncompress it and edit out any sensitive information, then email it to [email protected].
Alternatively you can investigate the file yourself. I'd start with checking the resource configuration and make sure the resource parameters are set correctly there. If they're not, someone or something likely modified the configuration. If they are, the next thing I'd try is crm_simulate -Sx $FILENAME -G graph.xml. The command output should show a stop scheduled on the old node and a start scheduled on the new node (if not, you probably have the wrong input). The graph.xml file should have <rsc_op> entries for the stop and start with all the parameters that will be passed to the agent.
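Concretely, on the node that was DC at the time, that would look roughly like this (a sketch; the log and pengine paths are the usual Debian defaults, and the pe-input file name is only an example):

# find the last "Calculated transition ... saving inputs in ..." before the failed stop
grep "Calculated transition" /var/log/pacemaker/pacemaker.log | tail -n 5

# copy and uncompress that scheduler input (example file name)
cp /var/lib/pacemaker/pengine/pe-input-250.bz2 /tmp/
bunzip2 /tmp/pe-input-250.bz2

# replay it and save the transition graph that would have been executed
crm_simulate -Sx /tmp/pe-input-250 -G graph.xml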
Hey @kgaillot! Thanks for the explanations and your time! Well, I have something like this in there:
# 0-5 synapses about stonith
<synapse id="6">
<action_set>
<rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
<primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
<attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
</rsc_op>
</action_set>
<inputs/>
</synapse>
<synapse id="7">
<action_set>
<rsc_op id="33" operation="delete" operation_key="nsa-1.ny_delete_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
<primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
<attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
</rsc_op>
</action_set>
<inputs>
<trigger>
<rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2"/>
</trigger>
</inputs>
</synapse>
<synapse id="8">
<action_set>
<rsc_op id="31" operation="delete" operation_key="nsa-1.ny_delete_0" on_node="mfs3.ny.local.priv" on_node_uuid="1">
<primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
<attributes CRM_meta_on_node="mfs3.ny.local.priv" CRM_meta_on_node_uuid="1" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
</rsc_op>
</action_set>
<inputs>
<trigger>
<rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2"/>
</trigger>
</inputs>
</synapse>
<synapse id="9">
<action_set>
<crm_event id="26" operation="clear_failcount" operation_key="nsa-1.ny_clear_failcount_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
<primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
<attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_op_no_wait="true" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
</crm_event>
</action_set>
<inputs/>
</synapse>
Looks good, doesn't it? I don't see anything wrong here. But if you still want, I can try to send you these pe-input files.
No, something's wrong. The resource parameters should be listed in <attributes> after the meta-attributes (like config="/mnt/cluster_volumes/lxc2/nsa-1.ny/config" container="nsa-1.ny"). Check the corresponding pe-input to see if those are properly listed under the relevant <primitive>.
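For example, a quick way to check (the file name is an example, and the XML in the comment is only a sketch of how the parameters normally appear):

# the <primitive> in the pe-input's <configuration> section should carry the
# resource parameters as instance attributes, roughly:
#   <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc">
#     <instance_attributes id="nsa-1.ny-instance_attributes">
#       <nvpair ... name="config" value="/mnt/cluster_volumes/lxc2/nsa-1.ny/config"/>
#       <nvpair ... name="container" value="nsa-1.ny"/>
#     </instance_attributes>
#   </primitive>
grep -A6 '<primitive id="nsa-1.ny"' pe-input-250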
Yep, sorry, you're right, my bad. I tried to find the resource nsa-1.ny in pe-input-250 (the last one before the fuckup) and that primitive isn't there at all. But it is in pe-input-249. Poof, it just disappeared...