badfish
[BUG] Issue with batch processing host list from the container
Your System Details
- Operating System: RHEL 7.8
- Target System Type: Dell
- Podman version: 1.6.4
Describe the bug
This issue occurs intermittently on Dell hosts. When running boot-order actions in a batch through the --host-list parameter, the action fails on some hosts.
[sanjay@perfc-360g8-04 jetpack]$ podman run -it -v /home/sanjay/jetpack/badfish:/dell --rm quay.io/quads/badfish --host-list /dell/dell-hosts -u quads -p <password> -i config/idrac_interfaces.yml --check-boot
[mgmt-e24-h25-740xd] - INFO - Executing actions on host: mgmt-e24-h25-740xd.example.com
[mgmt-e24-h25-740xd] - WARNING - Current boot order is set to: director.
[mgmt-e24-h25-740xd] - INFO - ************************************************
[mgmt-e24-h27-740xd] - INFO - Executing actions on host: mgmt-e24-h27-740xd.example.com
[mgmt-e24-h27-740xd] - WARNING - Current boot order is set to: director.
[mgmt-e24-h27-740xd] - INFO - ************************************************
[mgmt-e24-h29-740xd] - ERROR - Failed to communicate with mgmt-e24-h29-740xd.example.com
[mgmt-e24-h29-740xd] - INFO - ************************************************
[mgmt-e24-h31-740xd] - ERROR - Failed to communicate with mgmt-e24-h31-740xd.example.com
[mgmt-e24-h31-740xd] - INFO - ************************************************
[mgmt-e24-h33-740xd] - INFO - Executing actions on host: mgmt-e24-h33-740xd.example.com
[mgmt-e24-h33-740xd] - WARNING - Current boot order is set to: director.
[mgmt-e24-h33-740xd] - INFO - ************************************************
[src.badfish.helpers.logger] - INFO - RESULTS:
[src.badfish.helpers.logger] - INFO - mgmt-e24-h25-740xd.alias.bos.scalelab.redhat.com: SUCCESSFUL
[src.badfish.helpers.logger] - INFO - mgmt-e24-h27-740xd.alias.bos.scalelab.redhat.com: SUCCESSFUL
[src.badfish.helpers.logger] - INFO - mgmt-e24-h29-740xd.alias.bos.scalelab.redhat.com: FAILED
[src.badfish.helpers.logger] - INFO - mgmt-e24-h31-740xd.alias.bos.scalelab.redhat.com: FAILED
[src.badfish.helpers.logger] - INFO - mgmt-e24-h33-740xd.alias.bos.scalelab.redhat.com: SUCCESSFUL
However, if the same action is run individually through the badfish Python script on the failed hosts, it succeeds.
(venv) (base) [schari@schari badfish]$ python3 src/badfish/badfish.py -H mgmt-e24-h31-740xd.example.com -u quads -p <password> -i config/idrac_interfaces.yml --check-boot
- WARNING - Current boot order is set to: director.
(venv) (base) [schari@schari badfish]$ python3 src/badfish/badfish.py -H mgmt-e24-h31-740xd.example.com -u quads -p <password> -i config/idrac_interfaces.yml --check-boot
- WARNING - Current boot order is set to: director.
After some time, the same action also succeeds through batch processing in the container. However, this happens long after the action first succeeds on individual hosts through the Python script.
Expected Behavior
Batch processing of hosts through the badfish container should return the same results, at the same time, as processing the same hosts individually through the badfish Python script.
The iDRAC appears to become unresponsive while containerized badfish performs bulk actions, which I think is why it reports "Failed to communicate with" for those hosts.
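To compare batch and individual runs, it can help to pull the failed hosts out of the RESULTS summary automatically. The following is a minimal sketch, assuming the summary lines keep the `<host>: SUCCESSFUL|FAILED` shape shown in the log above; `parse_results` and `failed_hosts` are hypothetical helper names, not part of badfish.

```python
import re

# Matches summary lines such as:
#   [src.badfish.helpers.logger] - INFO - host.example.com: FAILED
# The line shape is an assumption taken from the log output in this report.
RESULT_LINE = re.compile(r"- INFO - (?P<host>\S+): (?P<status>SUCCESSFUL|FAILED)\s*$")


def parse_results(log_text):
    """Return {host: status} for every summary line found in the log text."""
    results = {}
    for line in log_text.splitlines():
        match = RESULT_LINE.search(line)
        if match:
            results[match.group("host")] = match.group("status")
    return results


def failed_hosts(log_text):
    """Return the hosts reported as FAILED, in the order they appear."""
    return [host for host, status in parse_results(log_text).items() if status == "FAILED"]


sample = """\
[mgmt-e24-h29-740xd] - ERROR - Failed to communicate with mgmt-e24-h29-740xd.example.com
[src.badfish.helpers.logger] - INFO - RESULTS:
[src.badfish.helpers.logger] - INFO - mgmt-e24-h25-740xd.example.com: SUCCESSFUL
[src.badfish.helpers.logger] - INFO - mgmt-e24-h29-740xd.example.com: FAILED
[src.badfish.helpers.logger] - INFO - mgmt-e24-h31-740xd.example.com: FAILED
"""
print(failed_hosts(sample))
```

The resulting list can then be fed one host at a time to the `-H` form of the command to confirm whether the failure is specific to the batch run.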
Hey @sanjaychari in your command example I don't see you mapping your --host-list target file anywhere, is that being mapped? e.g. -v /tmp/my-hosts:/tmp/my-hosts:z so you can refer to it with --host-list /tmp/my-hosts for example.
On iDRAC instability, they are just fragile. It's recommended to reboot them via --racreset before doing bulk operations on them to alleviate problems. They aren't designed to take a lot of heavy API calls, unfortunately, and there's not a lot we can do about that.
Hi @sadsfae,
The hosts file dell-hosts is at /home/sanjay/jetpack/badfish on the host. I am mounting the entire directory into the badfish container with -v /home/sanjay/jetpack/badfish:/dell and then passing the host list using --host-list /dell/dell-hosts.
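The directory mapping described here can be sanity-checked before a run. This is a rough sketch, assuming volume specs in the `host_dir:container_dir[:options]` form used in this thread; `parse_volume` and `host_path_for` are hypothetical helpers written for illustration and do not call podman.

```python
from pathlib import PurePosixPath


def parse_volume(spec):
    """Split a '-v' value 'host_dir:container_dir[:options]' into its two paths."""
    parts = spec.split(":")
    if len(parts) < 2:
        raise ValueError(f"not a host:container volume spec: {spec!r}")
    return PurePosixPath(parts[0]), PurePosixPath(parts[1])


def host_path_for(container_path, volume_specs):
    """Return the host path backing container_path, or None if no volume covers it."""
    target = PurePosixPath(container_path)
    for spec in volume_specs:
        host_dir, container_dir = parse_volume(spec)
        try:
            # relative_to raises ValueError when target is not under container_dir
            relative = target.relative_to(container_dir)
        except ValueError:
            continue
        return host_dir / relative
    return None


volumes = ["/home/sanjay/jetpack/badfish:/dell"]
print(host_path_for("/dell/dell-hosts", volumes))
print(host_path_for("/tmp/my-hosts", volumes))
```

A result of `None` means the path given to --host-list is not covered by any mounted volume, which is the situation the question above was checking for.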
We already perform a racreset on all systems in Jetpack before setting the boot order (https://github.com/redhat-performance/jetpack/blob/b48d6bbe632a4f774605467c449ca2f68928f7cf/set_boot_order.yml#L11-L26).
@sanjaychari ah ok, maybe you can show us or we can look at the hosts internally? What part of your playbook is returning the unavailable return status? cc: @grafuls
Is the behavior sporadic in that sometimes it works and sometimes it doesn't?
Have you also tried forcefully clearing the iDRAC job queue via --clear-jobs --force? Give this a try too and then reboot any that don't return what you expect.
If the behavior is sporadic, it's probably not a badfish issue, otherwise it wouldn't work at all, but maybe we can isolate and remedy the behavior on the iDRAC side.
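One way to probe sporadic failures like this is to retry only the hosts that failed, with a pause between rounds so the iDRACs are not hit back to back. The sketch below is an illustration under assumptions, not something badfish provides: `retry_hosts`, `build_command`, and `run_with_subprocess` are hypothetical names, the command uses only flags that appear in this thread, and treating a zero exit code as success is an assumption about how the script reports failure.

```python
import subprocess
import time


def build_command(host, user, password, interfaces="config/idrac_interfaces.yml"):
    """Per-host --check-boot command, built only from flags shown in this thread."""
    return ["python3", "src/badfish/badfish.py", "-H", host,
            "-u", user, "-p", password, "-i", interfaces, "--check-boot"]


def run_with_subprocess(command):
    """Run one command; assumes (unverified) that exit code 0 means success."""
    return subprocess.run(command).returncode == 0


def retry_hosts(hosts, run, attempts=3, delay=30.0, sleep=time.sleep):
    """Call run(host) for each host, retrying failures up to `attempts` rounds.

    Waits `delay` seconds before every round after the first and returns
    the hosts that never succeeded.
    """
    pending = list(hosts)
    for attempt in range(attempts):
        if not pending:
            break
        if attempt > 0:
            sleep(delay)
        pending = [host for host in pending if not run(host)]
    return pending


# Demonstration with a fake runner: "h29" succeeds on its second try,
# "h31" never succeeds. No real badfish process is started here.
calls = {}


def fake_run(host):
    calls[host] = calls.get(host, 0) + 1
    return host == "h25" or (host == "h29" and calls[host] >= 2)


still_failing = retry_hosts(["h25", "h29", "h31"], fake_run, attempts=3, sleep=lambda s: None)
print(still_failing)
```

For a real run, `run` could be something like `lambda h: run_with_subprocess(build_command(h, user, password))`; hosts that are still pending after the last round are the ones worth inspecting on the iDRAC side.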
Closing this due to inactivity. We haven't received any other similar reports, and there have been loads of container runs since then.