[BUG] Issue with batch processing host list from the container

Open sanjaychari opened this issue 2 years ago • 1 comments

Your System Details

  • Operating System: RHEL 7.8
  • Target System Type: Dell
  • Podman version 1.6.4

Describe the bug

This is an issue that occurs intermittently on Dell hosts. While trying to run actions related to boot order in a batch through the --host-list parameter, the action fails on some hosts.

[sanjay@perfc-360g8-04 jetpack]$ podman run -it -v /home/sanjay/jetpack/badfish:/dell --rm quay.io/quads/badfish --host-list /dell/dell-hosts -u quads -p <password> -i config/idrac_interfaces.yml --check-boot
[mgmt-e24-h25-740xd] - INFO     - Executing actions on host: mgmt-e24-h25-740xd.example.com
[mgmt-e24-h25-740xd] - WARNING  - Current boot order is set to: director.
[mgmt-e24-h25-740xd] - INFO     - ************************************************
[mgmt-e24-h27-740xd] - INFO     - Executing actions on host: mgmt-e24-h27-740xd.example.com
[mgmt-e24-h27-740xd] - WARNING  - Current boot order is set to: director.
[mgmt-e24-h27-740xd] - INFO     - ************************************************
[mgmt-e24-h29-740xd] - ERROR    - Failed to communicate with mgmt-e24-h29-740xd.example.com
[mgmt-e24-h29-740xd] - INFO     - ************************************************
[mgmt-e24-h31-740xd] - ERROR    - Failed to communicate with mgmt-e24-h31-740xd.example.com
[mgmt-e24-h31-740xd] - INFO     - ************************************************
[mgmt-e24-h33-740xd] - INFO     - Executing actions on host: mgmt-e24-h33-740xd.example.com
[mgmt-e24-h33-740xd] - WARNING  - Current boot order is set to: director.
[mgmt-e24-h33-740xd] - INFO     - ************************************************
[src.badfish.helpers.logger] - INFO     - RESULTS:
[src.badfish.helpers.logger] - INFO     - mgmt-e24-h25-740xd.alias.bos.scalelab.redhat.com: SUCCESSFUL
[src.badfish.helpers.logger] - INFO     - mgmt-e24-h27-740xd.alias.bos.scalelab.redhat.com: SUCCESSFUL
[src.badfish.helpers.logger] - INFO     - mgmt-e24-h29-740xd.alias.bos.scalelab.redhat.com: FAILED
[src.badfish.helpers.logger] - INFO     - mgmt-e24-h31-740xd.alias.bos.scalelab.redhat.com: FAILED
[src.badfish.helpers.logger] - INFO     - mgmt-e24-h33-740xd.alias.bos.scalelab.redhat.com: SUCCESSFUL

However, if the same action is run individually through the badfish python script on the failed hosts, it is successful.

(venv) (base) [schari@schari badfish]$ python3 src/badfish/badfish.py -H mgmt-e24-h31-740xd.example.com -u quads -p <password> -i config/idrac_interfaces.yml --check-boot
- WARNING  - Current boot order is set to: director.
(venv) (base) [schari@schari badfish]$ python3 src/badfish/badfish.py -H mgmt-e24-h31-740xd.example.com -u quads -p <password> -i config/idrac_interfaces.yml --check-boot
- WARNING  - Current boot order is set to: director.

After some time, the action also succeeds through batch processing in the container. However, this happens long after the action has started succeeding on individual hosts through the python script.

Expected Behavior

Batch processing of hosts through the badfish container should return the same results for all hosts, at the same time, as individual processing of the same hosts through the badfish python script.
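To compare batch and individual runs, it helps to pull the failed hosts out of the RESULTS summary so they can be re-run one at a time with -H. The following is a minimal sketch, assuming the summary lines keep the "host: SUCCESSFUL" / "host: FAILED" shape shown in the log above; the helper name failed_hosts is hypothetical and not part of badfish.

```python
import re

# Matches summary lines such as:
# [src.badfish.helpers.logger] - INFO     - host.example.com: FAILED
# The line shape is taken from the log in this report and is an assumption
# about badfish output in general.
RESULT_LINE = re.compile(
    r"-\s+INFO\s+-\s+(?P<host>\S+):\s+(?P<status>SUCCESSFUL|FAILED)\s*$"
)


def failed_hosts(log_text):
    """Return the hosts marked FAILED in the RESULTS summary, in log order."""
    failed = []
    for line in log_text.splitlines():
        match = RESULT_LINE.search(line)
        if match and match.group("status") == "FAILED":
            failed.append(match.group("host"))
    return failed
```

Feeding it the captured container output returns only the hosts from the summary, ignoring the per-host "Failed to communicate" lines, so each host is listed once.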

sanjaychari avatar Oct 18 '22 05:10 sanjaychari

The iDRAC becomes unresponsive when containerized badfish performs bulk actions, which I think is why it fails with "Failed to communicate with" the host.

rajeshP524 avatar Oct 21 '22 06:10 rajeshP524

Hey @sanjaychari, in your command example I don't see you mapping your --host-list target file anywhere. Is that being mapped? For example, -v /tmp/my-hosts:/tmp/my-hosts:z, so you can refer to it with --host-list /tmp/my-hosts.

On iDRAC instability: they are just fragile. It's recommended to reboot them via --racreset before doing bulk operations on them to alleviate problems. They aren't designed to take a lot of heavy API calls, unfortunately, and there's not a lot we can do about that.
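Since the iDRACs do not cope well with many API calls at once, one way to go easier on them is to retry only the failed hosts, one at a time, with a growing pause between rounds. This is a sketch of that idea, not something badfish does itself or that was proposed in this thread. The run_one callable is hypothetical: it stands in for whatever wrapper invokes badfish with -H for a single host and reports success.

```python
import time


def retry_failed(hosts, run_one, attempts=3, delay=30.0, sleep=time.sleep):
    """Retry each host individually, pausing between rounds.

    run_one is a hypothetical callable: it takes a hostname and returns True
    on success. Returns the hosts that still fail after all attempts.
    """
    pending = list(hosts)
    for attempt in range(attempts):
        if not pending:
            break
        if attempt > 0:
            # Wait longer before each new round: delay, 2*delay, 4*delay, ...
            sleep(delay * 2 ** (attempt - 1))
        # Keep only the hosts that failed again in this round.
        pending = [host for host in pending if not run_one(host)]
    return pending
```

The sleep parameter is injectable only so the function can be exercised without real waiting. The default delay and attempt count are arbitrary placeholders.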

sadsfae avatar Dec 14 '22 18:12 sadsfae

Hi @sadsfae,

The hosts file dell-hosts is at /home/sanjay/jetpack/badfish on the host. I am mounting the entire directory onto the badfish container with -v /home/sanjay/jetpack/badfish:/dell and then passing the hosts list using --host-list /dell/dell-hosts.
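The mapping described above can be checked mechanically: a file under the mounted host directory appears at the same relative location under the container mount point. A minimal sketch, assuming POSIX-style paths and a single -v src:dst mount; the function name container_path is hypothetical.

```python
import posixpath


def container_path(host_path, mount_src, mount_dst):
    """Translate a host path to its path inside the container for a -v src:dst mount."""
    src = posixpath.normpath(mount_src)
    path = posixpath.normpath(host_path)
    # Reject files that the mount does not expose to the container.
    if path != src and not path.startswith(src + "/"):
        raise ValueError(f"{host_path} is not under the mounted directory {mount_src}")
    relative = posixpath.relpath(path, src)
    return posixpath.normpath(posixpath.join(mount_dst, relative))
```

With the values from this report, container_path("/home/sanjay/jetpack/badfish/dell-hosts", "/home/sanjay/jetpack/badfish", "/dell") gives "/dell/dell-hosts", which matches the --host-list argument in the original command.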

We already perform a racreset on all systems in Jetpack before setting the boot order (https://github.com/redhat-performance/jetpack/blob/b48d6bbe632a4f774605467c449ca2f68928f7cf/set_boot_order.yml#L11-L26).

sanjaychari avatar Dec 15 '22 04:12 sanjaychari

@sanjaychari ah ok, maybe you can show us or we can look at the hosts internally? What part of your playbook is returning the unavailable return status? cc: @grafuls

Is the behavior sporadic in that sometimes it works and sometimes it doesn't?

Have you also tried forcefully clearing the iDRAC job queue via --clear-jobs --force? Give this a try too, and then reboot any that don't respond the way you want.

If the behavior is sporadic, it's probably not a badfish issue (otherwise it wouldn't work at all), but maybe we can isolate and remedy the behavior on the iDRAC side.

sadsfae avatar Dec 15 '22 11:12 sadsfae

Closing this due to inactivity. We haven't received any other similar reports, and there have been loads of container runs since then.

sadsfae avatar Apr 05 '23 12:04 sadsfae