azure_rm.py: Databricks provisioning/deprovisioning resources cause Azure inventory updates to fail
SUMMARY
Whenever Databricks provisions or destroys resources on demand, the inventory update fails.
ISSUE TYPE
- Bug Report
COMPONENT NAME
azure_rm inventory plugin
ANSIBLE VERSION
ansible 2.9.18
COLLECTION VERSION
latest
CONFIGURATION
OS / ENVIRONMENT
Red Hat Enterprise Linux 7.9
STEPS TO REPRODUCE
- Create an inventory source with the subscription that contains Databricks resources
- Refresh the inventory
- While Databricks is creating/destroying resources, the inventory update fails
- Wait about 2 minutes (for the resources to finish being created/destroyed)
- Run the inventory update again; it succeeds
EXPECTED RESULTS
- To safely gather the inventory, ideally without querying the Databricks resources at all
ACTUAL RESULTS
Here are the logs from Ansible Tower:
2.443 INFO Updating inventory 7: INVDY_SG01
2.499 DEBUG Using base command: python /usr/bin/ansible-inventory -i /tmp/bwrap_8129_l8808mjo/awx_8129_0aoilm_o/azure_rm.yml --playbook-dir /tmp/bwrap_8129_l8808mjo/awx_8129_0aoilm_o -vvvvv
2.499 INFO Reading Ansible inventory source: /tmp/bwrap_8129_l8808mjo/awx_8129_0aoilm_o/azure_rm.yml
2.500 INFO Using VIRTUAL_ENV: /var/lib/awx/venv/ansible
2.500 INFO Using PATH: /var/lib/awx/venv/ansible/bin:/var/lib/awx/venv/awx/bin:/opt/rh/rh-postgresql10/root/usr/bin:/var/lib/awx/venv/awx/bin:/var/lib/awx/venv/awx/bin:/opt/rh/rh-postgresql10/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2.500 INFO Using PYTHONPATH: /var/lib/awx/venv/ansible/lib/python2.7/site-packages:
2.510 DEBUG Using private credential data in '/tmp/bwrap_8129_l8808mjo/awx_8129_0aoilm_o'.
2.511 DEBUG Using fresh temporary directory '/tmp/awx_proot_o2mt6u8u' for isolation.
2.512 DEBUG Running from `/tmp/bwrap_8129_l8808mjo/awx_8129_0aoilm_o` working directory.
Traceback (most recent call last):
File "/usr/bin/awx-manage", line 11, in <module>
load_entry_point('awx==3.8.2', 'console_scripts', 'awx-manage')()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/__init__.py", line 154, in manage
execute_from_command_line(sys.argv)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
utility.execute()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/__init__.py", line 375, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/base.py", line 323, in run_from_argv
self.execute(*args, **cmd_options)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/base.py", line 364, in execute
output = self.handle(*args, **options)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/management/commands/inventory_import.py", line 1142, in handle
raise exc
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/management/commands/inventory_import.py", line 1032, in handle
venv_path=venv_path, verbosity=self.verbosity).load()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/management/commands/inventory_import.py", line 208, in load
return self.command_to_json(base_args + ['--list'])
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/management/commands/inventory_import.py", line 191, in command_to_json
self.method, proc.returncode, stdout, stderr))
RuntimeError: ansible-inventory failed (rc=1) with stdout:
stderr:
ansible-inventory 2.9.18
config file = /etc/ansible/ansible.cfg
configured module search path = [u'/var/lib/awx/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible-inventory
python version = 2.7.5 (default, Aug 13 2020, 02:51:10) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
Using /etc/ansible/ansible.cfg as config file
setting up inventory plugins
[WARNING]: * Failed to parse /tmp/bwrap_8129_l8808mjo/awx_8129_0aoilm_o/azure_rm.yml with auto plugin: a batched request failed with status code 404, url /subscriptions/[subscription]/resourceGroups/DATABRICKS-[resource group]/providers/Microsoft.Compute/virtualMachines/ccdaa862f41c4c92af342cf11276ac61/instanceView
File "/usr/lib/python2.7/site-packages/ansible/inventory/manager.py", line 280, in parse_source
plugin.parse(self._inventory, self._loader, source, cache=cache)
File "/usr/lib/python2.7/site-packages/ansible/plugins/inventory/auto.py", line 58, in parse
plugin.parse(inventory, loader, path, cache=cache)
File "/var/lib/awx/vendor/awx_ansible_collections/ansible_collections/azure/azcollection/plugins/inventory/azure_rm.py", line 220, in parse
self._get_hosts()
File "/var/lib/awx/vendor/awx_ansible_collections/ansible_collections/azure/azcollection/plugins/inventory/azure_rm.py", line 277, in _get_hosts
self._process_queue_batch()
File "/var/lib/awx/vendor/awx_ansible_collections/ansible_collections/azure/azcollection/plugins/inventory/azure_rm.py", line 419, in _process_queue_batch
raise AnsibleError("a batched request failed with status code {0}, url {1}".format(status_code, result.url))
[WARNING]: Unable to parse /tmp/bwrap_8129_l8808mjo/awx_8129_0aoilm_o/azure_rm.yml as an inventory source
ERROR! No inventory was parsed, please check your configuration and options.
@nurdiyana-ali From the perspective of your reproduction steps, this is not a bug. The resource was created and a service was deployed, but its state does not change immediately; it only becomes usable after a period of time. For example, it takes about one hour for an API Management service to finish deploying after each creation. Thank you very much!
@Fred-sun Is there a way to force azure_rm.py not to scan the Databricks resources? It is failing too often during peak hours. For instance, could we scope enumeration to only our own resource groups, along the lines of the sketch below?
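A sketch of what I mean in azure_rm.yml (assuming the plugin's include_vm_resource_groups option behaves as documented; the group names here are placeholders):
plugin: azure_rm
# Only enumerate VMs from the listed resource groups, so the transient
# Databricks-managed groups are never queried.
include_vm_resource_groups:
  - my-app-rg
  - my-web-rg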
@nurdiyana-ali I'll take a look. I'm not sure yet. Thank you very much!
Hi @Fred-sun
At line 417 of azure_rm.py, we commented out the status_code check so that the inventory does not stop gathering whenever a resource returns a 404. We think more could be done here to skip only the affected host rather than raising an AnsibleError and halting the whole run (see the sketch after the snippet). The relevant loop:
for idx, r in enumerate(batch_resp[key_name]):
    status_code = r.get('httpStatusCode')
    returned_name = r['name']
    result = batch_response_handlers[returned_name]
    if status_code != 200:
        # FUTURE: error-tolerant operation mode (eg, permissions)
        raise AnsibleError("a batched request failed with status code {0}, url {1}".format(status_code, result.url))
    # FUTURE: store/handle errors from individual handlers
    result.handler(r['content'], **result.handler_args)
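For illustration, an error-tolerant variant might look like the following (our idea only, not the shipped plugin code; it assumes the same loop context as above and uses Ansible's standard Display warning helper):
from ansible.errors import AnsibleError
from ansible.utils.display import Display

display = Display()

def process_batch(batch_resp, key_name, batch_response_handlers):
    for r in batch_resp[key_name]:
        status_code = r.get('httpStatusCode')
        result = batch_response_handlers[r['name']]
        if status_code == 404:
            # The resource (e.g. a Databricks VM) was deleted between listing
            # and the batched instanceView call; warn and skip it.
            display.warning("skipping {0}: batched request returned 404".format(result.url))
            continue
        if status_code != 200:
            raise AnsibleError("a batched request failed with status code {0}, url {1}".format(status_code, result.url))
        # FUTURE: store/handle errors from individual handlers
        result.handler(r['content'], **result.handler_args)
This would keep a single vanished resource from aborting the entire inventory run while still failing loudly on unexpected errors.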
Hi,
My customer has also faced this kind of issue. It occurs intermittently, but it seems most likely to happen while the Databricks VMs are in the "Creating" or "Deleting" stage.
Can we ignore the corrupted data in the JSON generation procedure, or retry to retrieve the correct data as expected? (See the defensive sketch after the traces below.)
Here are example outputs of the stack traces.
- Missing NIC information from the VM?
ansible-inventory 2.9.15
config file = /etc/ansible/ansible.cfg
configured module search path = [u'/var/lib/awx/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible-inventory
python version = 2.7.5 (default, May 27 2022, 11:27:32) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
Using /etc/ansible/ansible.cfg as config file
[WARNING]: * Failed to parse /tmp/bwrap_115511_vltbu_ah/awx_115511_iubd5xvc/azure_rm.yml with auto plugin: 'id'
File "/usr/lib/python2.7/site-packages/ansible/inventory/manager.py", line 280, in parse_source
plugin.parse(self._inventory, self._loader, source, cache=cache)
File "/usr/lib/python2.7/site-packages/ansible/plugins/inventory/auto.py", line 58, in parse
plugin.parse(inventory, loader, path, cache=cache)
File "/var/lib/awx/vendor/awx_ansible_collections/ansible_collections/azure/azcollection/plugins/inventory/azure_rm.py", line 206, in parse
self._get_hosts()
File "/var/lib/awx/vendor/awx_ansible_collections/ansible_collections/azure/azcollection/plugins/inventory/azure_rm.py", line 263, in _get_hosts
self._process_queue_batch()
File "/var/lib/awx/vendor/awx_ansible_collections/ansible_collections/azure/azcollection/plugins/inventory/azure_rm.py", line 408, in _process_queue_batch
result.handler(r['content'], **result.handler_args)
File "/var/lib/awx/vendor/awx_ansible_collections/ansible_collections/azure/azcollection/plugins/inventory/azure_rm.py", line 616, in _on_pip_response
self.public_ips[pip_model['id']] = AzurePip(pip_model)
[WARNING]: Unable to parse /tmp/bwrap_115511_vltbu_ah/awx_115511_iubd5xvc/azure_rm.yml as an inventory source
ERROR! No inventory was parsed, please check your configuration and options.
- Another case of missing NIC information
[WARNING]: * Failed to parse /tmp/bwrap_114098_12c3drzu/awx_114098_ktpd7w7l/azure_rm.yml with auto plugin: 'properties'
File "/usr/lib/python2.7/site-packages/ansible/inventory/manager.py", line 280, in parse_source
plugin.parse(self._inventory, self._loader, source, cache=cache)
File "/usr/lib/python2.7/site-packages/ansible/plugins/inventory/auto.py", line 58, in parse
plugin.parse(inventory, loader, path, cache=cache)
File "/var/lib/awx/vendor/awx_ansible_collections/ansible_collections/azure/azcollection/plugins/inventory/azure_rm.py", line 206, in parse
self._get_hosts()
File "/var/lib/awx/vendor/awx_ansible_collections/ansible_collections/azure/azcollection/plugins/inventory/azure_rm.py", line 277, in _get_hosts
if self._filter_host(inventory_hostname, h.hostvars):
File "/var/lib/awx/vendor/awx_ansible_collections/ansible_collections/azure/azcollection/plugins/inventory/azure_rm.py", line 539, in hostvars
for ipc in sorted(nic._nic_model['properties']['ipConfigurations'], key=lambda i: i['properties'].get('primary', False), reverse=True):
[WARNING]: Unable to parse /tmp/bwrap_114098_12c3drzu/awx_114098_ktpd7w7l/azure_rm.yml as an inventory source
ERROR! No inventory was parsed, please check your configuration and options.
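One way to tolerate the partial models would be a defensive lookup around the failing hostvars line, for example (a sketch based on the traceback above, not the shipped code):
def sorted_ip_configurations(nic_model):
    # Return ipConfigurations primary-first; an empty list if the NIC model
    # has not been fully populated yet (missing 'properties' or
    # 'ipConfigurations'), instead of raising KeyError.
    ip_configs = (nic_model.get('properties') or {}).get('ipConfigurations') or []
    return sorted(ip_configs,
                  key=lambda ipc: (ipc.get('properties') or {}).get('primary', False),
                  reverse=True)
A VM whose NIC is still being attached would then simply report no IP configurations instead of breaking the whole parse.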
Could anyone on the development team investigate this issue so that the error can be avoided? If you need more details to reproduce it, please let me know. Thank you in advance.
@sugitk If you can provide a detailed reproduction procedure, use cases, and logs, it will help solve this problem. Thank you!
@sugitk @nurdiyana-ali Since this problem is difficult to reproduce on my side and there is no effective solution yet, and since it occurs readily on your side, please replace the following content (around line 426 of azure_rm.py) and report whether similar problems still occur. My suspicion: an error (restart, shutdown, delete, etc.) may have occurred for some reason while the object information was being retrieved. Thank you very much!
if status_code != 200:
    # FUTURE: error-tolerant operation mode (eg, permissions)
    raise AnsibleError("a batched request failed with status code {0}, url {1}".format(status_code, result.url))
Changed as below:
if status_code != 200:
    # FUTURE: error-tolerant operation mode (eg, permissions)
    if status_code >= 500:
        pass
    else:
        raise AnsibleError("a batched request failed with status code {0}, url {1}".format(status_code, result.url))
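Note: as written, this tolerates only 5xx server errors; the 404 responses shown earlier in this thread would still raise, so the test mainly helps narrow down which status codes are actually involved.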
@Fred-sun I'm answering on behalf of @sugitk. Please be aware that Ansible retrieves the VM information from Azure through the REST API using the inventory plugin (azure_rm.py). When a Databricks cluster is under maintenance, such as in a provisioning or deprovisioning state, the VM information is likely to be incomplete (missing NIC information and so on), and the inventory plugin then fails to update the inventory. I am not sure whether our client still has the same environment available for testing. Could you please explain specifically, in detail, how to apply the change you suggested above?
FYI: we communicated with the Microsoft Azure team directly regarding this issue and identified the root cause, but a fix has not been implemented yet.
- From the Azure perspective, a VM and its NIC are two separate resources.
- These resources can fall out of synchronization, so in some situations the VM information does not yet include the NIC information because the NIC is not associated with it.
- We requested a fix so that this no longer causes the error.
@garamirseokim A PR has been proposed for this purpose: if a VM is in an abnormal state, it is skipped and the other VMs are not affected. For details, see PR #1157. You can either replace the file with the updated one from the PR or make the change locally. Thanks!
@Fred-sun We will wait for the release. Please kindly let me know once it is released.
Thanks.
@garamirseokim Ok, thank you for your attention!