ansible-datadog
datadog_disable_untracked_checks: true should be the default behavior and is too slow
Also, and please correct me if I am wrong, there is currently no way to "remove" a configuration which was previously configured without also using `datadog_disable_untracked_checks: true` and needing to list all tracked checks?
Originally posted by @rockaut in https://github.com/DataDog/ansible-datadog/issues/366#issuecomment-1313844486
This flag appears to be the only way to delete conf.d entries previously configured by merging/combining into `datadog_checks`.
Problem 1
- Configure some checks for your instance
- Decide you don't need them anymore
- Remove the configuration
Expected Result: Configs are removed when running the datadog playbook
Actual Result: The default for `datadog_disable_untracked_checks` is `false`, so the config is not removed, just orphaned!
Problem 2
- Enable deletion of untracked checks with datadog_disable_untracked_checks: true
- Run playbook
Expected result: Relatively fast iteration through all the directories and deletion of unused checks.
Actual result: several minutes of wall time on an idle Linux virtual machine with 4 cores, 8 GiB RAM, and an Amazon gp2 volume.
I have verified that the filesystem operations do not take very long on the same host. What's the most effective way to get timing information to add to this bug report from ansible?
# time for f in /etc/datadog-agent/conf.d/*.d/*; do test -f ${f} && cat ${f} >/dev/null; done
real 0m0.174s
user 0m0.092s
sys 0m0.095s
The playbook was run over a VPN connection and there is some latency for SSH, so the bulk of the time comes from a new SSH connection being made for each checked file, rather than, say, batching the operation on the node. By default, without customization, there are 173 checks as of this bug being filed. Each per-file scp takes at least 1 second, and in my case usually closer to 3!
This is due to the use of loop in https://github.com/DataDog/ansible-datadog/blob/bf03a08f2c8a9f787d7acd0c91fd970a5783c743/tasks/_agent-linux-macos-shared.yml#L30 and is painful!
There are multiple ways to address this including:
- async and poll keywords to parallelize each loop item (see the sketch after this list)
- higher parallelism keywords
- computing and flattening the list of files to be removed in a separate operation and passing it to a single deletion task that ensures the files are absent (but this would, I think, require a new file management module, as Ansible is notorious for not supporting bulk operations like this)
- probably others, though we run into general "ansible is slow" challenges
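As a rough illustration of the first and third bullets, here is a minimal, hypothetical sketch (not the role's actual tasks) that gathers the existing check directories in a single `find` pass and then fires the deletions as parallel async jobs. The variable `datadog_tracked_checks` and the matching condition are assumptions made up for this example:

```yaml
# Hypothetical sketch only -- not taken from the role. Assumes a
# datadog_tracked_checks list holding the names of checks to keep.
- name: Find all check config directories in one pass
  ansible.builtin.find:
    paths: /etc/datadog-agent/conf.d
    file_type: directory
    patterns: "*.d"
  register: check_dirs

- name: Remove conf.yaml for checks that are no longer tracked (fire and forget)
  ansible.builtin.file:
    path: "{{ item.path }}/conf.yaml"
    state: absent
  loop: "{{ check_dirs.files }}"
  when: (item.path | basename | splitext | first) not in datadog_tracked_checks
  async: 60
  poll: 0
  register: removal_jobs

- name: Wait for all removals to finish
  ansible.builtin.async_status:
    jid: "{{ item.ansible_job_id }}"
  loop: "{{ removal_jobs.results | selectattr('ansible_job_id', 'defined') | list }}"
  register: job_status
  until: job_status.finished
  retries: 30
  delay: 2
```

This still pays one SSH round trip per removed file, but only for files that actually need removing, and the async jobs overlap instead of running strictly one after another.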
> I have verified that the filesystem operations do not take very long on the same host. What's the most effective way to get timing information to add to this bug report from ansible?
Have a look at ansible callback plugins.
https://docs.ansible.com/ansible/latest/plugins/callback.html and https://docs.ansible.com/ansible/latest/collections/index_callback.html
Basically, you put
[defaults]
callbacks_enabled = ansible.posix.profile_roles,ansible.posix.profile_tasks
into ansible.cfg and run again. It's great, but I would suggest not keeping it active all the time. It sometimes bloats the logs pretty heavily :D
On the other thing: yesterday I had the idea to generate the configs first on localhost and then rsync (ansible.posix.synchronize) the whole "package" to the hosts.
So:
- categorize play to remote hosts with ansible_facts, services, packages, etc. (maybe cached)
- configurize play with delegate_to localhost to create a /tmp/{{ inventory_hostname }}/conf.d
- rollout play to sync it to hosts
I will try that in the next few weeks. That should speed things up immensely. Additionally, I thought about packaging it up in a tar/zip and keeping it on the remotes to md5sum it? IDK
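To make that idea a bit more concrete, here is a very rough sketch of what the configurize and rollout plays could look like. The staging path, the shape of `datadog_checks`, and the `synchronize` options are assumptions for illustration, not a tested implementation:

```yaml
# Hypothetical sketch of "render locally, then sync" -- not actual role code.
- name: Render check configs on the control node
  hosts: all
  gather_facts: false
  tasks:
    - name: Create a per-host staging directory on localhost
      ansible.builtin.file:
        path: "/tmp/{{ inventory_hostname }}/conf.d/{{ item.key }}.d"
        state: directory
      loop: "{{ datadog_checks | dict2items }}"
      delegate_to: localhost

    - name: Write each check config into the staging directory
      ansible.builtin.copy:
        content: "{{ item.value | to_nice_yaml }}"
        dest: "/tmp/{{ inventory_hostname }}/conf.d/{{ item.key }}.d/conf.yaml"
      loop: "{{ datadog_checks | dict2items }}"
      delegate_to: localhost

- name: Sync the rendered package to the hosts
  hosts: all
  become: true
  tasks:
    - name: Rsync the staged conf.d onto the agent config directory
      ansible.posix.synchronize:
        src: "/tmp/{{ inventory_hostname }}/conf.d/"
        dest: /etc/datadog-agent/conf.d/
        # delete: true would also handle removal of stale configs, but it would
        # wipe the agent's bundled example files as well, so it needs care.
```

One rsync per host replaces the per-file copies, which is where most of the wall time goes.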
~~I also still have the idea in the back of my head to use consul/etcd so we don't need that at all. This way we just need to use datadog.datadog for the first install with minimal settings and for upgrades. The configuration plays then send the configs to the KV store. But that's something for the summer to try out I guess :D Also I don't know if that is something the DD guys even "support" - I mean it's in the configs (datadog.yaml - config_providers) but I will try to get in touch with them about this before investing too much time.~~ I tried it but realized that this is only for containerized environments, as it only looks for the container id (basically spec.containers[0].image) and not for hostnames ... well, it was a try.
Hi :wave: thanks for opening the issue and sorry for taking so long to respond. Let me try to address your problems:
Problem 1
I see what you mean here. If a check was configured through Ansible and then you remove its configuration, the check should be disabled. I agree that this would be a reasonable expectation. I would be worried about making this change, though, because people might be counting on this working the way it does now, and any change could result in unexpected data loss for existing users. I think this would be something to address in a new major version of the role, with the following note:
The question is, what happens in this case to checks configured by other means when `datadog_disable_untracked_checks` is `false`? Because in this case, I think checks configured outside of Ansible shouldn't be deactivated (?)
Here's what I think we could do:
- Create a file, e.g. in `/etc/datadog-agent/`, which would contain a list of checks configured through Ansible.
- (when `datadog_disable_untracked_checks` is `false`) If we find out during the role execution that a check from the file is no longer configured, deactivate it (when `datadog_disable_untracked_checks` is `true`, we would deactivate it anyway).

Does that make sense? (Again, this would probably only land in a new major version.)
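To make the proposal concrete, here is a minimal sketch of what that could look like, assuming a hypothetical state file name and the role's `datadog_checks` dict; none of this is actual role code:

```yaml
# Hypothetical sketch of the "track what Ansible configured" idea.
# The state file name is made up for this example.
- name: Check for the state file from the previous run
  ansible.builtin.stat:
    path: /etc/datadog-agent/ansible_managed_checks.yaml
  register: state_file

- name: Load the previously managed check list
  ansible.builtin.slurp:
    src: /etc/datadog-agent/ansible_managed_checks.yaml
  register: previous_state
  when: state_file.stat.exists

- name: Compute which checks were dropped since the previous run
  ansible.builtin.set_fact:
    dropped_checks: >-
      {{ ((previous_state.content | b64decode | from_yaml) if state_file.stat.exists else [])
         | difference(datadog_checks.keys() | list) }}

- name: Remove configs for dropped checks
  ansible.builtin.file:
    path: "/etc/datadog-agent/conf.d/{{ item }}.d/conf.yaml"
    state: absent
  loop: "{{ dropped_checks }}"

- name: Record the checks managed by this run
  ansible.builtin.copy:
    content: "{{ datadog_checks.keys() | list | to_nice_yaml }}"
    dest: /etc/datadog-agent/ansible_managed_checks.yaml
```

With `datadog_disable_untracked_checks: true`, the removal step would simply apply to everything not in `datadog_checks` instead of only to checks recorded in the state file.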
Problem 2
Right, so this is an optimization issue that looks like it should be addressable; I acknowledge that this is a real pain point. I'll make sure to put this on our radar and we'll see what we can do about it.
@bkabrda any update on problem 2? It's very painful :/ Or is there maybe a workaround?
Problem 2 (the speed issue) was solved by #584.
As for Problem 1 (enabling this by default), as mentioned, changing the default behavior would be backwards incompatible and would thus require care with versioning. I'll create a backlog card to track this, but it's unlikely it will be tackled short-term.