opentelemetry: memory "leak"/inefficient data handling
Summary
We have a playbook that runs many loops and handles quite a bit of data. The opentelemetry callback plugin seems to collect all of this data, and memory consumption grows very quickly: for this playbook the plugin uses more than 24 GB of memory. We have already disabled logs and logs in attributes, but that makes no difference, as it does not change the data collection at all.
Issue Type
Bug Report
Component Name
opentelemetry callback plugin
Ansible Version
Independent of any Ansible version.
Community.general Version
8.6.0
Configuration
$ ansible-config dump --only-changed
OS / Environment
CentOS 7
Steps to Reproduce
Any playbook that puts a lot of data into the Ansible task "result".
Expected Results
The opentelemetry callback plugin shouldn't collect all of the responses. It should also send the data continuously instead of collecting it until the end of the run; that would allow the memory to be freed.
Actual Results
Code of Conduct
- [X] I agree to follow the Ansible Code of Conduct
Files identified in the description:
lib/ansible/plugins/callback
If these files are incorrect, please update the component name section of the description or use the !component bot command.
!component =plugins/callback/opentelemetry.py
cc @v1v
Hi @OneCyrus,
can you provide a bit more context about your setup?
- Ansible version
- Python version
- OpenTelemetry libraries:
- opentelemetry-api
- opentelemetry-exporter-otlp
- opentelemetry-sdk
In addition, can you share your playbook (without any sensitive details) to help with reproducing the error?
Thanks
Some environment variables can help reduce the number of attributes that are stored per span/event.
In addition, the existing implementation uses the BatchSpanProcessor, which can be configured with some further environment variables, as explained here.
Have you tried configuring those environment variables? Did you see any improvements?
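For reference, here is a minimal sketch of the SDK-side limits those environment variables map to. The SpanLimits parameters and the OTEL_* names come from the OpenTelemetry Python SDK and specification, the values are purely illustrative, and since the callback plugin builds its own TracerProvider you would normally set these through the environment before running ansible-playbook rather than in code:

```python
# Illustrative only: shows what the span/event limits control at the SDK level.
from opentelemetry.sdk.trace import SpanLimits, TracerProvider

limits = SpanLimits(
    max_attributes=32,         # roughly OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT
    max_events=16,             # roughly OTEL_SPAN_EVENT_COUNT_LIMIT
    max_attribute_length=512,  # roughly OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT
)

provider = TracerProvider(span_limits=limits)
```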
The playbook is rather simple.

`playbook.yaml`:
```yaml
- hosts: "localhost"
  vars:
    efficientip_system: "fqdn"
    efficientip_user: "user"
    efficientip_password: "pw"
  tasks:
    - name: Get IP address count for page numbers
      ansible.builtin.uri:
        url: "https://{{ efficientip_system }}/api/v2.0/ipam/address/count"
        validate_certs: no
        url_username: "{{ efficientip_user }}"
        url_password: "{{ efficientip_password }}"
      register: ipaddresses_count

    - name: Get all IP addresses
      ansible.builtin.include_tasks: efficientip_getpage_addresses.yaml
      with_sequence: start=0 end="{{ ipaddresses_count.json.data[0].count }}" stride=1000
```
`efficientip_getpage_addresses.yaml`:
- name: "Get all IP addresses: {{ item }}"
ansible.builtin.uri:
url: "https://{{ efficientip_system }}/api/v2.0/ipam/address/list?limit=1000&offset={{ item }}"
validate_certs: no
url_username: "{{ efficientip_user }}"
url_password: "{{ efficientip_password }}"
dest: "ipaddresses_{{ efficientip_system }}_{{ item }}.json"
The problem is that we are collecting data from an endpoint that returns hundreds of MB, and basically all of the retrieved data ends up in the Ansible result object. Since each request is saved directly to disk, the playbook itself is not a problem: there is no memory spike when running it without the opentelemetry callback plugin.
The problem is that the callback plugin collects all the attributes here even when ANSIBLE_OPENTELEMETRY_DISABLE_LOGS is enabled. The memory "leak" happens before any OpenTelemetry code is actually run.
https://github.com/ansible-collections/community.general/blob/7dd7cbdba8b5ca422fbe826cad63392fe1f50ee7/plugins/callback/opentelemetry.py#L330-L338
But there seems to be another issue with the actual OpenTelemetry code: all the spans are sent only after the whole playbook has run, which basically means everything is kept in memory until the end. This also results in OTLP payloads that outgrow the 4 MB gRPC limit, so there also seems to be an issue with the BatchProcessor implementation.
Basically, we get this error if a playbook makes it to the end:
```
Transient error StatusCode.RESOURCE_EXHAUSTED encountered while exporting traces to otlp.mydomain.com, retrying in 1s.
Transient error StatusCode.RESOURCE_EXHAUSTED encountered while exporting traces to otlp.mydomain.com, retrying in 2s.
Transient error StatusCode.RESOURCE_EXHAUSTED encountered while exporting traces to otlp.mydomain.com, retrying in 4s.
Transient error StatusCode.RESOURCE_EXHAUSTED encountered while exporting traces to otlp.mydomain.com, retrying in 8s.
Transient error StatusCode.RESOURCE_EXHAUSTED encountered while exporting traces to otlp.mydomain.com, retrying in 16s.
Transient error StatusCode.RESOURCE_EXHAUSTED encountered while exporting traces to otlp.mydomain.com, retrying in 32s.
```
So I think the main issue is that the callback plugin collects everything until the end and doesn't free up memory after each Ansible task. This results in memory consumption that keeps growing over the run instead of staying roughly constant.
Thanks @OneCyrus for providing further details.
> The problem is that the callback plugin collects all the attributes here even when ANSIBLE_OPENTELEMETRY_DISABLE_LOGS is enabled. The memory "leak" happens before any OpenTelemetry code is actually run.
IIUC, those attributes are not the problem; they are primitive values rather than complex data structures. For instance, this playbook:
```yaml
---
- name: playbook
  hosts: localhost
  connection: local
  tasks:
    - name: Get build data
      ansible.builtin.uri:
        url: "https://artifacts-api.elastic.co/v1/versions/8.14-SNAPSHOT/builds/8.14.0-db1ebed2"
        validate_certs: no
```
produced:
"attributes": {
"ansible.task.module": "ansible.builtin.uri",
"ansible.task.message": "success",
"ansible.task.name": "[localhost] playbook: Get build data",
"ansible.task.result": 0,
"ansible.task.host.name": "localhost",
"ansible.task.host.status": "ok",
"ansible.task.args.name": [
"url",
"validate_certs"
],
"ansible.task.args.value": [
"https://artifacts-api.elastic.co/v1/versions/8.14-SNAPSHOT/builds/8.14.0-db1ebed2",
"False"
],
"http.url": "https://artifacts-api.elastic.co/v1/versions/8.14-SNAPSHOT/builds/8.14.0-db1ebed2"
},
But the problem is the event that stores the _dump_results output, which contains the content of the response.
I'm trying to recall how this was implemented in https://github.com/ansible-collections/community.general/pull/4175
I decided to call self._dump_results in each callback function and store the object in the TaskData. I don't know if this could be problematic and whether it's worth using a different approach, like calling _dump_results only when needed; in fact, that's how the logstash and some other plugins work. As far as I can see, the output from self._dump_results looks like a good candidate since the data structure is Ansible-based, so there is no need to invent a new format instead of using the one from Ansible core.
It seems the _dump_results output is undoubtedly massive in your case.
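To make the tradeoff concrete, here is a rough sketch, not the plugin's actual code, of keeping only a small per-task summary and skipping the full dump when logs are disabled; the class name, the disable_logs flag and the fields kept are invented for the example:

```python
import json

from ansible.plugins.callback import CallbackBase


class OTelSketchCallback(CallbackBase):
    """Illustrative only: keep a small per-task summary instead of the full dump."""

    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'aggregate'
    CALLBACK_NAME = 'otel_sketch'

    def __init__(self):
        super().__init__()
        self.disable_logs = True   # stand-in for ANSIBLE_OPENTELEMETRY_DISABLE_LOGS
        self.task_summaries = []

    def v2_runner_on_ok(self, result):
        if self.disable_logs:
            # Keep only cheap, primitive values; the potentially multi-MB response
            # body in result._result is never serialized or retained.
            summary = {
                'task': result._task.get_name(),
                'host': result._host.get_name(),
                'status': 'ok',
            }
        else:
            # Current behaviour (simplified): serialize and keep everything.
            summary = json.loads(self._dump_results(result._result))
        self.task_summaries.append(summary)
```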
> all the spans are sent only after the whole playbook has run, which basically means everything is kept in memory until the end.
The BatchSpanProcessor can be configured to flush ended spans faster.
I wonder whether you can configure the BatchSpanProcessor to release the spans sooner, as explained in https://opentelemetry-python.readthedocs.io/en/latest/sdk/trace.export.html#opentelemetry.sdk.trace.export.BatchSpanProcessor
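For illustration, here is a sketch of the BatchSpanProcessor settings that control how quickly ended spans are flushed. The constructor arguments exist in the OpenTelemetry Python SDK, the values below are made up, and for the callback plugin the equivalent OTEL_BSP_* environment variables are the practical way to tune this, since the processor is created inside the plugin:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Flush smaller batches more often so ended spans don't pile up in memory.
# The same settings can be supplied via OTEL_BSP_SCHEDULE_DELAY,
# OTEL_BSP_MAX_EXPORT_BATCH_SIZE and OTEL_BSP_MAX_QUEUE_SIZE.
processor = BatchSpanProcessor(
    ConsoleSpanExporter(),       # stand-in for the OTLP exporter
    schedule_delay_millis=1000,  # export every second instead of the default 5s
    max_export_batch_size=64,    # keep each export payload small
    max_queue_size=512,          # cap how many ended spans can queue up
)

provider = TracerProvider()
provider.add_span_processor(processor)
```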
> So I think the main issue is that the callback plugin collects everything until the end and doesn't free up memory after each Ansible task. This results in memory consumption that keeps growing over the run instead of staying roughly constant.
Yes, that's the current implementation: store the data in memory to be processed afterwards and create the span/trace correlation. And it's not efficient, as you pointed out.
I need to figure out whether we can fix the underlying issue so that ANSIBLE_OPENTELEMETRY_DISABLE_LOGS is honoured and the huge dump is not stored in memory, while also working on improving the efficiency.
OTOH, we could potentially pass a set of denied attributes that would be used to delete some keys from the _dump_results output, such as the one with the response. That's likely a good tradeoff, as it allows users to configure which data gets processed. However, it might also introduce some complexity, since the structure of the _dump_results output needs to be known in order to reduce the number of keys.
In other words, if the event contains something like:
"events": [
{
"name": {
"changed": false,
"connection": "close",
"content_security_policy": "object-src 'none'; script-src 'unsafe-inline' 'unsafe-eval' 'strict-dynamic' https: http:;",
"content_type": "application/json;charset=utf-8",
"cookies": {},
"cookies_string": "",
"date": "Wed, 15 May 2024 08:37:37 GMT",
"elapsed": 0,
"json": {
"aliases": [
"7.17-SNAPSHOT",
"8.15.0-SNAPSHOT"
]
},
"last_modified": "Wed, 15 May 2024 08:33:34 UTC",
"msg": "OK (unknown bytes)",
"redirected": false,
"status": 200,
"strict_transport_security": "max-age=15724800; includeSubDomains",
"transfer_encoding": "chunked",
"url": "https://artifacts-api.elastic.co/v1/versions",
"vary": "Accept-Encoding",
"x_content_type_options": "nosniff",
"x_download_options": "noopen",
"x_frame_options": "DENY",
"x_permitted_cross_domain_policies": "none",
"x_xss_protection": "1; mode=block"
},
"timestamp": "2024-05-15T08:37:37.839467Z",
"attributes": {}
}
],
then a new option called exclude_attribute_logs could be made available and set to name.json,name.url,name.changed,name.content_security_policy,name.content_type,name.connection, and the event would become something like the following (a rough filtering sketch is included after the example):
"events": [
{
"name": {
"cookies": {},
"cookies_string": "",
"elapsed": 0,
"last_modified": "Wed, 15 May 2024 08:33:34 UTC",
"msg": "OK (unknown bytes)",
"redirected": false,
"status": 200,
"strict_transport_security": "max-age=15724800; includeSubDomains",
"transfer_encoding": "chunked",
"url": "https://artifacts-api.elastic.co/v1/versions",
"vary": "Accept-Encoding",
"x_content_type_options": "nosniff",
"x_download_options": "noopen",
"x_frame_options": "DENY",
"x_permitted_cross_domain_policies": "none",
"x_xss_protection": "1; mode=block"
},
"timestamp": "2024-05-15T08:37:37.839467Z",
"attributes": {}
}
],
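A rough sketch of what that filtering could look like; exclude_attribute_logs is only a proposed option name at this point, and filter_event_body below is a hypothetical helper, not part of the plugin:

```python
import copy


def filter_event_body(body, excluded_keys):
    """Drop dotted keys such as 'name.json' from a nested event body.

    Hypothetical helper for the proposed exclude_attribute_logs option.
    """
    filtered = copy.deepcopy(body)
    for dotted in excluded_keys:
        parts = dotted.split('.')
        node = filtered
        # Walk down to the parent of the key to remove.
        for part in parts[:-1]:
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict):
            node.pop(parts[-1], None)
    return filtered


# Example: strip the bulky response payload before attaching the event.
event_body = {
    'name': {
        'json': {'aliases': ['7.17-SNAPSHOT', '8.15.0-SNAPSHOT']},
        'status': 200,
        'url': 'https://artifacts-api.elastic.co/v1/versions',
    }
}
print(filter_event_body(event_body, ['name.json', 'name.url']))
# -> {'name': {'status': 200}}
```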
Thanks for your analysis!
If you are interested in addressing the underlying issue, I guess the best way would be to use the OpenTelemetry tracer directly and not collect each task into tasks_data. Basically, with a global OpenTelemetry TracerProvider it should be possible to report a span after each task, so no further information about a task needs to be kept once its span has been created. In the end, generate_distributed_traces wouldn't be required at all.
Basically, register the TracerProvider in v2_playbook_on_start:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)

provider = TracerProvider()
processor = BatchSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)

# Sets the global default tracer provider
trace.set_tracer_provider(provider)

# Creates a tracer from the global tracer provider
tracer = trace.get_tracer("my.tracer.name")
```
https://opentelemetry.io/docs/languages/python/instrumentation/#acquire-tracer
Start a span in v2_playbook_on_task_start:

```python
with tracer.start_as_current_span("span-name"):
    ...
```

On task end, get the current span and enrich it with the task result:

```python
current_span = trace.get_current_span()
# enrich 'current_span' with some information
```
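Putting these pieces together, here is a minimal sketch of the approach described above (not the actual opentelemetry callback plugin): one span per task, ended as soon as the task result arrives, so nothing about the task has to stay in memory afterwards. The class name and attributes are invented for the example, and real code would still need to handle per-host results and parallel task execution:

```python
from ansible.plugins.callback import CallbackBase
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace.status import Status, StatusCode


class StreamingOTelCallback(CallbackBase):
    """Illustrative sketch: report each task as its own span as soon as it finishes."""

    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'notification'
    CALLBACK_NAME = 'streaming_otel_sketch'

    def __init__(self):
        super().__init__()
        self._span = None

    def v2_playbook_on_start(self, playbook):
        # Register the global tracer provider once per playbook run.
        provider = TracerProvider()
        provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
        trace.set_tracer_provider(provider)
        self._tracer = trace.get_tracer('ansible.playbook')

    def v2_playbook_on_task_start(self, task, is_conditional):
        self._span = self._tracer.start_span(task.get_name())

    def _finish(self, result, ok):
        if self._span is None:
            return
        # Only cheap, primitive attributes are attached; the (potentially huge)
        # result payload is never stored by the callback.
        self._span.set_attribute('ansible.host.name', result._host.get_name())
        self._span.set_status(Status(StatusCode.OK if ok else StatusCode.ERROR))
        self._span.end()   # hands the span to the BatchSpanProcessor for export
        self._span = None

    def v2_runner_on_ok(self, result):
        self._finish(result, ok=True)

    def v2_runner_on_failed(self, result, ignore_errors=False):
        self._finish(result, ok=False)
```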
Disabling the logs is now honoured and won't produce any memory leaks; the changes were merged recently, in case you want to give it a try.
Thanks for your suggestion. It might require some analysis to double-check that it will work with the existing parameters.
@v1v thanks! Once #8430 is released I can give this a try again and report back.