
opentelemetry: memory "leak"/inefficient data handling

OneCyrus opened this issue on May 13, 2024 · 9 comments

Summary

We have a playbook that does a lot of loops with quite a bit of data handling. The opentelemetry callback plugin seems to collect a lot of data, and memory consumption grows very fast: the plugin uses more than 24 GB of memory for this playbook. We already disabled logs and logs in attributes, but that doesn't make a difference, as it doesn't change the data collection at all.

[screenshot: memory usage graph of the playbook run]

Issue Type

Bug Report

Component Name

opentelemetry callback plugin

Ansible Version

Independent of the Ansible version

Community.general Version

8.6.0

Configuration

$ ansible-config dump --only-changed

OS / Environment

CentOS 7

Steps to Reproduce

Any playbook that produces a lot of data in the Ansible "result" object.

Expected Results

The opentelemetry callback plugin shouldn't collect all the responses. It should also send the data continuously instead of collecting it until the end; this would help free up memory.

Actual Results


Code of Conduct

  • [X] I agree to follow the Ansible Code of Conduct

OneCyrus · May 13 '24 12:05

Files identified in the description:

  • lib/ansible/plugins/callback

If these files are incorrect, please update the component name section of the description or use the !component bot command.

click here for bot help

ansibullbot · May 13 '24 12:05

!component =plugins/callback/opentelemetry.py

felixfontein · May 13 '24 16:05

Files identified in the description:

If these files are incorrect, please update the component name section of the description or use the !component bot command.

click here for bot help

ansibullbot · May 13 '24 16:05

cc @v1v

click here for bot help

ansibullbot · May 13 '24 16:05

Hi @OneCyrus,

can you provide a bit more context about your setup?

  • Ansible version
  • Python version
  • OpenTelemetry libraries:
    • opentelemetry-api
    • opentelemetry-exporter-otlp
    • opentelemetry-sdk

In addition, can you share your playbook (without any sensitive details) to help with reproducing the error?

Thanks

v1v · May 13 '24 17:05

Some environment variables can help reduce the number of attributes that are stored per span/event. In addition, the existing implementation uses the BatchSpanProcessor, which can be configured with some further environment variables, as explained in the OpenTelemetry SDK documentation.
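
For reference, a minimal sketch of the standard SDK limit and BatchSpanProcessor variables (the names come from the OpenTelemetry specification; the values are purely illustrative). They would normally be exported in the shell before running ansible-playbook; they are shown as Python here only for consistency with the later snippets:

import os

# limit how much attribute/event data the SDK keeps per span
os.environ["OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT"] = "64"      # max attributes per span
os.environ["OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT"] = "4096"  # truncate long attribute values
os.environ["OTEL_SPAN_EVENT_COUNT_LIMIT"] = "64"          # max events per span

# make the BatchSpanProcessor export smaller, more frequent batches
os.environ["OTEL_BSP_SCHEDULE_DELAY"] = "1000"            # flush interval in milliseconds
os.environ["OTEL_BSP_MAX_QUEUE_SIZE"] = "512"             # spans buffered before dropping
os.environ["OTEL_BSP_MAX_EXPORT_BATCH_SIZE"] = "128"      # spans per export request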

Have you tried to configure those environment variables? Did you see any improvements?

v1v · May 13 '24 19:05

The playbook is rather simple.

playbook.yaml

- hosts: "localhost"
  vars:
    efficientip_system: "fqdn"
    efficientip_user: "user"
    efficientip_password: "pw"
  tasks:
    - name: Get IP address count for page numbers
      ansible.builtin.uri:
        url: "https://{{ efficientip_system }}/api/v2.0/ipam/address/count"
        validate_certs: no
        url_username: "{{ efficientip_user }}"
        url_password: "{{ efficientip_password }}"
      register: ipaddresses_count

    - name: Get all IP addresses
      ansible.builtin.include_tasks: efficientip_getpage_addresses.yaml
      with_sequence: start=0 end="{{ ipaddresses_count.json.data[0].count }}" stride=1000

efficientip_getpage_addresses.yaml

- name: "Get all IP addresses: {{ item }}"
  ansible.builtin.uri:
    url: "https://{{ efficientip_system }}/api/v2.0/ipam/address/list?limit=1000&offset={{ item }}"
    validate_certs: no
    url_username: "{{ efficientip_user }}"
    url_password: "{{ efficientip_password }}"
    dest: "ipaddresses_{{ efficientip_system }}_{{ item }}.json"

The problem is that we are collecting hundreds of MB of data from the endpoint, and basically all of the retrieved data ends up in the Ansible results object. Since each response is saved directly to disk, the plain playbook run itself is not a problem: there is no memory spike when running it without the opentelemetry callback plugin.

The problem is that the callback plugin collects all the attributes here even when ANSIBLE_OPENTELEMETRY_DISABLE_LOGS is enabled; the memory "leak" happens before any opentelemetry code is actually run.

https://github.com/ansible-collections/community.general/blob/7dd7cbdba8b5ca422fbe826cad63392fe1f50ee7/plugins/callback/opentelemetry.py#L330-L338

There also seems to be another issue in the actual opentelemetry code: all the spans are sent only after the whole playbook has run, which basically means everything is kept in memory until the end. This also causes our OTLP payloads to outgrow the 4 MB gRPC message limit, so there seems to be an issue with how the BatchProcessor is used as well.

Basically, we get this error whenever a playbook makes it to the end:

Transient error StatusCode.RESOURCE_EXHAUSTED encountered while exporting traces to otlp.mydomain.com, retrying in 1s.
Transient error StatusCode.RESOURCE_EXHAUSTED encountered while exporting traces to otlp.mydomain.com, retrying in 2s.
Transient error StatusCode.RESOURCE_EXHAUSTED encountered while exporting traces to otlp.mydomain.com, retrying in 4s.
Transient error StatusCode.RESOURCE_EXHAUSTED encountered while exporting traces to otlp.mydomain.com, retrying in 8s.
Transient error StatusCode.RESOURCE_EXHAUSTED encountered while exporting traces to otlp.mydomain.com, retrying in 16s.
Transient error StatusCode.RESOURCE_EXHAUSTED encountered while exporting traces to otlp.mydomain.com, retrying in 32s.

So I think the main issue is that the callback plugin collects everything until the end and doesn't free up memory after each Ansible task. This results in ever-growing memory consumption instead of a roughly constant footprint.

OneCyrus · May 14 '24 11:05

Thanks @OneCyrus for providing further details.

The problem is that the callback plugin collects all the attributes here even when ANSIBLE_OPENTELEMETRY_DISABLE_LOGS is enabled; the memory "leak" happens before any opentelemetry code is actually run.

IIUC, those attributes are not the problem; they are primitive values rather than complex data structures. For instance,

---
- name: playbook
  hosts: localhost
  connection: local
  tasks:
    - name: Get build data
      ansible.builtin.uri:
        url: "https://artifacts-api.elastic.co/v1/versions/8.14-SNAPSHOT/builds/8.14.0-db1ebed2"
        validate_certs: no

produced

    "attributes": {
        "ansible.task.module": "ansible.builtin.uri",
        "ansible.task.message": "success",
        "ansible.task.name": "[localhost] playbook: Get build data",
        "ansible.task.result": 0,
        "ansible.task.host.name": "localhost",
        "ansible.task.host.status": "ok",
        "ansible.task.args.name": [
            "url",
            "validate_certs"
        ],
        "ansible.task.args.value": [
            "https://artifacts-api.elastic.co/v1/versions/8.14-SNAPSHOT/builds/8.14.0-db1ebed2",
            "False"
        ],
        "http.url": "https://artifacts-api.elastic.co/v1/versions/8.14-SNAPSHOT/builds/8.14.0-db1ebed2"
    },

But the problem is the event that stores the _dump_results output, which contains the full content returned by the request.

[screenshot: span event containing the full _dump_results payload]

I'm trying to recall how this was implemented in https://github.com/ansible-collections/community.general/pull/4175

Back then I decided to call self._dump_results in each callback function and store the object in the TaskData. I don't know whether this could be problematic and whether it would be worth using a different approach, such as calling _dump_results only when needed; in fact, that's how the logstash and some other callback plugins work. As far as I can see, the output of self._dump_results looks like a good candidate, since the data structure is Ansible based and there is therefore no need to invent a new format; we can use the one from Ansible core.

It seems the _dump_results output is undoubtedly massive in your case.

all the spans are sent only after the whole playbook has run, which basically means everything is kept in memory until the end.

The BatchSpanProcessor can be configured to flush ended spans faster. I wonder whether configuring it to release the spans sooner, as explained in https://opentelemetry-python.readthedocs.io/en/latest/sdk/trace.export.html#opentelemetry.sdk.trace.export.BatchSpanProcessor, would help in your case (see the sketch below).
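
For illustration, the same knobs also exist as constructor arguments when the processor is created programmatically (a sketch; the values are only examples and this is not the plugin's actual setup):

from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# smaller, more frequent batches keep less span data queued in memory between exports
processor = BatchSpanProcessor(
    ConsoleSpanExporter(),        # any SpanExporter works here, e.g. the OTLP exporter
    max_queue_size=512,           # spans buffered before new ones are dropped
    schedule_delay_millis=1000,   # flush interval in milliseconds
    max_export_batch_size=128,    # spans per export request
    export_timeout_millis=30000,  # per-export timeout in milliseconds
)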

So I think the main issue is that the callback plugin collects everything until the end and doesn't free up memory after each Ansible task. This results in ever-growing memory consumption instead of a roughly constant footprint.

Yes, that's the current implementation: store data in memory to be processed afterwards and create the span/trace correlation. And it's not efficient, as you pointed out.

I need to figure out whether we can fix the underlying issue so that ANSIBLE_OPENTELEMETRY_DISABLE_LOGS is honoured and the huge dump is not stored in memory, while also working on improving the overall efficiency.

OTOH, we could potentially pass a set of denied attributes used to delete some keys from the _dump_results output, such as the one holding the response body. That's likely a good tradeoff, as it lets users configure which data gets processed. However, it might also introduce some complexity, since the structure of _dump_results needs to be known in order to reduce its keys.

In other words, if the event contains something like:

"events": [
    {
        "name": {
            "changed": false,
            "connection": "close",
            "content_security_policy": "object-src 'none'; script-src 'unsafe-inline' 'unsafe-eval' 'strict-dynamic' https: http:;",
            "content_type": "application/json;charset=utf-8",
            "cookies": {},
            "cookies_string": "",
            "date": "Wed, 15 May 2024 08:37:37 GMT",
            "elapsed": 0,
            "json": {
                "aliases": [
                    "7.17-SNAPSHOT",
                    "8.15.0-SNAPSHOT"
                ]
            },
            "last_modified": "Wed, 15 May 2024 08:33:34 UTC",
            "msg": "OK (unknown bytes)",
            "redirected": false,
            "status": 200,
            "strict_transport_security": "max-age=15724800; includeSubDomains",
            "transfer_encoding": "chunked",
            "url": "https://artifacts-api.elastic.co/v1/versions",
            "vary": "Accept-Encoding",
            "x_content_type_options": "nosniff",
            "x_download_options": "noopen",
            "x_frame_options": "DENY",
            "x_permitted_cross_domain_policies": "none",
            "x_xss_protection": "1; mode=block"
        },
        "timestamp": "2024-05-15T08:37:37.839467Z",
        "attributes": {}
    }
],

then a new option called exclude_attribute_logs could be added and set to name.json,name.url,name.changed,name.content_security_policy,name.content_type,name.connection, and the event would become something like this (a sketch of such dotted-key filtering follows the second example below):


"events": [
    {
        "name": {
            "cookies": {},
            "cookies_string": "",
            "elapsed": 0,
            "last_modified": "Wed, 15 May 2024 08:33:34 UTC",
            "msg": "OK (unknown bytes)",
            "redirected": false,
            "status": 200,
            "strict_transport_security": "max-age=15724800; includeSubDomains",
            "transfer_encoding": "chunked",
            "url": "https://artifacts-api.elastic.co/v1/versions",
            "vary": "Accept-Encoding",
            "x_content_type_options": "nosniff",
            "x_download_options": "noopen",
            "x_frame_options": "DENY",
            "x_permitted_cross_domain_policies": "none",
            "x_xss_protection": "1; mode=block"
        },
        "timestamp": "2024-05-15T08:37:37.839467Z",
        "attributes": {}
    }
],
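
A minimal sketch of how such dotted-key filtering could work; the exclude_attribute_logs option and the drop_keys helper below are hypothetical and not part of the current plugin:

def drop_keys(result, dotted_keys):
    """Remove dotted keys such as 'name.json' from a nested dict (hypothetical helper)."""
    for dotted in dotted_keys:
        parts = dotted.split(".")
        node = result
        # walk down to the parent of the key that should be deleted
        for part in parts[:-1]:
            node = node.get(part)
            if not isinstance(node, dict):
                break
        else:
            node.pop(parts[-1], None)
    return result

# a tiny payload shaped like the event above
dumped_result = {
    "name": {
        "json": {"aliases": ["7.17-SNAPSHOT", "8.15.0-SNAPSHOT"]},
        "url": "https://artifacts-api.elastic.co/v1/versions",
        "status": 200,
        "msg": "OK (unknown bytes)",
    }
}

excluded = "name.json,name.url".split(",")
print(drop_keys(dumped_result, excluded))  # only "status" and "msg" remain under "name"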

v1v · May 15 '24 08:05

Thanks for your analysis!

If you are interested in addressing the underlying issue, I guess the best approach would be to use the opentelemetry tracer directly and not collect each task into tasks_data. By using a global opentelemetry TracerProvider it should be possible to report a span after each task, so no further information about a task needs to be kept after its span has been created. In the end, generate_distributed_traces wouldn't be required at all.

Basically, register the TracerProvider in v2_playbook_on_start:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)

provider = TracerProvider()
processor = BatchSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)

# Sets the global default tracer provider
trace.set_tracer_provider(provider)

# Creates a tracer from the global tracer provider
tracer = trace.get_tracer("my.tracer.name")

https://opentelemetry.io/docs/languages/python/instrumentation/#acquire-tracer

Start a span in v2_playbook_on_task_start:

with tracer.start_as_current_span("span-name"):
    ...

When the task ends, get the current span and report the task result on it:

current_span = trace.get_current_span()
# enrich 'current_span' with some information
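
A rough sketch of how that could map onto the callback hooks with the standard opentelemetry-sdk API; the class below is an illustration of the idea, not the plugin's current implementation (exporter setup, failed/skipped tasks, and multiple hosts per task are deliberately left out):

from ansible.plugins.callback import CallbackBase
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter


class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'notification'
    CALLBACK_NAME = 'per_task_otel_sketch'  # hypothetical name

    def v2_playbook_on_start(self, playbook):
        # register the global tracer provider once per playbook run
        provider = TracerProvider()
        provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
        trace.set_tracer_provider(provider)
        self._tracer = trace.get_tracer(__name__)
        self._span = None

    def v2_playbook_on_task_start(self, task, is_conditional):
        # open a span when the task starts; nothing else is kept around
        self._span = self._tracer.start_span(task.get_name())

    def v2_runner_on_ok(self, result):
        # enrich and end the span as soon as the task finishes, so the
        # BatchSpanProcessor can export it and the memory can be freed
        if self._span is not None:
            self._span.set_attribute("ansible.task.host.name", result._host.get_name())
            self._span.set_attribute("ansible.task.host.status", "ok")
            self._span.end()
            self._span = None

A real implementation would also need to handle failures, skipped tasks, hosts running in parallel, and the existing plugin options, and it would flush the provider at the end of the playbook.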

OneCyrus · May 15 '24 12:05

Disabling the logs is now honoured and no longer produces the memory leak; the changes were merged recently, in case you want to give it a swing.

Thanks for your suggestion. It might require some analysis to double-check that it will work with the existing parameters.

v1v · May 21 '24 16:05

@v1v thanks! Once #8430 is released I can give this a try again and report back.

OneCyrus · Jun 03 '24 07:06