
Broken connections on Celery worker with logging enabled

Open mcboman opened this issue 6 months ago • 7 comments

How do you use Sentry?

Sentry Saas (sentry.io)

Version

2.29.1

Steps to Reproduce

Hi,

I'm running a Celery worker with 16 concurrent workers and recently enabled the experimental logging feature in the Python SDK. Everything works fine initially, but after a while I start seeing this error repeatedly:

[2025-06-08 08:23:14,183: WARNING/ForkPoolWorker-27] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2570)')': /api/4509207728947280/envelope/

This happens under high task load, and workers eventually become unresponsive. If I disable the Sentry logging integration, the problem disappears.

Setup details:

Celery 5.5.2

Python Sentry SDK versions tried: 2.28.0 and 2.29.1

Python version: 3.12.8

Worker command:

celery worker \
  --concurrency=16 \
  --loglevel=info \
  -Q q1,q2 \
  --max-memory-per-child=2621440 \
  --max-tasks-per-child=10 \
  --without-gossip \
  --without-mingle \
  --heartbeat-interval=10

Any idea what could be causing this SSL error with the logging feature?

Expected Result

No SSL errors

Actual Result


[2025-06-08 08:23:14,183: WARNING/ForkPoolWorker-27] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2570)')': /api/4509207728947280/envelope/

mcboman avatar Jun 09 '25 17:06 mcboman

Hey @mcboman, thanks for raising this.

To clarify, when you say

If I disable the Sentry logging integration, the problem disappears.

Do you mean you add LoggingIntegration to disabled_integrations in your init, or that you turn off/remove the experimental enable_logs option?
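
For reference, here's a rough sketch of the two options I mean (the DSN below is just a placeholder):

import sentry_sdk
from sentry_sdk.integrations.logging import LoggingIntegration

# Option A: disable the logging integration entirely
sentry_sdk.init(
    dsn="<dsn>",
    disabled_integrations=[LoggingIntegration()],
)

# Option B: keep the integration, but leave out / turn off the experimental enable_logs option
sentry_sdk.init(
    dsn="<dsn>",
    _experiments={"enable_logs": False},
)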

Before the SSL errors start popping up, is there any other error visible in the logs? I'm not sure where the SSL errors could be coming from -- my hunch is that it's a consequence of something else happening.

How many logs approximately are your workers sending to Sentry? If you adjust sentry_logs_level to a higher level (see below), does it change anything?

import logging

import sentry_sdk
from sentry_sdk.integrations.logging import LoggingIntegration

sentry_sdk.init(
    ...
    integrations=[LoggingIntegration(sentry_logs_level=logging.ERROR)],  # or maybe logging.WARNING; default is logging.INFO
)

sentrivana avatar Jun 11 '25 12:06 sentrivana

Hi @sentrivana

It is the experimental logging feature that I turn off, not the logging instrumentation. I have attached the code for my sentry_sdk.init below.

Peak log volume is 90k per half hour.

import sentry_sdk
from sentry_sdk.integrations.celery import CeleryIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration

sentry_sdk.init(
    dsn="<dsn>",
    send_default_pii=True,
    integrations=[
        CeleryIntegration(
            monitor_beat_tasks=True,
        ),
        SqlalchemyIntegration(),
    ],
    traces_sample_rate=0.1,
    profile_session_sample_rate=0.1,
    profile_lifecycle="trace",
    _experiments={
        "enable_logs": True,
    },
)

mcboman avatar Jun 12 '25 07:06 mcboman

Thanks for the information, we will investigate this

szokeasaurusrex avatar Jun 16 '25 08:06 szokeasaurusrex

We'll try to load test to see if we can reproduce this.

In the meantime, any additional information would be helpful. Some things I can think of atm:

  1. Can you see any other error prior to the SSL messages? This could help in pinpointing the initial issue.
  2. Does setting the level higher (sentry_logs_level, see the example above) help? Or does it maybe delay the issue? This could tell us whether this is a load problem or not.
  3. Where are you initializing Sentry? This might be useful info for reproducing the error.

sentrivana avatar Jun 16 '25 12:06 sentrivana

Wasn't able to get this to trigger in a load test.

In addition to the points above, I see you have profiling on, is there perhaps anything that stands out in the profiles?

(We can also take a look if you don't mind sharing your org (or feel free to send it to me at ivana.kellyer at sentry.io).)

sentrivana avatar Jun 17 '25 09:06 sentrivana


import os

import sentry_sdk
from celery import Celery
from sentry_sdk.integrations.celery import CeleryIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration

# setup_logging() and get_redis_url_for_celery() are app-level helpers
# defined elsewhere in this project.


def create_celery() -> Celery:
    """
    Create a new Celery object and set the broker and result backend from config.
    """
    sentry_mode = os.getenv("SENTRY_MODE", "prod")

    if sentry_mode != "dev":
        sentry_sdk.init(
            dsn="dsn",
            send_default_pii=True,
            integrations=[
                CeleryIntegration(
                    monitor_beat_tasks=True,
                ),
                SqlalchemyIntegration(),
            ],
            traces_sample_rate=0.1,
            profile_session_sample_rate=0.1,
            profile_lifecycle="trace",
        )
    setup_logging()
    celery = Celery(__name__)
    celery.conf.broker_url = get_redis_url_for_celery()
    celery.conf.result_backend = get_redis_url_for_celery()

    celery.conf.broker_connection_retry_on_startup = True
    celery.conf.update(
        {
            "worker_hijack_root_logger": False,
            "timezone": "Europe/Copenhagen",
            "enable_utc": True,
            "accept_content": ["json"],
            "task_serializer": "json",
            "result_serializer": "json",
            "worker_concurrency": 5,
            "worker_prefetch_multiplier": 1,
            "broker_connection_retry": True,
            "broker_pool_limit": 0,
            "broker_transport_options": {
                "max_connections": 1000,
                "visibility_timeout": 3600,
                "socket_timeout": 30,
                "socket_connect_timeout": 30,
                "socket_keepalive": True,
                "retry_on_timeout": True,
            },
            "task_acks_late": True,
        }
    )

    print("Celery timezone: ", celery.conf.timezone)
    return celery

This is how we initialize it.

Entrypoint for starting the container:

CELERY_OPTS_DEFAULT=(
  --concurrency=16
  --loglevel=info
  -Q q1,q2,q3,q4
  --max-memory-per-child=2621440
  --max-tasks-per-child=10
  --without-gossip
  --without-mingle
  --heartbeat-interval=10
)


exec celery -A "$CELERY_APP" worker "${CELERY_OPTS_ARRAY[@]}"

There is no other error prior to the exceptions, but I suspect the max-tasks-per-child restart of a worker might be what starts it all, since the errors don't appear until the workers have been running for some time.

mcboman avatar Jun 18 '25 06:06 mcboman

@mcboman could you please provide us with your org slug/ID or DSN? Thanks

szokeasaurusrex avatar Jun 23 '25 09:06 szokeasaurusrex

I want to give a little update.

I changed it to initialize Sentry in a @worker_process_init.connect handler instead of running it in the create_celery method.

It has been running for some days now without any problems.

I have updated the SDK to version >= 2.30.0.

The root cause might have been the worker restarting after 10 tasks, causing a bad cut-off.
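
Roughly, the change looks like this (just a sketch; the DSN and options are trimmed down to placeholders):

import sentry_sdk
from celery.signals import worker_process_init
from sentry_sdk.integrations.celery import CeleryIntegration

@worker_process_init.connect
def init_sentry(**kwargs):
    # Runs in each forked pool process, so every child gets its own
    # Sentry client and transport connection instead of inheriting one
    # from the parent process.
    sentry_sdk.init(
        dsn="<dsn>",
        integrations=[CeleryIntegration(monitor_beat_tasks=True)],
        traces_sample_rate=0.1,
        _experiments={"enable_logs": True},
    )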

mcboman avatar Jun 26 '25 17:06 mcboman

@mcboman glad to hear you were able to fix the problem!

szokeasaurusrex avatar Jun 27 '25 08:06 szokeasaurusrex