
T - Stream of "Refusing to deserialize disabled content of type pickle" errors on prod celery-workers

Open · escattone opened this issue 4 years ago · 10 comments

Summary: Each of the prod Celery workers logs a steady stream of these errors:

[ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)

even though all of the message queues are empty. We don't know why this is happening, and it seems to have been happening for at least the past few days (based on Papertrail).

escattone avatar Dec 13 '19 02:12 escattone

Yep, this is still happening. A lot. I've never seen this locally in docker-compose. In Papertrail it's happening over and over. It's happening so frequently in Celery that I wonder if something is stuck in an infinite loop and never getting out.

We have CELERY_RESULT_SERIALIZER = "pickle", CELERY_EVENT_SERIALIZER = "pickle", and CELERY_ACCEPT_CONTENT = ["pickle"], meaning we're 100% using pickle for everything between Django and Celery. So why would it have a beef with it? (Yes, I wish we were 100% JSON, but it is what it is with the history etc.)
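For reference, this is roughly what that configuration looks like as Django settings, using the old-style Celery setting names quoted above (a sketch, not kuma's actual settings file):

# Sketch based on the names quoted above; kuma's real settings.py may contain more.
CELERY_RESULT_SERIALIZER = "pickle"   # task results serialized with pickle
CELERY_EVENT_SERIALIZER = "pickle"    # worker/monitoring events serialized with pickle
CELERY_ACCEPT_CONTENT = ["pickle"]    # consumers should accept pickle-encoded messages
# The puzzle: the error says the worker treats pickle as *disabled*,
# even though CELERY_ACCEPT_CONTENT explicitly lists it.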

No idea where it could have come from then.

peterbe avatar Mar 20 '20 19:03 peterbe

Timebox: investigate for a day and reassess how difficult this is.

tobinmori avatar Mar 23 '20 17:03 tobinmori

It's definitely still happening: https://my.papertrailapp.com/groups/10501011/events?q=mdn-prod_celery-worker
It's not happening on stage: https://my.papertrailapp.com/groups/10501011/events?q=mdn-stage_celery-worker

@escattone Can we just drain the queue on prod celery?

Perhaps we can take this opportunity to use poetry to upgrade kombu, celery, billiard and all of those guys to the latest version (for django 2.11) and at least have that behind us.

peterbe avatar May 07 '20 14:05 peterbe

@escattone Can we just drain the queue on prod celery?

I tried that a while back, and no difference.

Perhaps we can take this opportunity to use poetry to upgrade kombu, celery, billiard and all of those guys to the latest version (for django 2.11) and at least have that behind us.

👍 I just did this via https://github.com/mdn/kuma/pull/7254, and unfortunately it also made no difference. Still good to do though!

escattone avatar Jun 20 '20 22:06 escattone

Deploying https://github.com/mdn/kuma/pull/7255 to see if that resolves this issue.

Nope, that didn't work either. Frustrating.

escattone avatar Jun 20 '20 22:06 escattone

Note-to-self: here's a direct link to Papertrail for the prod celery worker.

peterbe avatar Jun 22 '20 14:06 peterbe

@peterbe I was able to confirm that this issue also exists on stage, and is the same issue talked about in https://github.com/celery/celery/issues/5075. As soon as I increased the number of celery-worker pods from 1 to 2, the problem immediately appeared in the stage celery-worker papertrail logs:

Jun 22 16:22:09 mdn-mdn-stage-celery-worker-6cb58d7497-5jmhb celery-worker: [2020-06-22 16:22:08,405: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
Jun 22 16:22:11 mdn-mdn-stage-celery-worker-6cb58d7497-6kkxn celery-worker: [2020-06-22 16:22:09,911: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
Jun 22 16:22:11 mdn-mdn-stage-celery-worker-6cb58d7497-5jmhb celery-worker: [2020-06-22 16:22:10,407: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize) 
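For context on the wording itself: the log line comes from kombu's serialization registry, which refuses to load any message whose content type has been disabled on the consuming side. A minimal sketch (not kuma code, and assuming kombu's public dumps/loads/registry API) that produces the same message:

from kombu.exceptions import ContentDisallowed
from kombu.serialization import dumps, loads, registry

# dumps() returns (content_type, content_encoding, payload).
content_type, encoding, payload = dumps({"hello": "world"}, serializer="pickle")

registry.enable("pickle")
loads(payload, content_type, encoding)  # fine while pickle is enabled

registry.disable("pickle")
try:
    loads(payload, content_type, encoding)
except ContentDisallowed as exc:
    # Prints something like:
    # Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
    print(exc)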

escattone avatar Jun 22 '20 23:06 escattone

For the record, we figured out where this message is coming from. In a sense. Or rather, what's "causing" it: if you start multiple `celery ...` processes, you get this.

In docker-compose.yml, for example (screenshot: a second worker service, worker2, added alongside the existing worker):

Now you get a wall of these:

worker2_1        | [2020-06-22 18:24:59,518: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1         | [2020-06-22 18:24:59,655: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker2_1        | [2020-06-22 18:25:01,489: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1         | [2020-06-22 18:25:01,624: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker2_1        | [2020-06-22 18:25:03,494: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1         | [2020-06-22 18:25:03,631: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker2_1        | [2020-06-22 18:25:05,498: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1         | [2020-06-22 18:25:05,634: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker2_1        | [2020-06-22 18:25:07,500: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1         | [2020-06-22 18:25:07,636: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker2_1        | [2020-06-22 18:25:09,503: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1         | [2020-06-22 18:25:09,638: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker2_1        | [2020-06-22 18:25:11,506: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1         | [2020-06-22 18:25:11,641: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker2_1        | [2020-06-22 18:25:13,510: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1         | [2020-06-22 18:25:13,645: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)

So, if we just stop having multiple celery pods, we'll make this error go away. We're still not sure if it matters at all.

peterbe avatar Jun 23 '20 01:06 peterbe

We're starting only 1 celery beat but 4 workers. Presumably each with a concurrency of 4 (coincidence).

▶ (mozilla-mdn-maws-mdn-admin) kubectl -n mdn-prod get pods
NAME                                     READY   STATUS      RESTARTS   AGE
api-66f9968bfc-7rrfs                     1/1     Running     0          2d3h
api-66f9968bfc-dzxz6                     1/1     Running     0          2d3h
api-66f9968bfc-g6xwr                     1/1     Running     0          2d3h
api-66f9968bfc-xrmmw                     1/1     Running     0          2d3h
celery-beat-5945b97474-6kb2n             1/1     Running     0          5h48m
celery-worker-65fbf989b4-ddmm4           1/1     Running     0          7h29m
celery-worker-65fbf989b4-kh4mn           1/1     Running     0          7h29m
celery-worker-65fbf989b4-wntql           1/1     Running     0          7h29m
celery-worker-65fbf989b4-wzn5s           1/1     Running     0          7h29m
kumascript-58c7895899-l5zbm              1/1     Running     0          6h13m
kumascript-58c7895899-s85mq              1/1     Running     0          6h13m
kumascript-58c7895899-w4stw              1/1     Running     0          6h13m
kumascript-58c7895899-wv5t7              1/1     Running     0          6h13m
mdn-admin-84f778bffc-7s74k               1/1     Running     0          12d
mdn-db-migration-7qw9x                   0/1     Completed   0          2d3h
mdn-kuma-record-deployment-gkzcm         0/1     Completed   0          2d2h
mdn-kumascript-record-deployment-kwsqj   0/1     Completed   0          6h13m
ssr-59d98b564f-56wdp                     1/1     Running     0          2d3h
ssr-59d98b564f-dnjrt                     1/1     Running     0          2d3h
ssr-59d98b564f-n9dh5                     1/1     Running     0          2d3h
ssr-59d98b564f-qchkr                     1/1     Running     0          2d3h
ssr-59d98b564f-tt8j7                     1/1     Running     0          2d3h
web-546785d6f7-5nmc2                     1/1     Running     0          2d3h
web-546785d6f7-6ljrt                     1/1     Running     0          2d3h
web-546785d6f7-9d5bv                     1/1     Running     0          2d3h
web-546785d6f7-bfj8f                     1/1     Running     0          2d3h
web-546785d6f7-bnzhx                     1/1     Running     0          2d3h
web-546785d6f7-dsnxc                     1/1     Running     0          2d3h
web-546785d6f7-hf5hw                     1/1     Running     0          2d3h
web-546785d6f7-pnwxr                     1/1     Running     0          2d3h
web-546785d6f7-sw7x4                     1/1     Running     0          2d3h
web-546785d6f7-tbm5t                     1/1     Running     0          2d2h

What if we have the wrong concurrency? What if we reduce the number of workers to a smaller number, like 1?
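For illustration, the knobs being discussed, expressed as old-style Celery settings (hypothetical; in the actual deployment these are more likely set on the worker command line or in the Kubernetes manifest, which isn't shown here):

# Hypothetical illustration only; not taken from kuma's settings.
CELERYD_CONCURRENCY = 4  # worker processes per pod (presumed above, not confirmed)
# The other lever mentioned above is the number of celery-worker pod replicas
# in the Kubernetes Deployment, e.g. reducing it from 4 down to 1.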

By the way, I can simulate the problem on my laptop with a temporary hack to docker-compose.yml. But when I started 3,000 little sample tasks that all just did a sleep(random.random()), all 3,000 of them eventually ran.
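A rough sketch of that experiment (the task name and the enqueue loop here are mine, not kuma's actual code):

import random
import time

from celery import shared_task


@shared_task
def sample_task():
    # Each task just sleeps for a random fraction of a second.
    time.sleep(random.random())


def enqueue_samples(n=3000):
    # Fire off n tasks. With two workers running locally, the
    # "Refusing to deserialize disabled content" errors keep streaming
    # into the logs, yet all n tasks still complete.
    for _ in range(n):
        sample_task.delay()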

So it could just be a false alarm, i.e. this doesn't actually matter.

peterbe avatar Jun 23 '20 01:06 peterbe

This will decrease greatly in importance when Yari is released.

tobinmori avatar Jul 13 '20 19:07 tobinmori