Stream of "Refusing to deserialize disabled content of type pickle" errors on prod celery-workers
Summary

Each of the prod Celery workers logs a steady stream of these errors:
ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
even though all of the message queues are empty. We don't know why this is happening, and it seems to have been happening for at least the past few days (based on Papertrail).
Yep, this is still happening. A lot. I've never seen this locally in docker-compose. In Papertrail it's happening over and over, so often that I wonder if something in Celery is stuck in an infinite loop it never gets out of.
We have:

CELERY_RESULT_SERIALIZER = "pickle"
CELERY_EVENT_SERIALIZER = "pickle"
CELERY_ACCEPT_CONTENT = ["pickle"]

meaning we're 100% using pickle for everything between Django and Celery. So why would it have a beef with it?
(Yes, I wish we were 100% JSON, but it is what it is with the history etc.)
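For context on what the error text itself means: it comes from kombu, which refuses to deserialize any message whose content type is disabled on the consuming side, no matter what the producer used. Here's a minimal sketch (plain kombu, nothing kuma-specific; the payload and the explicit disable call are just for illustration) that reproduces the exact message:

```python
# Minimal sketch: reproduce kombu's "Refusing to deserialize disabled
# content" error outside of Celery. The payload and the explicit
# disable() call are illustrative, not kuma code.
import pickle

from kombu.exceptions import ContentDisallowed
from kombu.serialization import registry

payload = pickle.dumps({"hello": "world"})

# Mark pickle as a disabled content type, the way a consumer treats any
# serializer that is not in its accepted content types.
registry.disable("pickle")

try:
    registry.loads(payload, "application/x-python-serialize", "binary")
except ContentDisallowed as exc:
    # Prints: Refusing to deserialize disabled content of type pickle
    # (application/x-python-serialize)
    print(exc)
```

So whatever channel these messages arrive on, the consumer handling them apparently does not have pickle in its accepted content types, despite the settings above.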
No idea where it could have come from then.
Timebox: investigate for a day and reassess how difficult this is.
It's definitely still happening: https://my.papertrailapp.com/groups/10501011/events?q=mdn-prod_celery-worker
It's not happening on stage: https://my.papertrailapp.com/groups/10501011/events?q=mdn-stage_celery-worker
@escattone Can we just drain the queue on prod celery?
Perhaps we can take this opportunity to use poetry to upgrade kombu, celery, billiard and all of those guys to the latest version (for Django 2.11) and at least have that behind us.
> @escattone Can we just drain the queue on prod celery?

I tried that a while back, and it made no difference.

> Perhaps we can take this opportunity to use poetry to upgrade kombu, celery, billiard and all of those guys to the latest version (for Django 2.11) and at least have that behind us.

👍 I just did this via https://github.com/mdn/kuma/pull/7254, and unfortunately it also made no difference. Still good to do, though!
Deploying https://github.com/mdn/kuma/pull/7255 to see if that resolves this issue.
Nope, that didn't work either. Frustrating.
Note-to-self: here's a direct link to Papertrail for the prod celery worker.
@peterbe I was able to confirm that this issue also exists on stage, and is the same issue talked about in https://github.com/celery/celery/issues/5075. As soon as I increased the number of celery-worker pods from 1 to 2, the problem immediately appeared in the stage celery-worker Papertrail logs:
Jun 22 16:22:09 mdn-mdn-stage-celery-worker-6cb58d7497-5jmhb celery-worker: [2020-06-22 16:22:08,405: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
Jun 22 16:22:11 mdn-mdn-stage-celery-worker-6cb58d7497-6kkxn celery-worker: [2020-06-22 16:22:09,911: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
Jun 22 16:22:11 mdn-mdn-stage-celery-worker-6cb58d7497-5jmhb celery-worker: [2020-06-22 16:22:10,407: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
For the record, we figured out where this message is coming from. In a sense. Or rather, what's "causing it": if you start multiple celery workers, you get this. In docker-compose.yml, for example, add a second worker service (worker2) next to the existing worker, and now you get a wall of these:
worker2_1 | [2020-06-22 18:24:59,518: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1 | [2020-06-22 18:24:59,655: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker2_1 | [2020-06-22 18:25:01,489: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1 | [2020-06-22 18:25:01,624: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker2_1 | [2020-06-22 18:25:03,494: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1 | [2020-06-22 18:25:03,631: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker2_1 | [2020-06-22 18:25:05,498: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1 | [2020-06-22 18:25:05,634: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker2_1 | [2020-06-22 18:25:07,500: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1 | [2020-06-22 18:25:07,636: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker2_1 | [2020-06-22 18:25:09,503: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1 | [2020-06-22 18:25:09,638: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker2_1 | [2020-06-22 18:25:11,506: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1 | [2020-06-22 18:25:11,641: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker2_1 | [2020-06-22 18:25:13,510: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
worker_1 | [2020-06-22 18:25:13,645: ERROR/MainProcess] Refusing to deserialize disabled content of type pickle (application/x-python-serialize)
So, if we just stop having multiple celery pods, we'll make this error go away. We're still not sure if it matters at all.
We're starting only 1 celery-beat pod but 4 worker pods, presumably each with a concurrency of 4 (coincidentally the same number).
▶ (mozilla-mdn-maws-mdn-admin) kubectl -n mdn-prod get pods
NAME READY STATUS RESTARTS AGE
api-66f9968bfc-7rrfs 1/1 Running 0 2d3h
api-66f9968bfc-dzxz6 1/1 Running 0 2d3h
api-66f9968bfc-g6xwr 1/1 Running 0 2d3h
api-66f9968bfc-xrmmw 1/1 Running 0 2d3h
celery-beat-5945b97474-6kb2n 1/1 Running 0 5h48m
celery-worker-65fbf989b4-ddmm4 1/1 Running 0 7h29m
celery-worker-65fbf989b4-kh4mn 1/1 Running 0 7h29m
celery-worker-65fbf989b4-wntql 1/1 Running 0 7h29m
celery-worker-65fbf989b4-wzn5s 1/1 Running 0 7h29m
kumascript-58c7895899-l5zbm 1/1 Running 0 6h13m
kumascript-58c7895899-s85mq 1/1 Running 0 6h13m
kumascript-58c7895899-w4stw 1/1 Running 0 6h13m
kumascript-58c7895899-wv5t7 1/1 Running 0 6h13m
mdn-admin-84f778bffc-7s74k 1/1 Running 0 12d
mdn-db-migration-7qw9x 0/1 Completed 0 2d3h
mdn-kuma-record-deployment-gkzcm 0/1 Completed 0 2d2h
mdn-kumascript-record-deployment-kwsqj 0/1 Completed 0 6h13m
ssr-59d98b564f-56wdp 1/1 Running 0 2d3h
ssr-59d98b564f-dnjrt 1/1 Running 0 2d3h
ssr-59d98b564f-n9dh5 1/1 Running 0 2d3h
ssr-59d98b564f-qchkr 1/1 Running 0 2d3h
ssr-59d98b564f-tt8j7 1/1 Running 0 2d3h
web-546785d6f7-5nmc2 1/1 Running 0 2d3h
web-546785d6f7-6ljrt 1/1 Running 0 2d3h
web-546785d6f7-9d5bv 1/1 Running 0 2d3h
web-546785d6f7-bfj8f 1/1 Running 0 2d3h
web-546785d6f7-bnzhx 1/1 Running 0 2d3h
web-546785d6f7-dsnxc 1/1 Running 0 2d3h
web-546785d6f7-hf5hw 1/1 Running 0 2d3h
web-546785d6f7-pnwxr 1/1 Running 0 2d3h
web-546785d6f7-sw7x4 1/1 Running 0 2d3h
web-546785d6f7-tbm5t 1/1 Running 0 2d2h
What if we have the wrong concurrency setting? What if we reduce the number of workers to a smaller number, like 1?
By the way, I can simulate the problem on my laptop with a temporary hack to docker-compose.yml, but I started 3,000 little sample tasks that all just did a sleep(random.random()), and all 3,000 of them eventually ran.
So it could just be a false alarm, i.e. this doesn't actually matter.
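For reference, the throwaway load test was along these lines; the Celery app, broker/backend URLs, and task below are hypothetical placeholders rather than kuma's actual task code:

```python
# Hypothetical sketch of the laptop experiment: fire a few thousand
# trivial tasks and confirm they all complete even while the workers
# spew the "Refusing to deserialize" errors. Broker/backend URLs and
# names are placeholders, not kuma's real configuration.
import random
import time

from celery import Celery

app = Celery(
    "sample",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)


@app.task
def nap():
    time.sleep(random.random())
    return "done"


if __name__ == "__main__":
    results = [nap.delay() for _ in range(3000)]
    # Wait for every task; if all 3,000 return, no work was actually lost.
    completed = sum(1 for r in results if r.get(timeout=120) == "done")
    print(f"{completed} of {len(results)} tasks completed")
```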
This will decrease greatly in importance when Yari is released.