Memory leak in Prefect server?

Open • mattiamatrix opened this issue Jul 29 '25 • 40 comments

Bug summary

Hello Prefect! I think you can clearly see in the memory chart below when I migrated to Prefect 3.

This is the memory usage of my Prefect 3 server running on an AWS ECS task. The instance has 2 vCPU and 4 GB of memory.

Image

Version info

Version:             3.4.9
API version:         0.8.4
Python version:      3.12.8
Git commit:          1001c54f
Built:               Thu, Jul 17, 2025 09:48 PM
OS/Arch:             darwin/arm64
Server type:         server
Pydantic version:    2.11.7
Integrations:
  prefect-aws:       0.5.12
  prefect-gcp:       0.6.8
  prefect-sqlalchemy: 0.5.3

Additional context

I am sorry to say this, but the Prefect 3 server is pretty unstable, and I am not sure I would consider it production-ready. Prefect 2 was a gem in comparison.

mattiamatrix avatar Jul 29 '25 15:07 mattiamatrix

Hey @mattiamatrix! The events system we added in Prefect 3 runs in memory by default and could be causing the behavior you're seeing. I recommend looking at our guide for scaling a self-hosted Prefect server. In particular, running a Redis server for event messaging should help a lot. Let me know if you continue seeing elevated memory usage after adding Redis.

desertaxle avatar Jul 30 '25 00:07 desertaxle

I would not recommend enabling Redis event messaging in a multi-server setup right now. We just tried it and ran into high CPU load due to a lot of temporary server instances being spawned inside each server pod. Disabling Redis fixed that behavior again. I think this might be related to https://github.com/PrefectHQ/prefect/issues/18654

AlexanderBabel avatar Aug 07 '25 12:08 AlexanderBabel

@AlexanderBabel Can you elaborate on the temporary server instances you're seeing? That shouldn't be happening, and any additional info you can share will help me track down the source of the issue.

desertaxle avatar Aug 07 '25 12:08 desertaxle

We deployed Redis and added the following environment variables:

global:
  prefect:
    env:
      - name: PREFECT_SERVER_ALLOW_EPHEMERAL_MODE
        value: "False"
      - name: PREFECT_MESSAGING_BROKER
        value: prefect_redis.messaging
      - name: PREFECT_MESSAGING_CACHE
        value: prefect_redis.messaging
      - name: PREFECT_REDIS_MESSAGING_HOST
        value: prefect-redis-master.3p-prefect.svc.cluster.local
      - name: PREFECT_REDIS_MESSAGING_PORT
        value: "6379"
      - name: PREFECT_REDIS_MESSAGING_DB
        value: "0"
      - name: PREFECT_REDIS_MESSAGING_PASSWORD
        valueFrom:
          secretKeyRef:
            name: prefect-redis-secret
            key: redis-password

After analyzing the logs and metrics, I saw that we hit high load on the database, which pushed it into recovery mode.

Finally, the prefect-server pods had trouble regaining access to the Redis queue, as you can see in the attached logs.

pod.log

AlexanderBabel avatar Aug 07 '25 12:08 AlexanderBabel

Hi, this looks pretty much like our experience in #18654. I'm not sure whether multiple prefect server processes were running in one pod container, but CPU usage definitely went above 1 core, so I'm guessing additional processes or threads were spawned (not necessarily prefect-server). Listing the processes in a container shows only the entrypoint on PID 1 and the prefect server started by the entrypoint script on PID 6 or 7, so it was possibly threads.

criskurtin avatar Aug 07 '25 16:08 criskurtin

Hi,

I deployed the new version yesterday and was able to activate Redis messaging again. Our setup uses drastically less RAM now. Thanks to the team for the quick responses and for fixing the issue. Highly appreciated!

Image

AlexanderBabel avatar Aug 12 '25 09:08 AlexanderBabel

Thank you for the suggestion @desertaxle.

I think it would be appreciated if the Prefect team could dig into the actual issue that's causing this memory leak in the Prefect 3 server, as some might not be interested in "scaling self-hosted Prefect" or having to introduce more infrastructure like a Redis instance.

Thank you.

mattiamatrix avatar Aug 13 '25 15:08 mattiamatrix

@mattiamatrix I may have fixed the issue in https://github.com/PrefectHQ/prefect/pull/18679 which was released with 3.4.12. Can you try running that version in standalone mode and see if you still see an issue with memory usage? If you're still seeing the issue, I'll investigate further.

desertaxle avatar Aug 13 '25 15:08 desertaxle

👍 I upgraded to 3.4.12 this morning, I'll update you in a couple of days 🤞

mattiamatrix avatar Aug 13 '25 15:08 mattiamatrix

Hi @AlexanderBabel, is your Redis instance's memory doing well? Are you not running into the same problem as in https://github.com/PrefectHQ/prefect/issues/18654#issuecomment-3185809164?

lucasbelo777 avatar Aug 13 '25 21:08 lucasbelo777

Hi @lucasbelo777,

It looks like the memory leak has now moved to Redis instead.

Image Image

AlexanderBabel avatar Aug 13 '25 21:08 AlexanderBabel

@AlexanderBabel there should be a fix for that in #18642, which will go out in the next prefect-redis release.

desertaxle avatar Aug 13 '25 23:08 desertaxle

@desertaxle I let it run for a few days, but sadly, I see no difference.

Image

mattiamatrix avatar Aug 18 '25 15:08 mattiamatrix

Thanks for reporting back @mattiamatrix! Since others are seeing reasonable memory usage when using Redis for the messaging layer, I'll dig into our in-memory messaging layer and see if I can find any places where we're holding onto messages longer than we should.

desertaxle avatar Aug 18 '25 16:08 desertaxle
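
For anyone who wants to help chase down where memory is being retained, the standard library's tracemalloc is a generic way to see what a Python process is holding onto. This is only a sketch, not Prefect-specific tooling; it assumes you can attach it at server startup or run the server under a small harness.

# Sketch: start tracemalloc, let the server process events for a while, then
# print the top allocation sites by retained size. Standard library only.
import tracemalloc

tracemalloc.start(25)  # keep up to 25 stack frames per allocation

# ... let the server run and process events for a while ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)  # prints file:line, total retained size, allocation count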

@desertaxle Thanks for pointing out the fix. We deployed the .14 version yesterday and saw a massive drop in memory usage on our Redis instance. Thanks again for pushing out fixes so quickly!

Image

AlexanderBabel avatar Aug 22 '25 11:08 AlexanderBabel

I'm experiencing the Redis memory leak as well:

Image

marcm-ml avatar Aug 25 '25 16:08 marcm-ml

I'm experiencing the Redis memory leak too. Memory keeps growing until it maxes out and AKS kills the pod. The pod then tries to restart but fails, because it loads everything back into memory, which causes a complete outage.

Any guidance on configuring the Redis server so that it manages its own memory better? What exactly could be causing it to hold onto memory?

maitlandmarshall avatar Sep 11 '25 21:09 maitlandmarshall
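
On the question above about making Redis manage its own memory: Redis can be given a hard memory cap and an eviction policy, but that only bounds the symptom, since anything evicted (including Prefect's messaging data) is lost; the actual fix discussed earlier in this thread was the prefect-redis change in #18642. A minimal sketch with redis-py, where the host, port, and cap are placeholders:

import itertools
import redis

# Placeholder connection details: point this at the Redis instance used for
# Prefect messaging.
r = redis.Redis(host="prefect-redis-master", port=6379, db=0)

# Inspect what is actually using memory before capping anything.
print(r.info("memory")["used_memory_human"])
for key in itertools.islice(r.scan_iter(count=100), 50):
    print(key, r.memory_usage(key))

# Stopgap only: cap memory and evict least-recently-used keys at the limit.
# The default policy, "noeviction", makes writes fail at the cap instead.
r.config_set("maxmemory", "2gb")
r.config_set("maxmemory-policy", "allkeys-lru")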

We also have a memory leak running Prefect 3.4.14...

timoooo avatar Sep 22 '25 08:09 timoooo

We are also observing the memory leak running 3.4.8 without Redis.

khu-taproot avatar Sep 25 '25 13:09 khu-taproot

It seems to be fixed with 3.4.19

timoooo avatar Sep 25 '25 13:09 timoooo

I have been running 3.4.19 for a few days, and nothing has changed. I didn't see anything related to this issue mentioned in the release changelog, so I didn't really expect any improvement.

Image

As a reminder, I opened this issue specifically about the Prefect server without Redis, because I do not want to add a Redis server that would increase my AWS costs for little to no reason. Prefect 2 worked perfectly well on this front.

In fact, @desertaxle, this seems to be related to the new "events" system. Is it possible to disable it? I'm not sure what the experience of other people here is, but the new Event Feed page at <prefect-url>/events is quite limited, as I'm unable to load more than 1 hour's worth of events.

mattiamatrix avatar Sep 25 '25 14:09 mattiamatrix

@mattiamatrix Can confirm we still have memory leak :/

timoooo avatar Sep 29 '25 08:09 timoooo

Also have memory leak in version 3.4.22

Image

Adviser-ua avatar Oct 09 '25 11:10 Adviser-ua

I can confirm that nothing has changed with 3.4.22! It's becoming frustrating.

Image

@desertaxle, could the Python version have any impact? I am currently using the image prefecthq/prefect:3.4.22-python3.12.

mattiamatrix avatar Oct 09 '25 12:10 mattiamatrix

I think I've narrowed down the issue to some consumers either not being able to keep up with the volume of events or crashing and not consuming messages.

In https://github.com/PrefectHQ/prefect/pull/19136, I put a cap on the size of queues for the in-memory messaging implementation. That should prevent runaway memory growth, but it will result in dropped messages if consumers aren't keeping up. There will be warning logs if messages are dropped, which should help us track down which consumer(s) are causing this issue.

If you see a warning log like Subscription queue is full, dropping message for topic=%r after upgrading to 3.4.24 (not yet released), please post it here to help with troubleshooting.

desertaxle avatar Oct 10 '25 15:10 desertaxle
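
To make the trade-off described above concrete, here is a simplified, illustrative bounded-queue sketch; it is not Prefect's actual implementation, but it shows the behavior: when a subscriber's queue is full, new messages are dropped and a warning naming the topic is logged instead of letting memory grow without bound.

import asyncio
import logging

logger = logging.getLogger(__name__)


class BoundedSubscription:
    """Toy per-subscriber queue illustrating the capped-queue approach."""

    def __init__(self, topic: str, maxsize: int = 10_000):
        self.topic = topic
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)

    def publish(self, message: dict) -> None:
        try:
            self.queue.put_nowait(message)
        except asyncio.QueueFull:
            # Memory stays bounded, but a slow or crashed consumer loses
            # messages, which is why the warning identifies the topic.
            logger.warning(
                "Subscription queue is full, dropping message for topic=%r",
                self.topic,
            )

    async def consume(self) -> dict:
        return await self.queue.get()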

@desertaxle, I have been running 3.4.24 for a couple of days, and I see no improvement. I also don't see the warning log you mentioned in your last message. 😮‍💨

mattiamatrix avatar Oct 23 '25 16:10 mattiamatrix

@desertaxle Same here. 3.4.24 still causes the memory leak.

timoooo avatar Oct 27 '25 09:10 timoooo

We have set up Redis for the messaging layer and updated to version 3.4.24. Although Redis memory was doing well, the problem very clearly persists in the server memory. We have had this problem since we moved to Prefect 3. It is actually very frustrating.

Image

Question: why do these settings have no effect on the memory leak?

PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=false
PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED=false

msa980 avatar Nov 11 '25 14:11 msa980

@msa980, are you saying that adding Redis did not fix the memory leak? I do not want to add another piece of infrastructure, but I was close to testing that as a last resort.

For the record, the memory leak is still present in 3.5.0.

@desertaxle @zzstoatzz, is there anything on your side that could help?

mattiamatrix avatar Nov 11 '25 14:11 mattiamatrix

@mattiamatrix exactly, Redis did not solve the problem. Memory consumption is growing at the same pace as before.

msa980 avatar Nov 11 '25 14:11 msa980