`dagster-webserver` memory leak

Open aaaaahaaaaa opened this issue 1 year ago • 32 comments

Dagster version

1.5.13

What's the issue?

dagster-webserver 1.5.13 seems to have some kind of memory leak. Since we updated to that version, we can observe a steady increase in memory usage over the last couple of weeks.

  • The increase in memory usage correlates to the change of version, without any other change being introduced.
  • We observe the same behaviour on 2 different GKE clusters.
  • Reverting to 1.5.12 resolves the issue.

[two images]

What did you expect to happen?

No response

How to reproduce?

No response

Deployment type

Dagster Helm chart

Deployment details

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

aaaaahaaaaa avatar Jan 03 '24 16:01 aaaaahaaaaa

I don't see any notable commits in 1.5.13 on initial inspection

Reverting to 1.5.12 resolves the issue.

How exactly did you do this? Can you report the Python environments in the two containers (pip list / pip freeze)? I'm trying to discern whether it's possible that the leak comes from a dependency that also changed between the two container images.

alangenfeld avatar Jan 03 '24 17:01 alangenfeld

How exactly did you do this?

We changed the helm chart version. We literally just reverted the Renovate bot commit.

1.5.12

pip list

Package                     Version
--------------------------- ------------
alembic                     1.13.0
amqp                        5.2.0
aniso8601                   9.0.1
annotated-types             0.6.0
anyio                       4.1.0
async-timeout               4.0.3
azure-core                  1.29.5
azure-identity              1.15.0
azure-storage-blob          12.19.0
azure-storage-file-datalake 12.14.0
backoff                     2.2.1
billiard                    4.2.0
boto3                       1.33.12
botocore                    1.33.12
cachetools                  5.3.2
celery                      5.3.6
certifi                     2023.11.17
cffi                        1.16.0
charset-normalizer          3.3.2
click                       8.1.7
click-didyoumean            0.3.0
click-plugins               1.1.1
click-repl                  0.3.0
coloredlogs                 14.0
croniter                    2.0.1
cryptography                41.0.7
dagster                     1.5.12
dagster-aws                 0.21.12
dagster-azure               0.21.12
dagster-celery              0.21.12
dagster-celery-k8s          0.21.12
dagster-gcp                 0.21.12
dagster-graphql             1.5.12
dagster-k8s                 0.21.12
dagster-pandas              0.21.12
dagster-pipes               1.5.12
dagster-postgres            0.21.12
dagster-webserver           1.5.12
db-dtypes                   1.1.1
docstring-parser            0.15
exceptiongroup              1.2.0
flower                      2.0.1
fsspec                      2023.12.2
google-api-core             2.15.0
google-api-python-client    2.110.0
google-auth                 2.25.2
google-auth-httplib2        0.1.1
google-cloud-bigquery       3.13.0
google-cloud-core           2.4.1
google-cloud-storage        2.13.0
google-crc32c               1.5.0
google-resumable-media      2.6.0
googleapis-common-protos    1.62.0
gql                         3.4.1
graphene                    3.3
graphql-core                3.2.3
graphql-relay               3.2.0
greenlet                    3.0.2
grpcio                      1.60.0
grpcio-health-checking      1.60.0
grpcio-status               1.60.0
h11                         0.14.0
httplib2                    0.22.0
httptools                   0.6.1
humanfriendly               10.0
humanize                    4.9.0
idna                        3.6
isodate                     0.6.1
Jinja2                      3.1.2
jmespath                    1.0.1
kombu                       5.3.4
kubernetes                  28.1.0
Mako                        1.3.0
MarkupSafe                  2.1.3
msal                        1.26.0
msal-extensions             1.1.0
multidict                   6.0.4
numpy                       1.26.2
oauth2client                4.1.3
oauthlib                    3.2.2
packaging                   23.2
pandas                      2.1.4
pendulum                    2.1.2
pip                         23.0.1
portalocker                 2.8.2
prometheus-client           0.19.0
prompt-toolkit              3.0.41
proto-plus                  1.23.0
protobuf                    4.25.1
psycopg2-binary             2.9.9
pyarrow                     14.0.1
pyasn1                      0.5.1
pyasn1-modules              0.3.0
pycparser                   2.21
pydantic                    2.5.2
pydantic_core               2.14.5
PyJWT                       2.8.0
pyparsing                   3.1.1
python-dateutil             2.8.2
python-dotenv               1.0.0
pytz                        2023.3.post1
pytzdata                    2020.1
PyYAML                      6.0.1
redis                       5.0.1
requests                    2.31.0
requests-oauthlib           1.3.1
requests-toolbelt           0.10.1
rsa                         4.9
s3transfer                  0.8.2
setuptools                  65.5.1
six                         1.16.0
sniffio                     1.3.0
SQLAlchemy                  2.0.23
starlette                   0.33.0
tabulate                    0.9.0
tomli                       2.0.1
toposort                    1.10
tornado                     6.4
tqdm                        4.66.1
typing_extensions           4.9.0
tzdata                      2023.3
universal-pathlib           0.1.4
uritemplate                 4.1.1
urllib3                     1.26.18
uvicorn                     0.24.0.post1
uvloop                      0.19.0
vine                        5.1.0
watchdog                    3.0.0
watchfiles                  0.21.0
wcwidth                     0.2.12
websocket-client            1.7.0
websockets                  12.0
wheel                       0.42.0
yarl                        1.9.4

pip freeze

alembic==1.13.0
amqp==5.2.0
aniso8601==9.0.1
annotated-types==0.6.0
anyio==4.1.0
async-timeout==4.0.3
azure-core==1.29.5
azure-identity==1.15.0
azure-storage-blob==12.19.0
azure-storage-file-datalake==12.14.0
backoff==2.2.1
billiard==4.2.0
boto3==1.33.12
botocore==1.33.12
cachetools==5.3.2
celery==5.3.6
certifi==2023.11.17
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.3.0
coloredlogs==14.0
croniter==2.0.1
cryptography==41.0.7
dagster==1.5.12
dagster-aws==0.21.12
dagster-azure==0.21.12
dagster-celery==0.21.12
dagster-celery-k8s==0.21.12
dagster-gcp==0.21.12
dagster-graphql==1.5.12
dagster-k8s==0.21.12
dagster-pandas==0.21.12
dagster-pipes==1.5.12
dagster-postgres==0.21.12
dagster-webserver==1.5.12
db-dtypes==1.1.1
docstring-parser==0.15
exceptiongroup==1.2.0
flower==2.0.1
fsspec==2023.12.2
google-api-core==2.15.0
google-api-python-client==2.110.0
google-auth==2.25.2
google-auth-httplib2==0.1.1
google-cloud-bigquery==3.13.0
google-cloud-core==2.4.1
google-cloud-storage==2.13.0
google-crc32c==1.5.0
google-resumable-media==2.6.0
googleapis-common-protos==1.62.0
gql==3.4.1
graphene==3.3
graphql-core==3.2.3
graphql-relay==3.2.0
greenlet==3.0.2
grpcio==1.60.0
grpcio-health-checking==1.60.0
grpcio-status==1.60.0
h11==0.14.0
httplib2==0.22.0
httptools==0.6.1
humanfriendly==10.0
humanize==4.9.0
idna==3.6
isodate==0.6.1
Jinja2==3.1.2
jmespath==1.0.1
kombu==5.3.4
kubernetes==28.1.0
Mako==1.3.0
MarkupSafe==2.1.3
msal==1.26.0
msal-extensions==1.1.0
multidict==6.0.4
numpy==1.26.2
oauth2client==4.1.3
oauthlib==3.2.2
packaging==23.2
pandas==2.1.4
pendulum==2.1.2
portalocker==2.8.2
prometheus-client==0.19.0
prompt-toolkit==3.0.41
proto-plus==1.23.0
protobuf==4.25.1
psycopg2-binary==2.9.9
pyarrow==14.0.1
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==2.5.2
pydantic_core==2.14.5
PyJWT==2.8.0
pyparsing==3.1.1
python-dateutil==2.8.2
python-dotenv==1.0.0
pytz==2023.3.post1
pytzdata==2020.1
PyYAML==6.0.1
redis==5.0.1
requests==2.31.0
requests-oauthlib==1.3.1
requests-toolbelt==0.10.1
rsa==4.9
s3transfer==0.8.2
six==1.16.0
sniffio==1.3.0
SQLAlchemy==2.0.23
starlette==0.33.0
tabulate==0.9.0
tomli==2.0.1
toposort==1.10
tornado==6.4
tqdm==4.66.1
typing_extensions==4.9.0
tzdata==2023.3
universal-pathlib==0.1.4
uritemplate==4.1.1
urllib3==1.26.18
uvicorn==0.24.0.post1
uvloop==0.19.0
vine==5.1.0
watchdog==3.0.0
watchfiles==0.21.0
wcwidth==0.2.12
websocket-client==1.7.0
websockets==12.0
yarl==1.9.4

1.5.13

pip list

Package                     Version
--------------------------- ------------
alembic                     1.13.0
amqp                        5.2.0
aniso8601                   9.0.1
annotated-types             0.6.0
anyio                       4.1.0
async-timeout               4.0.3
azure-core                  1.29.5
azure-identity              1.15.0
azure-storage-blob          12.19.0
azure-storage-file-datalake 12.14.0
backoff                     2.2.1
billiard                    4.2.0
boto3                       1.34.0
botocore                    1.34.0
cachetools                  5.3.2
celery                      5.3.6
certifi                     2023.11.17
cffi                        1.16.0
charset-normalizer          3.3.2
click                       8.1.7
click-didyoumean            0.3.0
click-plugins               1.1.1
click-repl                  0.3.0
coloredlogs                 14.0
croniter                    2.0.1
cryptography                41.0.7
dagster                     1.5.13
dagster-aws                 0.21.13
dagster-azure               0.21.13
dagster-celery              0.21.13
dagster-celery-k8s          0.21.13
dagster-gcp                 0.21.13
dagster-graphql             1.5.13
dagster-k8s                 0.21.13
dagster-pandas              0.21.13
dagster-pipes               1.5.13
dagster-postgres            0.21.13
dagster-webserver           1.5.13
db-dtypes                   1.2.0
docstring-parser            0.15
exceptiongroup              1.2.0
flower                      2.0.1
fsspec                      2023.12.2
google-api-core             2.15.0
google-api-python-client    2.111.0
google-auth                 2.25.2
google-auth-httplib2        0.2.0
google-cloud-bigquery       3.14.1
google-cloud-core           2.4.1
google-cloud-storage        2.14.0
google-crc32c               1.5.0
google-resumable-media      2.7.0
googleapis-common-protos    1.62.0
gql                         3.4.1
graphene                    3.3
graphql-core                3.2.3
graphql-relay               3.2.0
greenlet                    3.0.2
grpcio                      1.60.0
grpcio-health-checking      1.60.0
h11                         0.14.0
httplib2                    0.22.0
httptools                   0.6.1
humanfriendly               10.0
humanize                    4.9.0
idna                        3.6
isodate                     0.6.1
Jinja2                      3.1.2
jmespath                    1.0.1
kombu                       5.3.4
kubernetes                  28.1.0
Mako                        1.3.0
MarkupSafe                  2.1.3
msal                        1.26.0
msal-extensions             1.1.0
multidict                   6.0.4
numpy                       1.26.2
oauth2client                4.1.3
oauthlib                    3.2.2
packaging                   23.2
pandas                      2.1.4
pendulum                    2.1.2
pip                         23.0.1
portalocker                 2.8.2
prometheus-client           0.19.0
prompt-toolkit              3.0.43
protobuf                    4.25.1
psycopg2-binary             2.9.9
pyarrow                     14.0.1
pyasn1                      0.5.1
pyasn1-modules              0.3.0
pycparser                   2.21
pydantic                    2.5.2
pydantic_core               2.14.5
PyJWT                       2.8.0
pyparsing                   3.1.1
python-dateutil             2.8.2
python-dotenv               1.0.0
pytz                        2023.3.post1
pytzdata                    2020.1
PyYAML                      6.0.1
redis                       5.0.1
requests                    2.31.0
requests-oauthlib           1.3.1
requests-toolbelt           0.10.1
rsa                         4.9
s3transfer                  0.9.0
setuptools                  65.5.1
six                         1.16.0
sniffio                     1.3.0
SQLAlchemy                  2.0.23
starlette                   0.33.0
tabulate                    0.9.0
tomli                       2.0.1
toposort                    1.10
tornado                     6.4
tqdm                        4.66.1
typing_extensions           4.9.0
tzdata                      2023.3
universal-pathlib           0.1.4
uritemplate                 4.1.1
urllib3                     1.26.18
uvicorn                     0.24.0.post1
uvloop                      0.19.0
vine                        5.1.0
watchdog                    3.0.0
watchfiles                  0.21.0
wcwidth                     0.2.12
websocket-client            1.7.0
websockets                  12.0
wheel                       0.42.0
yarl                        1.9.4

pip freeze

alembic==1.13.0
amqp==5.2.0
aniso8601==9.0.1
annotated-types==0.6.0
anyio==4.1.0
async-timeout==4.0.3
azure-core==1.29.5
azure-identity==1.15.0
azure-storage-blob==12.19.0
azure-storage-file-datalake==12.14.0
backoff==2.2.1
billiard==4.2.0
boto3==1.34.0
botocore==1.34.0
cachetools==5.3.2
celery==5.3.6
certifi==2023.11.17
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.3.0
coloredlogs==14.0
croniter==2.0.1
cryptography==41.0.7
dagster==1.5.13
dagster-aws==0.21.13
dagster-azure==0.21.13
dagster-celery==0.21.13
dagster-celery-k8s==0.21.13
dagster-gcp==0.21.13
dagster-graphql==1.5.13
dagster-k8s==0.21.13
dagster-pandas==0.21.13
dagster-pipes==1.5.13
dagster-postgres==0.21.13
dagster-webserver==1.5.13
db-dtypes==1.2.0
docstring-parser==0.15
exceptiongroup==1.2.0
flower==2.0.1
fsspec==2023.12.2
google-api-core==2.15.0
google-api-python-client==2.111.0
google-auth==2.25.2
google-auth-httplib2==0.2.0
google-cloud-bigquery==3.14.1
google-cloud-core==2.4.1
google-cloud-storage==2.14.0
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.62.0
gql==3.4.1
graphene==3.3
graphql-core==3.2.3
graphql-relay==3.2.0
greenlet==3.0.2
grpcio==1.60.0
grpcio-health-checking==1.60.0
h11==0.14.0
httplib2==0.22.0
httptools==0.6.1
humanfriendly==10.0
humanize==4.9.0
idna==3.6
isodate==0.6.1
Jinja2==3.1.2
jmespath==1.0.1
kombu==5.3.4
kubernetes==28.1.0
Mako==1.3.0
MarkupSafe==2.1.3
msal==1.26.0
msal-extensions==1.1.0
multidict==6.0.4
numpy==1.26.2
oauth2client==4.1.3
oauthlib==3.2.2
packaging==23.2
pandas==2.1.4
pendulum==2.1.2
portalocker==2.8.2
prometheus-client==0.19.0
prompt-toolkit==3.0.43
protobuf==4.25.1
psycopg2-binary==2.9.9
pyarrow==14.0.1
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==2.5.2
pydantic_core==2.14.5
PyJWT==2.8.0
pyparsing==3.1.1
python-dateutil==2.8.2
python-dotenv==1.0.0
pytz==2023.3.post1
pytzdata==2020.1
PyYAML==6.0.1
redis==5.0.1
requests==2.31.0
requests-oauthlib==1.3.1
requests-toolbelt==0.10.1
rsa==4.9
s3transfer==0.9.0
six==1.16.0
sniffio==1.3.0
SQLAlchemy==2.0.23
starlette==0.33.0
tabulate==0.9.0
tomli==2.0.1
toposort==1.10
tornado==6.4
tqdm==4.66.1
typing_extensions==4.9.0
tzdata==2023.3
universal-pathlib==0.1.4
uritemplate==4.1.1
urllib3==1.26.18
uvicorn==0.24.0.post1
uvloop==0.19.0
vine==5.1.0
watchdog==3.0.0
watchfiles==0.21.0
wcwidth==0.2.12
websocket-client==1.7.0
websockets==12.0
yarl==1.9.4
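
The two `pip freeze` listings above can be compared mechanically rather than by eye. A minimal sketch, assuming the freeze output is available as text; the function names `parse_freeze` and `diff_freeze` are illustrative, and the sample data is only an excerpt of the listings above:

```python
from __future__ import annotations


def parse_freeze(text: str) -> dict[str, str]:
    """Parse `pip freeze` output into {package_name: version}."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a plain `name==version` pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.lower()] = version
    return packages


def diff_freeze(old: str, new: str) -> dict[str, tuple[str | None, str | None]]:
    """Return {name: (old_version, new_version)} for every package that differs.

    A version of None means the package is absent from that environment.
    """
    old_pkgs, new_pkgs = parse_freeze(old), parse_freeze(new)
    changes = {}
    for name in sorted(old_pkgs.keys() | new_pkgs.keys()):
        before, after = old_pkgs.get(name), new_pkgs.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


# Excerpt of the two listings above (not the full environments).
FREEZE_1_5_12 = """
boto3==1.33.12
grpcio==1.60.0
grpcio-status==1.60.0
proto-plus==1.23.0
s3transfer==0.8.2
"""
FREEZE_1_5_13 = """
boto3==1.34.0
grpcio==1.60.0
s3transfer==0.9.0
"""

if __name__ == "__main__":
    for name, (before, after) in diff_freeze(FREEZE_1_5_12, FREEZE_1_5_13).items():
        print(f"{name}: {before} -> {after}")
```

Run against the full listings, this kind of diff would surface only the packages that changed between the two images, which is the information asked for above.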

aaaaahaaaaa avatar Jan 03 '24 17:01 aaaaahaaaaa

Thanks for following up, not much interesting in the dependency changes.

I spent some time with memray looking for leaks and have so far not been able to turn anything up.

Do you have anything like automated recurring queries against the webserver?

alangenfeld avatar Jan 03 '24 20:01 alangenfeld

Do you have anything like automated recurring queries against the webserver?

Well, only the readinessProbe from your chart.

Turns out we actually still observe the same behaviour after rolling back to 1.5.12. So it's not related to the new version. I'm puzzled now. I'll try to investigate further and close the issue.

aaaaahaaaaa avatar Jan 04 '24 09:01 aaaaahaaaaa

I've had luck using https://github.com/facebookarchive/memory-analyzer to get a memory profile of a running process, and https://github.com/kmaork/madbg for interactive poking around in the active process. I believe both need the SYS_PTRACE capability granted in the k8s pod spec.

Given it's a webserver, it's also susceptible to the "type 3" leaks described at https://blog.nelhage.com/post/three-kinds-of-leaks/ (Python allocator arena fragmentation), but the very smooth gradient of your graphs makes me skeptical that's the cause without some sort of recurring large query causing the fragmentation.
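
For anyone who cannot attach the tools above, a rougher first pass is possible with the standard library alone. This is a minimal sketch using `tracemalloc`, which is a different technique from memory-analyzer or madbg: it compares two snapshots and lists the source lines whose allocations grew. The helper name `top_growth` and the simulated leak are illustrative. Note that `tracemalloc` only sees allocations made through Python's allocator, so growth inside a native extension would not appear here.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return up to `limit` source lines whose allocated size grew between snapshots."""
    stats = after.compare_to(before, "lineno")
    return [s for s in stats if s.size_diff > 0][:limit]


if __name__ == "__main__":
    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()

    # Stand-in for a leak: a list that keeps references to ~1 MB of objects.
    leaked = [bytes(1024) for _ in range(1000)]

    current = tracemalloc.take_snapshot()
    for stat in top_growth(baseline, current):
        print(stat)
    tracemalloc.stop()
```

In a long-running process the two snapshots would be taken hours apart; a line that keeps climbing across successive comparisons is a candidate for the leak.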

alangenfeld avatar Jan 04 '24 15:01 alangenfeld

@aaaaahaaaaa did you find any reason why memory started growing? We have a similar issue and switching between versions didn't help yet - tried from 1.5.14 to 1.5.12.

The memory increase is quite noticeable, showing up even in daily granularity.

This issue seems to be isolated to the webserver component. Both the daemon and code servers are exhibiting stable memory usage. We are operating these as three separate containers within AWS ECS.

We have only one scheduled job active, and no sensors or auto-materialization so far. Assets are loaded from dbt.

[image: SCR-20240119-iqbx]

jvyoralek avatar Jan 19 '24 08:01 jvyoralek

@jvyoralek No I didn't find the source of the problem and the issue is still occurring for us as well. Unfortunately I didn't have time to investigate further. I think there's clearly something up with the workload, we're not doing anything special either aside from deploying the helm chart.

aaaaahaaaaa avatar Jan 19 '24 08:01 aaaaahaaaaa

@alangenfeld found a memory leak that could be the cause of this, I'll let him comment but here is the PR that attempts to fix it https://github.com/dagster-io/dagster/pull/19298

salazarm avatar Jan 19 '24 15:01 salazarm

https://github.com/dagster-io/dagster/pull/19298 is a fix for a problem that manifests as very rapid unbounded memory growth resulting in process termination. I don't believe it's related to this slower memory growth.

alangenfeld avatar Jan 19 '24 15:01 alangenfeld

I appear to have a similar problem after upgrading to 1.6. I run Dagster on AWS ECS using Fargate, so I don't believe my jobs are causing it, since the code runs in a separate task. Memory for both the daemon and the Dagit/webserver services is slowly creeping up. The drops in the following chart are due to restarts. Before the upgrade to 1.6 on the 11th, this problem didn't exist.

[image]

noam-jacobson avatar Jan 25 '24 15:01 noam-jacobson

@noam-jacobson what version were you upgrading from?

alangenfeld avatar Jan 25 '24 16:01 alangenfeld

@noam-jacobson what version were you upgrading from?

I was on version 1.5.10

noam-jacobson avatar Jan 25 '24 17:01 noam-jacobson

@noam-jacobson We're having the same issue on ECS/Fargate on 1.5.7

jackwillisupside avatar Jan 30 '24 15:01 jackwillisupside

We are also having the same issue on 1.6.0, also ECS/Fargate

will-regal-voice avatar Jan 30 '24 16:01 will-regal-voice

Same here in our k8s deployment cluster. Any clue?

gasgallo avatar Feb 02 '24 07:02 gasgallo

We think we might have solved it on our end: we didn't have a strict retention policy on logs set in our dagster.yml, and once we set it to the values below, our memory stopped growing:

retention:
  schedule:
    purge_after_days: 90 # sets retention policy for schedule ticks of all types
  sensor:
    purge_after_days:
      skipped: 7
      failure: 90
      success: 365

jackwillisupside avatar Feb 08 '24 19:02 jackwillisupside

We think we might have solved it on our end: we didn't have a strict retention policy on logs set in our dagster.yml, and once we set it to the values below, our memory stopped growing:

retention:
  schedule:
    purge_after_days: 90 # sets retention policy for schedule ticks of all types
  sensor:
    purge_after_days:
      skipped: 7
      failure: 90
      success: 365

How did that impact your memory usage? Technically you'll still retain ticks for up to 365 days, thus you should not see a change in behavior in just a few days. Or did I miss something?

I've applied a similar setting on my deployment as well (way stricter than yours, for testing) and my memory is still going up, same as before.

gasgallo avatar Feb 16 '24 08:02 gasgallo

Same problem here on OpenShift with nearly the same packages (dagster 1.6.5), also PostgreSQL and slim-buster images on both the daemon and dagster-webserver (separate pods). Tried Python 3.10 and 3.11, and sqlalchemy both <2.0 and >2.0; no luck so far, it crashes every 3-4 days. Currently trying Python 3.12, dagster 1.6.6 and slim-bookworm; we'll know more in the next few days...

alexknorr avatar Feb 23 '24 19:02 alexknorr

EDIT: We found out that the following is actually not working. The initial indication might have just been a fluke.

~We were having this issue and I believe that we have found the root cause to be a bug in anyio which leaked processes. The bug was introduced in 4.1.0 and fixed in 4.3.0 (last week): https://github.com/agronholm/anyio/issues/669~

~Dagster has a dependency on anyio through the following chain: dagit --> dagster-webserver --> starlette --> anyio and I believe that this issue started to appear for people whenever they rebuilt their Dagster image during the time that bug was present because a newer but buggy version of anyio would have been included in their docker image.~

~So, the solution could be to either explicitly require anyio >= 4.3.0 or to wait until people rebuild their docker images and automatically get the bug-fixed version.~

stasharrofi avatar Feb 28 '24 19:02 stasharrofi

Has anyone had success with the solution recommended by @stasharrofi?

We have made changes, but it appears that the memory usage is still increasing.

image

I see anyio 4.3.0 in the build log:

#12 1.757 Collecting dagster==1.6.6
#12 1.810   Downloading dagster-1.6.6-py3-none-any.whl (1.4 MB)
#12 1.852      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 36.1 MB/s eta 0:00:00
#12 2.037 Collecting dagster-aws==0.22.6
#12 2.042   Downloading dagster_aws-0.22.6-py3-none-any.whl (109 kB)
#12 2.048      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 109.8/109.8 kB 32.6 MB/s eta 0:00:00
#12 2.214 Collecting dagster-postgres==0.22.6
#12 2.219   Downloading dagster_postgres-0.22.6-py3-none-any.whl (20 kB)
#12 2.259 Collecting anyio==4.3.0
#12 2.263   Downloading anyio-4.3.0-py3-none-any.whl (85 kB)

jvyoralek avatar Mar 01 '24 13:03 jvyoralek

@jvyoralek It hasn't worked for me. Deployed the newest Dagster version 1.6.6 with anyio-4.3.0.

noam-jacobson avatar Mar 01 '24 14:03 noam-jacobson

@jvyoralek : No, we found out that it's not working for us either. The initial indication that it was working was probably just a fluke.

stasharrofi avatar Mar 01 '24 14:03 stasharrofi

Same issue here with an ECS deployment, packages and versions included below

image

dagster==1.6.10
dagster-graphql==1.6.10
dagster-webserver==1.6.10
dagster-postgres==0.22.10
dagster-docker==0.22.10

shivonchain avatar Apr 02 '24 15:04 shivonchain

My team experienced this issue in an OSS ECS deployment after an upgrade from 1.5.9 -> 1.6.8. It impacted the dagit/webserver and daemon services, but not the independent gRPC/code location services. It presented as a slow leak that would increase memory utilization over a week or so until hitting critical thresholds and crashing the service, with 1 GB of memory allocated to the services.

We "resolved" the issue in our environments by downgrading and pinning the grpcio python package to 1.57.0.

In incremental tests we downgraded our docker image base to the image version/sha we used for our 1.5.9 deployment, reverted dagster packages from 1.6.8 back to 1.5.9, and updated python from 3.10 -> 3.11. None of these changes resolved the memory leak.

Sharing this context because it supports the root cause being an unpinned package dependency, and not necessarily an issue with the core dagster packages. It also rules out an interaction with OS libs/OS version as the cause of the leak.

We selected grpcio 1.57.0 because it was the version of the dep that was solved for at the time when we originally deployed 1.5.9. It's possible a more recent version would work as well.
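
Because the pin only helps if it actually survives an image rebuild, a small check at container start can confirm which grpcio version ended up installed. A minimal sketch using the standard library's `importlib.metadata`; the helper names and the `KNOWN_GOOD_GRPCIO` constant are illustrative, and 1.57.0 is simply the version reported to work in this thread, not an official recommendation:

```python
from __future__ import annotations

from importlib import metadata

# Version reported as leak-free in this thread; an assumption for any other environment.
KNOWN_GOOD_GRPCIO = "1.57.0"


def installed_version(package: str) -> str | None:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(package: str, expected: str) -> str:
    """Describe whether the installed version of `package` matches `expected`."""
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed"
    if found != expected:
        return f"{package} {found} installed, expected {expected}"
    return f"{package} {found} matches the pin"


if __name__ == "__main__":
    print(check_pin("grpcio", KNOWN_GOOD_GRPCIO))
```

Logging this line once at startup makes it easy to correlate a memory graph with the grpcio version that was actually running.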

jobicarter avatar Apr 08 '24 19:04 jobicarter

Thank you, @jobicarter, for the effective workaround. We deployed it yesterday, and although it's only been a short time, we're already seeing promising changes.

Tested with these versions:

dagster==1.7.0
dagster-webserver==1.7.0
dagster-graphql==1.7.0
dagster-aws==0.23.0
dagster-postgres==0.23.0
grpcio==1.57.0
image

jvyoralek avatar Apr 10 '24 09:04 jvyoralek

I can confirm that downgrading grpcio to 1.57.0 stops the leak.

dagster==1.5.14
dagster-aws==0.21.14
dagster-azure==0.21.14
dagster-celery==0.21.14
dagster-celery-k8s==0.21.14
dagster-gcp==0.21.14
dagster-graphql==1.5.14
dagster-k8s==0.21.14
dagster-pandas==0.21.14
dagster-pipes==1.5.14
dagster-postgres==0.21.14
dagster-webserver==1.5.14
grpcio==1.57.0
grpcio-health-checking==1.57.0

We also tried upgrading it to 1.62.1, but that didn't seem to help.

csomh avatar Apr 18 '24 20:04 csomh

Thanks for the solution. I think the Dagster issue could be related to https://github.com/grpc/grpc/issues/36117

G14rb avatar Apr 19 '24 11:04 G14rb

Hi all, we're having a similar issue with a Dagster Docker deployment on an Oracle VM. Unfortunately, downgrading grpcio to 1.57.0 hasn't resolved it. We're currently using the following setup for the Dagster image:

[image: Screenshot 2024-05-14 134923]

The VM now seems to reach an OOM state roughly every 8 hours.

p-y-t-h-e-c avatar May 14 '24 12:05 p-y-t-h-e-c

We are running into the same issue on our Kubernetes cluster, having installed Dagster via the Helm chart.

Is the solution to downgrade grpcio for the dagster-webserver pod? In that case, we should build a custom Dockerfile that changes the dependencies and point to that Dockerfile in the Helm chart right?

I don't understand why Dagster hasn't pinned the grpcio version themselves to prevent this issue from happening. It seems a little strange that they expect users to either live with the memory leak or fix the dependencies manually.

rensoostenbachBL avatar Aug 02 '24 09:08 rensoostenbachBL