Dagit crash with large dagster-dbt repo (databricks-dbt)
Dagster version
0.15.8
What's the issue?
The dagster-dbt setup runs fine in local Dagit (command: `dagit`), but when deployed to Kubernetes, Dagit starts to fail. The Kubernetes setup works fine with a small dbt repo, and it also worked fine before adding dagster-dbt assets from our existing dbt repo. The Dagit Kubernetes pod fails in a loop with the status: `CrashLoopBackOff 27`.
I tried increasing the Dagit resources with the following values (see the Deployment details section below for more):
```yaml
dagit:
  image:
    repository: "${HOST_REPO}/${DAGSTER_IMAGE_PATH}"
    tag: "${BUILD_TAG}"
    pullPolicy: Always
  servers:
    - host: "k8s-dagster-deployment"
      port: 3030
      name: "k8s-dagster-deployment"
  resources:
    limits:
      cpu: 1100m
      memory: 1128Mi
    requests:
      cpu: 1100m
      memory: 1128Mi
```
I installed the following dependencies with `pip install -r requirements.txt`:
```
dagit==0.15.8
dagster-aws==0.15.8
dagster-celery-k8s==0.15.8
dagster-celery==0.15.8
dagster-dbt==0.15.8
dagster-graphql==0.15.8
dagster-k8s==0.15.8
dagster-pandas==0.15.8
dagster-postgres==0.15.8
dagster==0.15.8
dbt-databricks==1.1.1
```
I'd be thankful if you can share any other ideas.
What did you expect to happen?
Dagit should load our dbt project successfully.
See the local output from the Dagit command run:

How to reproduce?
Use the Dagster API gRPC args package option (`dagsterApiGrpcArgs` with `--package-name`) for dagster-user-deployments.
The package contains the dbt project as a subfolder inside the package, e.g. gcdsdagster/gc-ds-dbt/
Our values.yaml uses the Dagster helm chart from the dagster/dagster project, which has the following custom settings:
```yaml
dagit:
  image:
    repository: "${HOST}/${DAGSTER_IMAGE_PATH}"
    tag: "${BUILD_TAG}"
    pullPolicy: Always
  servers:
    - host: "k8s-dagster-deployment"
      port: 3030
      name: "k8s-dagster-deployment"
  resources:
    limits:
      cpu: 1100m
      memory: 1128Mi
    requests:
      cpu: 1100m
      memory: 1128Mi
dagster-user-deployments:
  deployments:
    - name: "k8s-dagster-deployment"
      image:
        repository: "${HOST}/${USER_DEPLOYMENTS_IMAGE_PATH}"
        tag: "${BUILD_TAG}"
        pullPolicy: Always
      dagsterApiGrpcArgs:
        - "--package-name"
        - "gcdsdagster"
      port: 3030
pipelineRun:
  image:
    repository: "${HOST}/${USER_DEPLOYMENTS_IMAGE_PATH}"
    tag: "${BUILD_TAG}"
    pullPolicy: Always
dagsterDaemon:
  image:
    repository: "${HOST}/${DAGSTER_IMAGE_PATH}"
    tag: "${BUILD_TAG}"
    pullPolicy: Always
ingress:
  enabled: true
  annotations:
    cert-manager.io/cluster-issuer: "<cert_path>"
    nginx.ingress.kubernetes.io/proxy-body-size: "2048m"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "75"
  dagit:
    host: "${INGRESS_HOSTNAME}"
    path: "/"
    tls:
      enabled: true
      secretName: "<cert_path>"
telemetry:
  enabled: false
```
Deployment type
Dagster Helm chart
Deployment details
During deployment, the user code deployment pod initially shows some issues. See the error:
```
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server

  File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 552, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 476, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/origin.py", line 329, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 547, in __init__
    list_repositories_response = sync_list_repositories_grpc(self.client)
  File "/usr/local/lib/python3.9/site-packages/dagster/_api/list_repositories.py", line 19, in sync_list_repositories_grpc
    api_client.list_repositories(),
  File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 169, in list_repositories
    res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
  File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 115, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e

The above exception was caused by the following exception:

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode.UNAVAILABLE details = "failed to connect to all addresses" debug_error_string = "{"created":"@1659732932.980573405","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1659732932.980572991","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}" >

  File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 112, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
```
After a few minutes, the user code deployment pod stabilizes and the Dagit pod starts to fail in a loop with the errors:
See Deployments:
See events log:
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.
@alangenfeld this is probably hitting the grpc timeout? Iirc we made this configurable?
I did add `DAGSTER_GRPC_TIMEOUT_SECONDS` as an env var that can be set, but it's not clear to me that that's the solution here.
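If the gRPC timeout is indeed the problem, one way to raise it would be via the env var on the Dagit container. This is only a sketch: the `dagit.env` key and the value of 300 seconds are assumptions, so check the values schema of your chart version before using it.

```yaml
dagit:
  env:
    # Assumption: 300s gives the large dbt repo enough time to load; tune as needed.
    DAGSTER_GRPC_TIMEOUT_SECONDS: "300"
```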
What were the logs in the dagit pod? You can set `logLevel` in the helm values to trigger more verbose logging.
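For reference, a minimal sketch of turning up verbosity via the helm values, assuming the `logLevel` key sits under the `dagit` section in your chart version:

```yaml
dagit:
  logLevel: "DEBUG"  # assumption: DEBUG is the most verbose level accepted here
```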
How much memory does the dagit process take when it loads the repo locally successfully?
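One way to answer that locally is to check the resident memory of the running dagit process. Below is a small sketch (Linux-only, since it reads `/proc`; finding the dagit PID with `pgrep -f dagit` is an assumption about your setup):

```python
import os


def rss_mib(pid: int) -> float:
    """Return the resident set size (RSS) of `pid` in MiB, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                # The VmRSS field is reported in kB; convert to MiB.
                return int(line.split()[1]) / 1024
    raise RuntimeError(f"VmRSS not found for pid {pid}")


if __name__ == "__main__":
    # Example: inspect this process; for dagit, substitute the PID from `pgrep -f dagit`.
    print(f"{rss_mib(os.getpid()):.1f} MiB")
```

Comparing that number against the 1128Mi pod limit above would show whether the large dbt repo simply exceeds the memory you have granted the pod.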