
Dagit crash with large dagster-dbt repo (databricks-dbt)

Open · machov opened this issue 2 years ago

Dagster version

0.15.8

What's the issue?

The dagster-dbt setup runs fine on local Dagit (via the dagit command), but when deployed to Kubernetes, Dagit starts to fail.

The Kubernetes setup works fine with a small dbt repo.

It also works fine before adding the dagster-dbt assets from our existing dbt repo.

The Dagit Kubernetes pod fails in a loop with the error:

Status: "CrashLoopBackOff 27"

I tried increasing Dagit resources by using the following values (see the Deployment details section for more details):

dagit:
  image:
    repository: "${HOST_REPO}/${DAGSTER_IMAGE_PATH}"
    tag: "${BUILD_TAG}"
    pullPolicy: Always

    servers:
      - host: "k8s-dagster-deployment"
        port: 3030
        name: "k8s-dagster-deployment"

  resources:
    limits:
      cpu: 1100m
      memory: 1128Mi
    requests:
      cpu: 1100m
      memory: 1128Mi

I installed the following dependencies with pip install -r requirements.txt

dagit==0.15.8
dagster-aws==0.15.8
dagster-celery-k8s==0.15.8
dagster-celery==0.15.8
dagster-dbt==0.15.8
dagster-graphql==0.15.8
dagster-k8s==0.15.8
dagster-pandas==0.15.8
dagster-postgres==0.15.8
dagster==0.15.8
dbt-databricks==1.1.1

I'd be thankful if you can share any other ideas.

What did you expect to happen?

Dagit should load our dbt project successfully.

See the local output from the dagit command run:

(screenshot: local Dagit output)

How to reproduce?

Use the dagsterApiGrpcArgs --package-name option for dagster-user-deployments.

The package contains the dbt project as a subfolder, e.g. gcdsdagster/gc-ds-dbt/.

Our values.yaml uses the Dagster Helm chart from the dagster/dagster project with the following custom settings:

dagit:
  image:
    repository: "${HOST}/${DAGSTER_IMAGE_PATH}"
    tag: "${BUILD_TAG}"
    pullPolicy: Always

    servers:
      - host: "k8s-dagster-deployment"
        port: 3030
        name: "k8s-dagster-deployment"

  resources:
    limits:
      cpu: 1100m
      memory: 1128Mi
    requests:
      cpu: 1100m
      memory: 1128Mi

dagster-user-deployments:
  deployments:
    - name: "k8s-dagster-deployment"
      image:
        repository: "${HOST}/${USER_DEPLOYMENTS_IMAGE_PATH}"
        tag: "${BUILD_TAG}"
        pullPolicy: Always

      dagsterApiGrpcArgs:
        - "--package-name"
        - "gcdsdagster"
      port: 3030

pipelineRun:
  image:
    repository: "${HOST}/${USER_DEPLOYMENTS_IMAGE_PATH}"
    tag: "${BUILD_TAG}"
    pullPolicy: Always

dagsterDaemon:
  image:
    repository: "${HOST}/${DAGSTER_IMAGE_PATH}"
    tag: "${BUILD_TAG}"
    pullPolicy: Always

ingress:
  enabled: true
  annotations:
    cert-manager.io/cluster-issuer: "<cert_path>"
    nginx.ingress.kubernetes.io/proxy-body-size: "2048m"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "75"

  dagit:
    host: "${INGRESS_HOSTNAME}"
    path: "/"
    tls:
      enabled: true
      secretName: "<cert_path>"


telemetry:
  enabled: false

Deployment type

Dagster Helm chart

Deployment details

During deployment, the user code deployment pod initially shows some issues. See the error:

dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 552, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/workspace/context.py", line 476, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/origin.py", line 329, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.9/site-packages/dagster/_core/host_representation/repository_location.py", line 547, in __init__
    list_repositories_response = sync_list_repositories_grpc(self.client)
  File "/usr/local/lib/python3.9/site-packages/dagster/_api/list_repositories.py", line 19, in sync_list_repositories_grpc
    api_client.list_repositories(),
  File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 169, in list_repositories
    res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
  File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 115, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode.UNAVAILABLE details = "failed to connect to all addresses" debug_error_string = "{"created":"@1659732932.980573405","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1659732932.980572991","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}" >
  File "/usr/local/lib/python3.9/site-packages/dagster/_grpc/client.py", line 112, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)

After a few minutes the user code deployment pod stabilizes, and the Dagit pod starts to fail in a loop with the errors below:

See Deployments (screenshot).

See the events log (screenshot).

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

machov, Aug 05 '22 21:08

@alangenfeld this is probably hitting the gRPC timeout? IIRC we made this configurable?

johannkm, Aug 23 '22 20:08

> @alangenfeld this is probably hitting the gRPC timeout? IIRC we made this configurable?

I did add DAGSTER_GRPC_TIMEOUT_SECONDS as an env var that can be set; it's not clear to me that that's the solution here.
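For illustration only, a minimal sketch of how that variable might be set through the Helm values, assuming the chart version in use exposes an env map under the dagit section (the 300-second value is just an example; check your chart's values schema):

dagit:
  env:
    # illustrative value; raise it if the user code server is slow to load the dbt project
    DAGSTER_GRPC_TIMEOUT_SECONDS: "300"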

What were the logs in the dagit pod? You can set logLevel in the Helm values to trigger more verbose logging.
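A minimal sketch of what that might look like in values.yaml, assuming the chart version in use exposes logLevel under the dagit section:

dagit:
  # DEBUG enables more verbose dagit logging
  logLevel: DEBUG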

How much memory does the dagit process take when it loads the repo locally successfully?

alangenfeld, Aug 23 '22 21:08