
dlt rest api resource failing in Databricks Asset Bundles

manel-parloa opened this issue 3 weeks ago

dlt version

1.18.2

Describe the problem

What we tried to achieve

  • Build an ingestion pipeline from the Zendesk API with destination Databricks
  • Schedule the workflow via the Databricks bundle (Databricks workflows)

Issues we encountered

Module Import Error in Databricks

Problem: ModuleNotFoundError: No module named 'dlt.sources' when running in Databricks environment

Root Cause: Incorrect import path; we were using from dlt.sources.rest_api import rest_api_source

Solution: Changed to the correct import based on the dlt documentation

Before:

from dlt.sources.rest_api import rest_api_source

After:

from dlt.sources.rest_api import rest_api_source # type: ignore[import-not-found]
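
For reference, here is a minimal sketch of how rest_api_source is typically wired up once the import resolves. The base_url subdomain, secret name, and resource selection below are illustrative assumptions, not our actual config:

import dlt
from dlt.sources.rest_api import rest_api_source  # type: ignore[import-not-found]

# Illustrative config only: subdomain, secret name, and resources are assumed.
source = rest_api_source({
    "client": {
        "base_url": "https://example.zendesk.com/api/v2/",
        "auth": {"type": "bearer", "token": dlt.secrets["zendesk_access_token"]},
    },
    "resources": ["tickets"],
})

pipeline = dlt.pipeline(pipeline_name="zendesk", destination="duckdb")
pipeline.run(source)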

Resolution attempts

We tried the following solution: "Remove preloaded databricks modules in the notebook". The run then failed with:

AttributeError: module 'dlt' has no attribute 'source'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File ~/.ipykernel/8130/command--1-180448381:8
      5 del sys
      7 with open(filename, "rb") as f:
----> 8   exec(compile(f.read(), filename, 'exec'))

File /Workspace/Users/ab53782c-b4a3-48ad-965f-a6bfbe6d3d18/.bundle/feat-DPA947-add-dlt-ingestion/files/pipelines/zendesk_ingestion.py:22
     19 import argparse
     20 import logging
---> 22 from ingestion_lib.external_sources.zendesk_ingestion.runner import run_zendesk_ingestion
     23 from ingestion_lib.utils.zendesk_dates import get_date_range
     24 from pyspark.sql import SparkSession

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.12/site-packages/ingestion_lib/external_sources/zendesk_ingestion/__init__.py:3
      1 """Zendesk ticket management data ingestion using dlt REST API source."""
----> 3 from .runner import run_zendesk_ingestion
      4 from .source import zendesk_ticket_source
      6 __all__ = ["run_zendesk_ingestion", "zendesk_ticket_source"]

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.12/site-packages/ingestion_lib/external_sources/zendesk_ingestion/runner.py:13
      9 import types
     11 import dlt
---> 13 from ingestion_lib.external_sources.zendesk_ingestion.source import zendesk_ticket_source
     15 # 1 Drop Databricks' post-import hook
     16 sys.meta_path = [h for h in sys.meta_path if "PostImportHook" not in repr(h)]

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.12/site-packages/ingestion_lib/external_sources/zendesk_ingestion/source.py:25
     21     if file_path and file_path.startswith("/databricks/spark/python/dlt"):
     22         del sys.modules[name]
---> 25 @dlt.source
     26 def zendesk_ticket_source(
     27     access_token: str = dlt.secrets.value,
     28     start_date: str = dlt.config.value,
     29     end_date: str = dlt.config.value,
     30 ):
     31     """Zendesk ticketing data source with date range backfill.
     32 
     33     This source extracts data from Zendesk's ticketing API endpoints for a specified
   (...)
     57         ... )
     58     """
     59     # Always use backfill mode with explicit date range

AttributeError: module 'dlt' has no attribute 'source'
Workload failed, see run output for details
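
Given the path check in source.py above (/databricks/spark/python/dlt), the error looks like a name clash: classic Databricks compute preloads its own dlt module for Delta Live Tables, which shadows dltHub's package. A minimal diagnostic sketch, standard library only; the expected outputs in the comments are assumptions:

import sys

import dlt

# If this prints a path under /databricks/spark/python/dlt, the preloaded
# Delta Live Tables module won the import, not dltHub's package.
print(dlt.__file__)
print(hasattr(dlt, "source"))  # True for dltHub's dlt; False for the DLT module

# What "remove preloaded databricks modules" boils down to: evict the cached
# modules so the next import can resolve to dltHub's dlt via sys.path.
for name in list(sys.modules):
    if name == "dlt" or name.startswith("dlt."):
        del sys.modules[name]
import dlt  # re-import; which package wins depends on sys.path ordering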

Expected behavior

To be able to run dltHub sources in a Databricks Workflow.

Steps to reproduce

Start by running

dlt init dlthub:zendesk_ticket_management duckdb

Then we added the resource for the job in the Databricks Asset Bundle:

---
resources:
  jobs:
    zendesk_ingestion:
      name: Zendesk Ticket Management Ingestion${var.suffix_name}
      description: Ingests ticketing data from Zendesk API into bronze catalog using
        dlt
      job_clusters:
        - job_cluster_key: standard_f4_job_cluster
          new_cluster:
            data_security_mode: SINGLE_USER
            node_type_id: Standard_F4s
            driver_node_type_id: Standard_F4s
            policy_id: ${var.cluster_policy_id}
            spark_version: 17.2.x-scala2.13
            num_workers: 1
            azure_attributes:
              availability: SPOT_WITH_FALLBACK_AZURE
              spot_bid_max_price: -1
              first_on_demand: 1
            spark_conf:
              spark.databricks.delta.optimizeWrite.enabled: 'true'
              spark.databricks.delta.autoCompact.enabled: 'true'
              spark.worker.cleanup.enabled: 'true'
              spark.executorEnv.CATALOG: ${var.catalog_name}
              spark.executorEnv.REGION: ${var.region}
              spark.executorEnv.ENVIRONMENT: ${var.environment}
              spark.driverEnv.ENVIRONMENT: ${var.environment}
              spark.driverEnv.CATALOG: ${var.catalog_name}
              spark.driverEnv.REGION: ${var.region}
            custom_tags:
              branch_name: ${var.branch_name}
              revision: ${var.branch_sha}
              region: ${var.region}
      webhook_notifications:
        on_failure:
          - id: ${var.notification_destination_id}
      schedule:
        quartz_cron_expression: 0 0 3 * * ?  # Daily at 3:00 AM UTC
        timezone_id: UTC
        pause_status: ${var.schedule}
      max_concurrent_runs: 1  # Prevent overlapping runs
      parameters:
        - name: days_back
          default: 2
      tasks:
        - task_key: zendesk_ingestion_task
          job_cluster_key: standard_f4_job_cluster
          max_retries: 2
          timeout_seconds: 7200  # 2 hours
          libraries:
            - whl: ../../ingestion_lib/dist/ingestion_lib-*.whl
          spark_python_task:
            python_file: ../../pipelines/zendesk_ingestion.py
            source: WORKSPACE
            parameters: [--days_back, '{{job.parameters.days_back}}']
      queue:
        enabled: true

The ingestion code is packaged in a wheel together with its Python dependencies. When we run the job, it fails on the import of the rest_api module.
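
For completeness, here is roughly what the entry point pipelines/zendesk_ingestion.py looks like, reconstructed from the traceback above; run_zendesk_ingestion and get_date_range live in the wheel, and their signatures here are assumptions:

import argparse
import logging

from ingestion_lib.external_sources.zendesk_ingestion.runner import run_zendesk_ingestion
from ingestion_lib.utils.zendesk_dates import get_date_range

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    parser = argparse.ArgumentParser()
    parser.add_argument("--days_back", type=int, default=2)
    args = parser.parse_args()
    # Assumed to return an ISO (start_date, end_date) pair for backfill mode.
    start_date, end_date = get_date_range(args.days_back)
    logging.info("Ingesting Zendesk tickets from %s to %s", start_date, end_date)
    run_zendesk_ingestion(start_date=start_date, end_date=end_date)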

Operating system

macOS

Runtime environment

Other

Python version

3.12

dlt data source

zendesk_ticket_management, rest_api_sources

dlt destination

No response

Other deployment details

Databricks

Additional information

No response

manel-parloa commented Nov 14 '25 15:11

@manel-parloa did you also apply the init bash script as part of the "remove preloaded databricks modules in the notebook" step?

bayees commented Nov 14 '25 17:11

> @manel-parloa did you also apply the init bash script as part of the "remove preloaded databricks modules in the notebook" step?

I didn't do the bash script, but I tried this: https://dlthub.com/docs/dlt-ecosystem/destinations/databricks#2-remove-preloaded-databricks-modules-in-the-notebook and it didn't work. Also, I am not using a notebook; when I tried in a notebook using serverless compute, it works without any issue.

My current implementation is a wheel called from a Python file in a Databricks job for orchestration, i.e. using Databricks Asset Bundles.

manel-parloa avatar Nov 17 '25 10:11 manel-parloa