dlt rest api resource failing in Databricks Asset Bundles
dlt version
1.18.2
Describe the problem
What we tried to achieve
- Build an ingestion pipeline from the Zendesk API with Databricks as the destination
- Schedule the workflow via the Databricks bundle (Databricks Workflows)
Issues we encountered
Module Import Error in Databricks
Problem: ModuleNotFoundError: No module named 'dlt.sources' when running in the Databricks environment
Root Cause: incorrect import path; we were using from dlt.sources.rest_api import rest_api_source
Solution: changed to the correct import based on the dlt documentation
Before:
from dlt.sources.rest_api import rest_api_source
After:
from dlt.sources.rest_api import rest_api_source # type: ignore[import-not-found]
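A quick way to check which dlt actually got imported (a diagnostic sketch of our own, not from the dlt docs; the /databricks/spark/python/dlt path is the Delta Live Tables module visible in the traceback below):

import dlt

# Open-source dlt resolves to .../site-packages/dlt/__init__.py, while
# Databricks' preloaded Delta Live Tables module lives under
# /databricks/spark/python/dlt.
print(dlt.__file__)
# True for open-source dlt, False for the Delta Live Tables module.
print(hasattr(dlt, "source"))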
Resolution attempts
We tried the following solution: remove preloaded Databricks modules in the notebook (sketched below).
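Roughly, the workaround we applied (adapted from the dlt docs, and mirroring what our source.py does per the traceback below):

import sys

# Drop Databricks' post-import hook so the eviction below sticks.
sys.meta_path = [h for h in sys.meta_path if "PostImportHook" not in repr(h)]

# Evict the preloaded Delta Live Tables modules so that a fresh
# `import dlt` resolves to the open-source package instead.
for name in list(sys.modules):
    file_path = getattr(sys.modules[name], "__file__", "") or ""
    if file_path.startswith("/databricks/spark/python/dlt"):
        del sys.modules[name]

import dlt

It still failed with: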
AttributeError: module 'dlt' has no attribute 'source'
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
File ~/.ipykernel/8130/command--1-180448381:8
5 del sys
7 with open(filename, "rb") as f:
----> 8 exec(compile(f.read(), filename, 'exec'))
File /Workspace/Users/ab53782c-b4a3-48ad-965f-a6bfbe6d3d18/.bundle/feat-DPA947-add-dlt-ingestion/files/pipelines/zendesk_ingestion.py:22
19 import argparse
20 import logging
---> 22 from ingestion_lib.external_sources.zendesk_ingestion.runner import run_zendesk_ingestion
23 from ingestion_lib.utils.zendesk_dates import get_date_range
24 from pyspark.sql import SparkSession
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.12/site-packages/ingestion_lib/external_sources/zendesk_ingestion/__init__.py:3
1 """Zendesk ticket management data ingestion using dlt REST API source."""
----> 3 from .runner import run_zendesk_ingestion
4 from .source import zendesk_ticket_source
6 __all__ = ["run_zendesk_ingestion", "zendesk_ticket_source"]
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.12/site-packages/ingestion_lib/external_sources/zendesk_ingestion/runner.py:13
9 import types
11 import dlt
---> 13 from ingestion_lib.external_sources.zendesk_ingestion.source import zendesk_ticket_source
15 # 1 Drop Databricks' post-import hook
16 sys.meta_path = [h for h in sys.meta_path if "PostImportHook" not in repr(h)]
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.12/site-packages/ingestion_lib/external_sources/zendesk_ingestion/source.py:25
21 if file_path and file_path.startswith("/databricks/spark/python/dlt"):
22 del sys.modules[name]
---> 25 @dlt.source
26 def zendesk_ticket_source(
27 access_token: str = dlt.secrets.value,
28 start_date: str = dlt.config.value,
29 end_date: str = dlt.config.value,
30 ):
31 """Zendesk ticketing data source with date range backfill.
32
33 This source extracts data from Zendesk's ticketing API endpoints for a specified
(...)
57 ... )
58 """
59 # Always use backfill mode with explicit date range
AttributeError: module 'dlt' has no attribute 'source'
Workload failed, see run output for details
Expected behavior
We expected to be able to run dlt hub sources in a Databricks Workflow.
Steps to reproduce
Start by running
dlt init dlthub:zendesk_ticket_management duckdb
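For a local sanity check, the scaffolded source can be run with roughly this shape (the base_url, resource list, and dataset name below are placeholders, not our production configuration):

import dlt
from dlt.sources.rest_api import rest_api_source

# Placeholder REST API config; the real source targets the Zendesk
# ticketing endpoints with authentication from dlt secrets.
source = rest_api_source(
    {
        "client": {"base_url": "https://example.zendesk.com/api/v2/"},
        "resources": ["tickets"],
    }
)

pipeline = dlt.pipeline(
    pipeline_name="zendesk_ingestion",
    destination="duckdb",  # duckdb locally, as in the init command; databricks in the job
    dataset_name="zendesk",
)
pipeline.run(source)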
Then we added a resource for the job in the Databricks Asset Bundle:
---
resources:
jobs:
zendesk_ingestion:
name: Zendesk Ticket Management Ingestion${var.suffix_name}
description: Ingests ticketing data from Zendesk API into bronze catalog using dlt
job_clusters:
- job_cluster_key: standard_f4_job_cluster
new_cluster:
data_security_mode: SINGLE_USER
node_type_id: Standard_F4s
driver_node_type_id: Standard_F4s
policy_id: ${var.cluster_policy_id}
spark_version: 17.2.x-scala2.13
num_workers: 1
azure_attributes:
availability: SPOT_WITH_FALLBACK_AZURE
spot_bid_max_price: -1
first_on_demand: 1
spark_conf:
spark.databricks.delta.optimizeWrite.enabled: 'true'
spark.databricks.delta.autoCompact.enabled: 'true'
spark.worker.cleanup.enabled: 'true'
spark.executorEnv.CATALOG: ${var.catalog_name}
spark.executorEnv.REGION: ${var.region}
spark.executorEnv.ENVIRONMENT: ${var.environment}
spark.driverEnv.ENVIRONMENT: ${var.environment}
spark.driverEnv.CATALOG: ${var.catalog_name}
spark.driverEnv.REGION: ${var.region}
custom_tags:
branch_name: ${var.branch_name}
revision: ${var.branch_sha}
region: ${var.region}
webhook_notifications:
on_failure:
- id: ${var.notification_destination_id}
schedule:
quartz_cron_expression: 0 0 3 * * ? # Daily at 3:00 AM UTC
timezone_id: UTC
pause_status: ${var.schedule}
max_concurrent_runs: 1 # Prevent overlapping runs
parameters:
- name: days_back
default: 2
tasks:
- task_key: zendesk_ingestion_task
job_cluster_key: standard_f4_job_cluster
max_retries: 2
timeout_seconds: 7200 # 2 hours
libraries:
- whl: ../../ingestion_lib/dist/ingestion_lib-*.whl
spark_python_task:
python_file: ../../pipelines/zendesk_ingestion.py
source: WORKSPACE
parameters: [--days_back, '{{job.parameters.days_back}}']
queue:
enabled: true
The ingestion code ships as a wheel together with its Python dependencies. When we run the job, it fails on the import of rest_api.
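For context, the spark_python_task entry point is shaped roughly like this (simplified; run_zendesk_ingestion and get_date_range are helpers inside our wheel, as the traceback shows, and their exact signatures are illustrative here):

import argparse
import logging

from ingestion_lib.external_sources.zendesk_ingestion.runner import run_zendesk_ingestion
from ingestion_lib.utils.zendesk_dates import get_date_range

logging.basicConfig(level=logging.INFO)

parser = argparse.ArgumentParser()
parser.add_argument("--days_back", type=int, default=2)
args = parser.parse_args()

# Map the job parameter onto an explicit backfill window for the dlt source.
# The keyword arguments below are illustrative, not the internal signature.
start_date, end_date = get_date_range(days_back=args.days_back)
run_zendesk_ingestion(start_date=start_date, end_date=end_date)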
Operating system
macOS
Runtime environment
Other
Python version
3.12
dlt data source
zendesk_ticket_management, rest_api_sources
dlt destination
No response
Other deployment details
Databricks
Additional information
No response
@manel-parloa did you also apply the init bash script as part of the "remove preloaded databricks modules in the notebook"?
I didn't do the bash script, but I tried this: https://dlthub.com/docs/dlt-ecosystem/destinations/databricks#2-remove-preloaded-databricks-modules-in-the-notebook and it didn't work. Also, I am not using a notebook; when I tried in a notebook on serverless compute, it works without any issue.
My current implementation is a wheel invoked from the Python file in a Databricks job for orchestration, i.e. using Databricks Asset Bundles.