dbt-databricks

Can I use a dbt Python model to load S3 data as a dataframe in Databricks?

Open gaoshihang opened this issue 1 year ago • 2 comments

Hello, our requirement is as follows, and we would like to seek your advice. We want to use dbt to connect to Databricks. The source data is on S3, and we want to use PySpark to load it into a DataFrame first (the 'E' part of ETL), do some transformations on it, and then write it to a Databricks table.

The code for PySpark looks something like this:

origin_df = spark.read.parquet(*parquetFileList)  # unpack the list of S3 parquet paths
origin_df.createOrReplaceTempView("origin_table")
final_df = spark.sql("select ...... from origin_table ......")

Can we do these things in a dbt Python model?

gaoshihang · Jan 07 '24

I believe that as long as you return a dataframe, the dbt adapter will handle it. If your Spark code works in a Databricks notebook, I believe it should work with the adapter.
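
For reference, a minimal sketch of what that could look like as a dbt Python model on Databricks. The file name, S3 path, and transform below are placeholders, and the cluster needs read access to the bucket:

# models/load_from_s3.py (hypothetical model file)
def model(dbt, session):
    # materialize the returned DataFrame as a table in Databricks
    dbt.config(materialized="table")

    # session is the SparkSession dbt passes in; the S3 path is a placeholder
    origin_df = session.read.parquet("s3://my-bucket/raw/events/")

    # transform with Spark SQL, just like in a notebook
    origin_df.createOrReplaceTempView("origin_table")
    final_df = session.sql("select * from origin_table")

    # the adapter writes whatever DataFrame is returned to the model's table
    return final_df

It runs with dbt run like any other model, and downstream models can reference it with ref().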

benc-db · Jan 10 '24

@benc-db Thanks for your reply, we will give it a try!

gaoshihang · Jan 11 '24