
DBT python model passing date variables with different data types

Open tade0726 opened this issue 1 year ago • 4 comments

Describe the feature

I noticed that date variables arrive in the Python notebook with two different data types.

The Python notebook mentioned here is the notebook that dbt-databricks submits to Databricks to run Python models. It contains a config object that is generated from the vars configured in dbt_project.yml.

As in:


# This part is user provided model code
# you will need to copy the next section to run the code
# COMMAND ----------
# this part is dbt logic for get ref work, do not modify

import datetime  # needed by config_dict below


def ref(*args, **kwargs):
    refs = {}
    key = '.'.join(args)
    version = kwargs.get("v") or kwargs.get("version")
    if version:
        key += f".v{version}"
    dbt_load_df_function = kwargs.get("dbt_load_df_function")
    return dbt_load_df_function(refs[key])


def source(*args, dbt_load_df_function):
    sources = {"prod_staging.ciqfininstance": "`savana_prod`.`dbt_staging`.`ciqfininstance`"}
    key = '.'.join(args)
    return dbt_load_df_function(sources[key])


config_dict = {'start_date': datetime.date(2018, 7, 1), 'end_date': ''}


class config:
    def __init__(self, *args, **kwargs):
        pass

    @staticmethod
    def get(key, default=None):
        return config_dict.get(key, default)


class this:
    """dbt.this() or dbt.this.identifier"""
    database = "savana_prod"
    schema = "dbt_gold_pipelines"
    identifier = "int_date_table"

    def __repr__(self):
        return '`savana_prod`.`dbt_gold_pipelines`.`int_date_table`'


class dbtObj:
    def __init__(self, load_df_function) -> None:
        self.source = lambda *args: source(*args, dbt_load_df_function=load_df_function)
        self.ref = lambda *args, **kwargs: ref(*args, **kwargs, dbt_load_df_function=load_df_function)
        self.config = config
        self.this = this()
        self.is_incremental = False

# COMMAND ----------

# how to execute python model in notebook
# dbt = dbtObj(spark.table)
# df = model(dbt, spark)
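To illustrate the issue, here is a stripped-down sketch of the template's config lookup (a hypothetical minimal reproduction, not the generated code itself): `config.get` simply returns whatever type was serialized into `config_dict`, so the model sees a `datetime.date` in one case and a `str` in the other.

```python
import datetime

# Minimal reproduction of the generated template's config lookup.
# The values mirror the generated config_dict shown above.
config_dict = {'start_date': datetime.date(2018, 7, 1), 'end_date': ''}

class config:
    @staticmethod
    def get(key, default=None):
        return config_dict.get(key, default)

# The model receives whatever type dbt serialized into config_dict:
print(type(config.get("start_date")))  # <class 'datetime.date'> (var set in dbt_project.yml)
print(type(config.get("end_date")))    # <class 'str'>
```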

Case 1:

When a date variable is set in dbt_project.yml, the Python notebook in Databricks receives it as a datetime.date object.

As in the config Python dictionary:

{"start_date": datetime.date(2021, 1, 1)}

Case 2:

When a date variable is passed with --vars "{'start_date': '2021-01-01'}" on the dbt command line, it arrives as a plain string:

{"start_date": '2021-01-01'}
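Until the types are consistent, each model ends up carrying a small coercion helper along these lines (an illustrative sketch, not part of dbt-databricks):

```python
import datetime

def as_date(value):
    """Coerce a dbt var to datetime.date, whether it arrived as a
    date object (set in dbt_project.yml) or as an ISO-format string
    (passed via --vars on the CLI)."""
    if isinstance(value, datetime.date):
        return value
    if isinstance(value, str) and value:
        return datetime.date.fromisoformat(value)
    raise TypeError(f"cannot interpret {value!r} as a date")

# Both arrival forms normalize to the same value:
assert as_date(datetime.date(2021, 1, 1)) == as_date("2021-01-01")
```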

Describe alternatives you've considered

I suggest standardizing on a string for the date value.

Additional context

I am implementing an integration with Dagster + Databricks cloud + dbt, and I need both kinds of configuration to yield the same variable type; otherwise, the logic to reconcile the difference has to be copied and pasted into every Python model.

That is because dbt Python models cannot currently share abstracted helper functions.

Who will this benefit?

It removes the need to write extra logic to distinguish the date types.

Integration with Dagster may depend on the --vars CLI flag to pass configuration, but it would be easier to debug the dbt project if a variable passed on the CLI behaved identically to the same variable set in the yml file.

Are you interested in contributing to this feature?

Yes, I would love to.

tade0726 avatar Feb 09 '24 07:02 tade0726

@tade0726 you can ensure that with a small code adjustment to case 1:

instead of

{"start_date": datetime.date(2021, 1, 1)}

write

{"start_date": datetime.date(2021, 1, 1).strftime("%d.%m.%Y")}

see also: Python Dates
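For reference, the strftime call above and its strptime inverse round-trip cleanly, while date.fromisoformat parses the ISO form that --vars delivers (a small sketch using only the standard datetime module; the "%d.%m.%Y" format is the one from the suggestion above):

```python
import datetime

d = datetime.date(2021, 1, 1)
s = d.strftime("%d.%m.%Y")  # format the date as a string: "01.01.2021"
assert datetime.datetime.strptime(s, "%d.%m.%Y").date() == d  # parse it back

# the --vars form is ISO 8601, which date.fromisoformat parses directly
assert datetime.date.fromisoformat("2021-01-01") == d
```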

leo-schick avatar Feb 12 '24 07:02 leo-schick

Hi @leo-schick ,

It is not related to a user-defined dictionary. It is the config template generated by dbt-databricks.

I also updated the issue description for clarity.

See example below:

# This part is user provided model code
# you will need to copy the next section to run the code
# COMMAND ----------
# this part is dbt logic for get ref work, do not modify

import datetime  # needed by config_dict below

def ref(*args, **kwargs):
    refs = {}
    key = '.'.join(args)
    version = kwargs.get("v") or kwargs.get("version")
    if version:
        key += f".v{version}"
    dbt_load_df_function = kwargs.get("dbt_load_df_function")
    return dbt_load_df_function(refs[key])


def source(*args, dbt_load_df_function):
    sources = {"prod_staging.ciqfininstance": "`savana_prod`.`dbt_staging`.`ciqfininstance`"}
    key = '.'.join(args)
    return dbt_load_df_function(sources[key])


config_dict = {'start_date': datetime.date(2018, 7, 1), 'end_date': ''}


class config:
    def __init__(self, *args, **kwargs):
        pass

    @staticmethod
    def get(key, default=None):
        return config_dict.get(key, default)

class this:
    """dbt.this() or dbt.this.identifier"""
    database = "savana_prod"
    schema = "dbt_gold_pipelines"
    identifier = "int_date_table"
    
    def __repr__(self):
        return '`savana_prod`.`dbt_gold_pipelines`.`int_date_table`'


class dbtObj:
    def __init__(self, load_df_function) -> None:
        self.source = lambda *args: source(*args, dbt_load_df_function=load_df_function)
        self.ref = lambda *args, **kwargs: ref(*args, **kwargs, dbt_load_df_function=load_df_function)
        self.config = config
        self.this = this()
        self.is_incremental = False

# COMMAND ----------

# how to execute python model in notebook
# dbt = dbtObj(spark.table)
# df = model(dbt, spark)

tade0726 avatar Feb 12 '24 07:02 tade0726

@tade0726 Please go ahead and submit a PR :). I agree that it makes most sense for the data type to be consistent regardless of source.

benc-db avatar Feb 12 '24 16:02 benc-db