
String interpolation in `ManagedTableDataSet` leads to error on Databricks

Open PetitLepton opened this issue 2 years ago • 1 comments

Description & context

The upsert method in `ManagedTableDataSet` uses string interpolation to pass the table name into the SQL query, see here. This is not regular Python interpolation (an f-string, for example) but an internal PySpark mechanism that substitutes a variable set in the Spark configuration.

For a reason I don't understand, this leads to a strange bug on Databricks: the interpolation is incorrect when the table name contains `hourl` (sic). Below is the result of a minimal example illustrating the issue:

[Screenshot: Databricks notebook, 2023-10-05, showing the output of the minimal example]

I don't know why this interpolation mechanism was used. If you think we could replace it with f-string interpolation, I can open a PR to that effect.

Thanks in advance!

Steps to Reproduce

On Databricks, try the following

full_table_location = "`my_catalog`.`my_schema`.`my_table_hourl`"
spark.conf.set("fullTableName", full_table_location)
spark.sql("SELECT * FROM ${fullTableName} LIMIT 1").display()
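For comparison, here is a minimal sketch of the f-string alternative suggested above. It is not the current `ManagedTableDataSet` code: the helper names `quote_identifier` and `build_select_query` are hypothetical and only illustrate building the query in plain Python, so that no Spark configuration variable is involved. It assumes identifiers are quoted with backticks, with embedded backticks doubled.

```python
def quote_identifier(part: str) -> str:
    """Wrap one identifier part in backticks, doubling any embedded backtick.

    Hypothetical helper for illustration, not part of kedro-datasets.
    """
    return "`" + part.replace("`", "``") + "`"


def build_select_query(catalog: str, schema: str, table: str, limit: int = 1) -> str:
    """Build the SELECT statement with plain Python f-string interpolation."""
    full_table_location = ".".join(
        quote_identifier(part) for part in (catalog, schema, table)
    )
    # int() guards against anything other than a number ending up in LIMIT.
    return f"SELECT * FROM {full_table_location} LIMIT {int(limit)}"


query = build_select_query("my_catalog", "my_schema", "my_table_hourl")
print(query)

# With an active SparkSession named `spark` (as in a Databricks notebook),
# the query would then be executed with:
# spark.sql(query).display()
```

Because the table name is resolved before the string reaches `spark.sql`, the result does not depend on how Spark or Databricks substitutes `${...}` variables.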

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used: 0.18.12
  • Kedro datasets: 1.5.3
  • Python version: 3.10
  • Operating system and version: Databricks

PetitLepton avatar Oct 05 '23 14:10 PetitLepton

Thanks @PetitLepton, I'm tagging this as a bug, but it might take us some time to triage it. Please bear with us in the meantime.

astrojuanlu avatar Oct 06 '23 09:10 astrojuanlu

Is this a valid way to do interpolation? It works in config, but the example here is pure Python code. Could you use an f-string instead?

noklam avatar Oct 03 '24 13:10 noklam