
Unable to specify save format for SparkHiveDataSet

Open jstammers opened this issue 2 years ago • 6 comments

Description

The implementation of SparkHiveDataSet allows the user to specify additional save arguments. This should make it possible to save a Delta table, which in PySpark is done with the following code:

table = spark.table(...)
table.write.saveAsTable("db.table", format='delta')

The same should be achievable with the following catalog configuration:

table:
  type: spark.SparkHiveDataSet
  database: db
  table: table
  write_mode: overwrite
  save_args:
    format: delta

However, this raises a DataSetError. The SparkHiveDataSet constructor reads the format from save_args (https://github.com/kedro-org/kedro/blob/805da3279e4ff2228094aabe9adfb05d9deebd85/kedro/extras/datasets/spark/spark_hive_dataset.py#L117) but, if the key exists, does not remove it from self._save_args.

The error is raised when creating the Hive table, because two arguments named format are passed: https://github.com/kedro-org/kedro/blob/805da3279e4ff2228094aabe9adfb05d9deebd85/kedro/extras/datasets/spark/spark_hive_dataset.py#L144-L151
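The collision can be reproduced without Spark. The snippet below is only a sketch: save_as_table is a hypothetical stand-in for a writer method that accepts format as a keyword, and save_with_duplicate_format is a made-up name for the pattern described above (read the key with get, then forward the unchanged dictionary). It is not the Kedro code itself.

```python
def save_as_table(name, mode=None, format=None, **options):
    """Hypothetical stand-in for a writer method that takes `format` as a keyword."""
    return {"name": name, "mode": mode, "format": format, "options": options}


def save_with_duplicate_format(save_args):
    """Mimics the reported pattern: `format` is read but left in save_args."""
    fmt = save_args.get("format", "hive")  # read, but not removed
    # `format` is now supplied twice: explicitly and again through **save_args.
    return save_as_table("db.table", mode="overwrite", format=fmt, **save_args)


save_args = {"format": "delta"}  # as it would arrive from the catalog entry

try:
    save_with_duplicate_format(save_args)
    error_message = None
except TypeError as exc:
    # Python rejects the call before the function body runs.
    error_message = str(exc)

print(error_message)
```

Python refuses a call that receives the same keyword twice, so the failure happens before any table is written. In the dataset this would presumably surface wrapped in a DataSetError, as reported.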

Context

I am trying to save a table in the delta format, which is possible with the PySpark API but currently not supported by SparkHiveDataSet. With the current implementation, the only supported format is the 'hive' default.

A possible solution would be to pop the 'format' value if it exists in save_args, e.g.

self._format = self._save_args.pop("format", "hive")  # returns "hive" if "format" is not in self._save_args
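As a rough illustration of how that one-line change behaves, here is a heavily reduced sketch. HiveDataSetSketch, writer_kwargs and the some_option key are hypothetical names used only for this example; the real class has more parameters and logic. The dictionary is copied first, on the assumption that the caller's save_args should not be mutated by pop.

```python
from copy import deepcopy
from typing import Any, Dict, Optional


class HiveDataSetSketch:
    """Hypothetical, reduced stand-in for SparkHiveDataSet (illustration only)."""

    def __init__(
        self,
        database: str,
        table: str,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._full_table_address = f"{database}.{table}"
        # Copy so that pop() below does not alter the caller's dictionary.
        self._save_args = deepcopy(save_args) if save_args else {}
        # pop() returns the value and removes the key; "hive" is the fallback.
        self._format = self._save_args.pop("format", "hive")

    def writer_kwargs(self) -> Dict[str, Any]:
        """Keyword arguments that would be forwarded to the writer."""
        # `format` appears exactly once because it was popped from _save_args.
        return {"format": self._format, **self._save_args}


user_args = {"format": "delta", "some_option": "value"}  # some_option is made up
delta = HiveDataSetSketch("db", "table", save_args=user_args)
default = HiveDataSetSketch("db", "table")

print(delta.writer_kwargs())
print(default.writer_kwargs())
```

With the key popped, format is passed once regardless of whether the user set it, and the remaining save arguments are forwarded untouched.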

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.18.0
  • Python version used (python -V): 3.9
  • Operating system and version: ubuntu 18.04

jstammers, May 13 '22 15:05

Hi @jstammers, have you tried using the DeltaTableDataSet?

merelcht, May 16 '22 12:05

Hi @jstammers do you still need help with this?

merelcht, Jun 20 '22 12:06

Hi @MerelTheisenQB, I've been able to save my dataset as a delta table using the DeltaTableDataSet.

If you think this functionality should be available using the HiveDataSet class, I'd be happy to submit a PR that implements the change I proposed above.

Otherwise, please feel free to close this issue and thanks for the help

jstammers, Jun 22 '22 07:06

Hi @jstammers, what differences between DeltaTableDataSet and SparkHiveDataSet make you want this functionality as part of SparkHiveDataSet?

merelcht, Jul 11 '22 13:07

Hi @MerelTheisenQB, the main difference is that we have upstream processes that insert data using spark.sql, e.g.

spark.sql("Insert into facts.volume")

which means that accessing that data using SparkHiveDataSet is more convenient. In this example, the data are saved at /user/hive/facts.db/volume.

In my current use case, I intend to use this across multiple projects where the data structure will be the same but the underlying file locations will differ. I expect it will be easier to handle this through the Hive metastore than by parameterising the base file location, but I am happy to hear otherwise.

jstammers, Jul 18 '22 10:07

Thanks for clarifying @jstammers, that makes sense. It sounds very reasonable to me to add the saving as delta table functionality to the SparkHiveDataSet, so you're more than welcome to open a PR for it 🙂 And of course reach out here or on our Discord channel if you need any help.

merelcht, Jul 25 '22 10:07

@MerelTheisenQB I am not very familiar with the differences between the Spark options, but this looks like a pure implementation bug to me.

See https://github.com/quantumblacklabs/private-kedro/pull/1083/files (in the old private repo). save_args was added specifically to support more formats.

I am more confident about this after skimming through the commit history of the PR. See this commit: https://github.com/quantumblacklabs/private-kedro/pull/1083/commits/443c8b0bf0ada48ff9d3ae2685ab0fc9d1ab7851

noklam, Sep 26 '22 10:09