Unable to specify save format for SparkHiveDataSet
Description
The implementation for SparkHiveDataSet allows the user to specify additional save arguments. This should enable saving a delta table, which is done with the following pyspark code:
table = spark.table(...)
table.write.saveAsTable("db.table", format='delta')
This should be replicable with the following configuration:
table:
  type: spark.SparkHiveDataSet
  database: db
  table: table
  write_mode: overwrite
  save_args:
    format: delta
However, this raises a DataSetError because the SparkHiveDataSet constructor reads the format from save_args
https://github.com/kedro-org/kedro/blob/805da3279e4ff2228094aabe9adfb05d9deebd85/kedro/extras/datasets/spark/spark_hive_dataset.py#L117
but, if it exists, does not remove it from self._save_args. The error is raised when creating a hive table, because the write call then receives two arguments named format:
https://github.com/kedro-org/kedro/blob/805da3279e4ff2228094aabe9adfb05d9deebd85/kedro/extras/datasets/spark/spark_hive_dataset.py#L144-L151
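To see the failure mode in isolation, here is a minimal, self-contained sketch; save_as_table below is a hypothetical stand-in for DataFrameWriter.saveAsTable, not the real kedro code:

def save_as_table(name, format="hive", **save_args):
    """Hypothetical stand-in for DataFrameWriter.saveAsTable."""
    print(f"Saving {name} as {format} with extra args {save_args}")

save_args = {"format": "delta"}
fmt = save_args.get("format", "hive")  # reads "format" but leaves it in save_args

# Mirrors the failing call: "format" arrives both explicitly and via **save_args
save_as_table("db.table", format=fmt, **save_args)
# TypeError: save_as_table() got multiple values for keyword argument 'format'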
Context
I am trying to save a table using the delta format, which is possible using the pyspark API but currently not supported by SparkHiveDataSet. With the current implementation, the only supported format is the 'hive' default.
A possible solution would be to pop the 'format' value if it exists in save_args, e.g.
self._format = self._save_args.pop("format", "hive")  # returns "hive" if "format" is not in self._save_args
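For context, a rough sketch of how that could sit in the constructor (abridged and hypothetical; only the save_args handling is shown):

class SparkHiveDataSet:
    def __init__(self, database, table, write_mode, save_args=None):
        self._save_args = dict(save_args) if save_args else {}
        # pop() both reads the format and removes it from save_args, so the
        # later saveAsTable call cannot receive the keyword twice
        self._format = self._save_args.pop("format", "hive")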
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (pip show kedro or kedro -V): 0.18.0
- Python version used (python -V): 3.9
- Operating system and version: Ubuntu 18.04
Hi @jstammers, have you tried using the DeltaTableDataSet?
Hi @jstammers, do you still need help with this?
Hi @MerelTheisenQB, I've been able to save my dataset as a delta table using the DeltaTableDataSet.
If you think this functionality should be available in the SparkHiveDataSet class, I'd be happy to submit a PR that implements the change I proposed above.
Otherwise, please feel free to close this issue, and thanks for the help.
Hi @jstammers, what are the differences for you when using DeltaTableDataSet and SparkHiveDataSet that make you want to have this functionality as part of SparkHiveDataSet?
Hi @MerelTheisenQB, the main difference is that we have upstream processes that insert data using spark.sql, e.g.
spark.sql("Insert into facts.volume")
which means that accessing that data using SparkHiveDataSet is more convenient. In this example, the data are saved at /user/hive/facts.db/volume.
In my current use-case, I am intending to use this across multiple projects where the data structure will be the same but the underlying file locations will be different. I expect it will be easier to handle this using the hive metastore rather than parameterising the base file location, but I'm happy to hear otherwise.
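To make the contrast concrete, here is a hypothetical catalog comparison (the path-based entry uses spark.SparkDataSet with file_format: delta as one way to address the same data by location; the entry names and paths are illustrative):

# Metastore-addressed: resolved via Hive, so no project-specific path is needed
volume:
  type: spark.SparkHiveDataSet
  database: facts
  table: volume

# Path-addressed: the base location has to be parameterised per project
volume_by_path:
  type: spark.SparkDataSet
  filepath: /user/hive/facts.db/volume  # differs across environments
  file_format: delta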
Thanks for clarifying @jstammers, that makes sense. It sounds very reasonable to me to add the save-as-delta-table functionality to the SparkHiveDataSet, so you're more than welcome to open a PR for it 🙂 And of course reach out here or on our Discord channel if you need any help.
@MerelTheisenQB I am not very familiar with the differences between the various Spark options, but this looks like a pure implementation bug to me.
See https://github.com/quantumblacklabs/private-kedro/pull/1083/files (in the old private repo). save_args was added specifically to support more formats.
I'm more confident about this after skimming the commit history of that PR. See this commit: https://github.com/quantumblacklabs/private-kedro/pull/1083/commits/443c8b0bf0ada48ff9d3ae2685ab0fc9d1ab7851