Versioning does not work for spark.SparkDataSet
Description
Versioning does not work for `spark.SparkDataSet`. It saves the version, but immediately after saving it raises an error saying the version does not exist (although it does exist and can be read manually). I'm a newbie, so I might be doing something wrong, but according to the documentation everything should be correct.
Context
I wanted to save a processed dataset with a new version.
Steps to Reproduce
- Add a node that prepares a PySpark dataset and returns a `spark.SparkDataSet`
- For the returned dataset, specify a filepath such as `filepath: /data/base/result`
- Run the node and get an error
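For reference, a minimal catalog entry matching the steps above (the dataset name is illustrative; the filepath and save args are taken from the error message below):

```yaml
# conf/base/catalog.yml -- illustrative entry reproducing the setup
prepared_data:
  type: spark.SparkDataSet
  filepath: /data/base/result
  file_format: parquet
  save_args:
    mode: overwrite
  versioned: true
```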
Expected Result
The run continues after saving the dataset version.
Actual Result
```
VersionNotFoundError: Did not find any versions for SparkDataSet(file_format=parquet, filepath=/data/inc/.../result, load_args={}, save_args={'mode': overwrite}, version=Version(load=None, save='2022-08-22T18.30.55.332Z'))
```
Your Environment
- Kedro version used (pip show kedro or kedro -V): 0.18.2
- Python version used (python -V): 3.7.9
- Operating system and version: Windows 10 Home
I'm also having this issue, in my case when saving to S3. I think it's due to the way the `SparkDataSet` sets its `glob_function`: in the case of `s3://` paths it will be left as `None` and glob the local FS for the versioned files. I suspect it should be using `get_protocol_and_path` like the `pandas.ParquetDataSet` does.
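To make the suggestion concrete, here is a small self-contained sketch of the kind of protocol/path split that `get_protocol_and_path` performs. The helper name is hypothetical (not Kedro's actual implementation, which goes through fsspec); it just illustrates why an `s3://` filepath should be globbed on S3 rather than on the local filesystem:

```python
from urllib.parse import urlsplit

def split_protocol_and_path(filepath: str):
    """Hypothetical helper illustrating what Kedro's get_protocol_and_path
    does: separate the fsspec protocol from the path so that versioned
    saves glob the correct filesystem."""
    parsed = urlsplit(filepath)
    # No scheme (or a single letter, i.e. a Windows drive) means a local path.
    if not parsed.scheme or len(parsed.scheme) == 1:
        return "file", filepath
    return parsed.scheme, parsed.netloc + parsed.path

# An s3:// path resolves to the "s3" protocol, so version lookups
# should go through the S3 filesystem, not the local one:
print(split_protocol_and_path("s3://my-bucket/data/result"))

# A plain local path falls back to the "file" protocol:
print(split_protocol_and_path("/data/base/result"))
```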
Thanks for reporting this! We'll take this into our sprint work, but we'd also be happy to accept a PR for this 🙂
Hi @Spectren, I've tried this out and versioned `SparkDataSet` seems to be working fine for saving and loading datasets locally. You might want to check out https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html to make sure PySpark is set up correctly.
As for S3, @alamastor, versioned `SparkDataSet` also seems to be working. This might be related to permission issues with your AWS credentials (see related issue: #1768). Kedro shows a `VersionNotFoundError` when your credentials don't have sufficient permission to read/write/list the associated objects, even if the version of the dataset exists in the store. We've updated the error message (#1881).
Closing this issue but feel free to re-open if this is not resolved. :)
@jmholzer confirmed this is still an issue on Azure Databricks.
Closing this in favor of https://github.com/kedro-org/kedro-plugins/issues/117, https://github.com/kedro-org/kedro/issues/2323 and https://github.com/kedro-org/kedro-plugins/pull/114
I am quite confident this should work now; we've added a warning and improved the documentation for using it correctly with Databricks.
Since this issue mixed many different problems (i.e. permission issues with S3, incorrect paths on DBFS, etc.), if there are still problems, feel free to open a new issue.
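For anyone landing here from Databricks: the path scheme matters for versioning. A sketch of a catalog entry following the updated documentation (the dataset name and mount path are illustrative) — note the `/dbfs/` FUSE-mount prefix rather than a `dbfs:/` URI:

```yaml
# conf/base/catalog.yml -- illustrative entry for a versioned
# SparkDataSet on Databricks; the /dbfs/ prefix lets versioning
# see the files through the local filesystem mount
processed_data:
  type: spark.SparkDataSet
  filepath: /dbfs/mnt/my-data/result
  file_format: parquet
  versioned: true
```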