
Versioning does not work for spark.SparkDataSet

Spectren opened this issue

Description

Versioning does not work for spark.SparkDataSet. It saves the version, but immediately after saving it raises an error saying the version does not exist (although it does exist and can be read manually). I'm a newbie, so I might be doing something wrong; however, according to the documentation, everything should be correct.

Context

I wanted to save the processed dataset with the new version

Steps to Reproduce

  1. Add a node that prepares a PySpark dataset and returns it as spark.SparkDataSet
  2. In the catalog entry for the returned dataset, specify a path such as filepath: /data/base/result
  3. Run the node and observe the error
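
The steps above correspond to a catalog entry like the following (the dataset name and paths are illustrative placeholders; `versioned: true` is what triggers the behaviour):

```yaml
# conf/base/catalog.yml -- illustrative entry, names are placeholders
result_dataset:
  type: spark.SparkDataSet
  filepath: /data/base/result
  file_format: parquet
  save_args:
    mode: overwrite
  versioned: true
```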

Expected Result

The pipeline continues to run after saving the dataset version

Actual Result

VersionNotFoundError: Did not find any versions for SparkDataSet(file_format=parquet, filepath=/data/inc/.../result, load_args={}, save_args={'mode': overwrite}, version=Version(load=None, save='2022-08-22T18.30.55.332Z'))
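
For context when checking "by hand": Kedro's versioned datasets store each save under `<filepath>/<version>/<basename of filepath>`, so that is the path the load step tries to find. A minimal sketch of the resolution (the function name here is illustrative, not Kedro's API):

```python
from pathlib import PurePosixPath

def versioned_path(filepath: str, version: str) -> str:
    # Versioned Kedro datasets write each save to
    # <filepath>/<version>/<basename of filepath>.
    p = PurePosixPath(filepath)
    return str(p / version / p.name)

print(versioned_path("/data/base/result", "2022-08-22T18.30.55.332Z"))
# /data/base/result/2022-08-22T18.30.55.332Z/result
```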

Your Environment

  • Kedro version used (pip show kedro or kedro -V): 0.18.2
  • Python version used (python -V): 3.7.9
  • Operating system and version: Windows 10 Home

Spectren avatar Aug 22 '22 19:08 Spectren

I'm also having this issue, in my case when saving to S3. I think it's due to the way SparkDataSet sets its glob_function: for s3:// paths it is left as None, so the local filesystem is globbed for the versioned files. I suspect it should be using get_protocol_and_path like pandas.ParquetDataSet does.
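
To illustrate the idea behind splitting the protocol from the path, here is a simplified re-implementation for illustration only (not Kedro's actual `get_protocol_and_path` code): once the protocol is known, the versioned-file glob should run against the filesystem for that protocol rather than the local one.

```python
from urllib.parse import urlsplit

def split_protocol_and_path(filepath: str):
    # Simplified sketch: split an fsspec-style path into (protocol, path).
    # A path with no scheme is treated as a local-filesystem path.
    parsed = urlsplit(filepath)
    if not parsed.scheme:
        return "file", filepath
    return parsed.scheme, parsed.netloc + parsed.path

print(split_protocol_and_path("s3://my-bucket/data/result"))
# ('s3', 'my-bucket/data/result')
print(split_protocol_and_path("/data/base/result"))
# ('file', '/data/base/result')
```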

alamastor avatar Sep 14 '22 06:09 alamastor

Thanks for reporting this! We'll take this into our sprint work, but we'd also be happy to accept a PR for this 🙂

merelcht avatar Sep 30 '22 13:09 merelcht

Hi @Spectren, I've tried this out and versioned SparkDataSet seems to be working fine for saving and loading datasets locally. You might want to check https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html to make sure PySpark is set up correctly.

As for S3, @alamastor, versioned SparkDataSet also seems to be working. This might be related to permission issues with your AWS credentials (see related issue: #1768). Kedro raises a VersionNotFoundError when your credentials lack sufficient permission to read/write/list the relevant objects, even if the requested version of the dataset exists in the store. We've updated the error message to make this clearer (#1881).

Closing this issue but feel free to re-open if this is not resolved. :)

ankatiyar avatar Oct 26 '22 11:10 ankatiyar

@jmholzer confirmed this is still an issue on Azure Databricks

datajoely avatar Jan 12 '23 17:01 datajoely

Closing this in favor of https://github.com/kedro-org/kedro-plugins/issues/117, https://github.com/kedro-org/kedro/issues/2323 and https://github.com/kedro-org/kedro-plugins/pull/114

I am quite confident this should work now; we've added warnings and improved the documentation on using it correctly with Databricks.

Since this issue mixed many different problems (e.g. permission issues with S3, incorrect paths on DBFS), if you still run into trouble, feel free to open a new issue.

noklam avatar Apr 03 '23 13:04 noklam