
ml_save() and ml_load() not working in Databricks: [PATH_NOT_FOUND] Path does not exist dbfs:/path/model/metadata/part-00000.

henryryan17 opened this issue 1 year ago • 12 comments

I have been trying to use ml_save() and ml_load() in Databricks (cluster-save/cluster-load), but continually get this error:

[PATH_NOT_FOUND] Path does not exist dbfs:/path/model/metadata/part-00000.

Here is some code that recreates the issue:

suppressMessages(library(sparklyr)) 
suppressMessages(library(dplyr)) 
suppressMessages(library(arrow)) 

sc <- spark_connect(method = "databricks")

iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

model <- iris_tbl %>%
  ml_linear_regression(Sepal_Length ~ Sepal_Width + Petal_Length + Petal_Width)


ml_save(model, path = "dbfs:/path/model")

loaded_model <- ml_load(sc, path = "dbfs:/path/model")

Error: org.apache.spark.sql.AnalysisException: [PATH_NOT_FOUND] Path does not exist: dbfs:/path/model/metadata/part-00000. SQLSTATE: 42K03

Looking at the dbfs:/path/model/metadata/ directory, I see this:

_SUCCESS
_committed_6424082946698911525
_started_6424082946698911525
part-00000-tid-6424082946698911525-1effc116-480f-4584-9255-768eb599e04d-63-1-c000.txt

It seems like the ml_load() function is hardcoded to look specifically for "/metadata/part-00000", though perhaps I am grossly misunderstanding that function (screenshot taken from the main branch of this repo):

[screenshot of the ml_load() source]
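To illustrate the suspected failure mode (this is a minimal sketch, not sparklyr's actual code): the Databricks commit protocol appends a transaction id and suffix to part files, so looking for the exact name "part-00000" fails where a prefix match would succeed. The file names below are the hypothetical ones from the listing above.

```r
# File names as written by the Databricks commit protocol (see listing above)
files <- c(
  "_SUCCESS",
  "_committed_6424082946698911525",
  "_started_6424082946698911525",
  "part-00000-tid-6424082946698911525-1effc116-480f-4584-9255-768eb599e04d-63-1-c000.txt"
)

"part-00000" %in% files                    # FALSE: no file has that exact name
grep("^part-00000", files, value = TRUE)   # a prefix match finds the part file
```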

I have looked at this: https://github.com/sparklyr/sparklyr/issues/1088

Folks there had similar issues, but it sounds like that was resolved and cluster-save/cluster-load should be feasible. Am I missing something here?

henryryan17 avatar Feb 14 '25 20:02 henryryan17

Hi! Can you make sure you're running the latest version of sparklyr? I fixed that in the 1.9.0 release: https://github.com/sparklyr/sparklyr/commit/bf80bb9991b617c1445d57f674a226f9c9d2df1a

edgararuiz avatar Jun 12 '25 19:06 edgararuiz

@edgararuiz - Thanks for getting back to me. Despite using the latest version of sparklyr (1.9.0), I am still getting an error. The error is slightly different this time:

[screenshot of the new error]

Looking at the dbfs:/path/model/metadata/ directory, I see this:

_SUCCESS
_committed_6299138540998062560
_started_6299138540998062560
part-00000-tid-6299138540998062560-ea7d1d08-e660-4c71-831e-97a6c5222068-119-1-c000.txt

The same code reproduces the issue.

Just for confirmation, here is a screenshot of my sparklyr version:

[screenshot showing sparklyr 1.9.0]

%r
install.packages("sparklyr")
library(sparklyr)

library(dplyr)
library(arrow)

packageVersion("sparklyr")
## 1.9.0

sc <- spark_connect(method = "databricks")

iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

model <- iris_tbl %>%
  ml_linear_regression(Sepal_Length ~ Sepal_Width + Petal_Length + Petal_Width)

ml_save(model, path = "dbfs:/path/model", overwrite = TRUE)

loaded_model <- ml_load(sc, path = "dbfs:/path/model")

henryryan17 avatar Jun 12 '25 20:06 henryryan17

Ok, what does this return in your R session?

list.files("dbfs:/path/model/metadata/")

edgararuiz avatar Jun 13 '25 05:06 edgararuiz

@edgararuiz - Hm, interesting. Running list.files("dbfs:/path/model/metadata/") returns this:

[screenshot of the list.files() output]

It seems to think that it is not a directory at all:

[screenshot of the directory check]

But if I use the Databricks CLI, I am able to list all of the contents of that metadata directory:

[screenshot of the Databricks CLI listing]

Let me know if I can provide anything else here.

henryryan17 avatar Jun 16 '25 11:06 henryryan17

Ok, thank you. What does simply running list.files() return?

edgararuiz avatar Jun 16 '25 14:06 edgararuiz

@edgararuiz - Running list.files() returns all of the notebooks/files/folders in the Workspace directory of the notebook that I ran list.files() from.

I called list.files() (in an R cell) from the test_notebook at this path: Workspace/Users/myuser/Adhoc/test_notebook.ipynb

And it listed all of the notebooks/files/folders in the Adhoc folder:

[screenshot of the Adhoc folder contents]

henryryan17 avatar Jun 16 '25 14:06 henryryan17

Ok, is "path" listed? And if so, is "path/model" also listed?

edgararuiz avatar Jun 16 '25 15:06 edgararuiz

No, neither "path" nor "path/model" is listed. It simply shows whatever folders, notebooks, and files I have in the Adhoc folder of my Workspace.

henryryan17 avatar Jun 16 '25 15:06 henryryan17

Found them; they are under the /dbfs/ mount, which is an issue because Spark will not let me read that relative path. Working on a solution now.

You should see it under /dbfs/path/model/
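For anyone hitting this in the meantime, here is a sketch of the path distinction (assuming a standard Databricks setup): dbfs:/ URIs are resolved by Spark on the cluster, while base-R file functions only see DBFS through the local /dbfs/ FUSE mount. The helper name below is hypothetical, not a sparklyr function.

```r
# Hypothetical helper: map a Spark dbfs:/ URI onto the driver's /dbfs/ FUSE
# mount, where base-R functions like list.files() can actually see the files
dbfs_to_fuse <- function(path) sub("^dbfs:/", "/dbfs/", path)

dbfs_to_fuse("dbfs:/path/model/metadata")
# "/dbfs/path/model/metadata" -- on a Databricks cluster, list.files() on this
# mounted path should return the directory contents
```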

edgararuiz avatar Jun 16 '25 16:06 edgararuiz

So you are saying I should see a /dbfs/ subfolder in my Adhoc directory? I don't see that folder anywhere, not in the UI or the list.files() command.

I can only see the path/model if I use the databricks CLI and do something like this:

[screenshot of the Databricks CLI command]

Is that going to be an issue?

henryryan17 avatar Jun 16 '25 16:06 henryryan17

Yeah, I went ahead and revamped how ml_load() finds and reads the metadata. Before, it used regular R file I/O to read the metadata and then read the pipeline via Spark. Now it reads both the metadata and the pipeline through the Spark context.
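A rough sketch of the approach described above, assuming a live connection `sc` (the exact internals of the fix may differ): reading the metadata as a Spark data source means the dbfs:/ URI is resolved on the cluster, not against the driver's local filesystem, so it works regardless of whether a FUSE mount is visible to R.

```r
# Sketch only; sparklyr's actual implementation may differ
library(sparklyr)
sc <- spark_connect(method = "databricks")

# Read the model metadata as a Spark text source instead of via R file I/O,
# so the dbfs:/ path resolves through the cluster's filesystem
meta_tbl <- spark_read_text(sc, name = "model_meta",
                            path = "dbfs:/path/model/metadata")
```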

Would you mind trying the fix by installing the updates branch?

devtools::install_github("sparklyr/sparklyr@updates")
suppressMessages(library(sparklyr)) 
sc <- spark_connect(method = "databricks")
ml_load(sc, "dbfs:/path/model")

edgararuiz avatar Jun 16 '25 19:06 edgararuiz

@edgararuiz - That works! Everything looks great now.

henryryan17 avatar Jun 16 '25 19:06 henryryan17

Hi! Just a quick follow-up: CRAN just accepted the new version of sparklyr, 1.9.1, which contains the fix for this issue.

edgararuiz avatar Jun 30 '25 21:06 edgararuiz

@edgararuiz - thank you for all of the help here. Confirmed, all looks good on our end.

henryryan17 avatar Jul 08 '25 12:07 henryryan17