ml_save() and ml_load() not working in Databricks: [PATH_NOT_FOUND] Path does not exist: dbfs:/path/model/metadata/part-00000.
I have been trying to use ml_save() and ml_load() in Databricks (cluster-save/cluster-load), but I keep getting this error:
[PATH_NOT_FOUND] Path does not exist dbfs:/path/model/metadata/part-00000.
Here is some code that recreates the issue:
suppressMessages(library(sparklyr))
suppressMessages(library(dplyr))
suppressMessages(library(arrow))
sc <- spark_connect(method = "databricks")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
model <- iris_tbl %>%
  ml_linear_regression(Sepal_Length ~ Sepal_Width + Petal_Length + Petal_Width)
ml_save(model, path = "dbfs:/path/model")
loaded_model <- ml_load(sc, path = "dbfs:/path/model")
Error: org.apache.spark.sql.AnalysisException: [PATH_NOT_FOUND] Path does not exist: dbfs:/path/model/metadata/part-00000. SQLSTATE: 42K03
Looking at the dbfs:/path/model/metadata/ directory, I see this:
_SUCCESS
_committed_6424082946698911525
_started_6424082946698911525
part-00000-tid-6424082946698911525-1effc116-480f-4584-9255-768eb599e04d-63-1-c000.txt
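This directory listing points at the likely cause: the Databricks commit protocol writes the metadata part file under a long "part-00000-tid-...-c000.txt" name rather than the plain "part-00000" the loader looks up. A minimal sketch of matching the part file by prefix instead of by exact name (pure R; find_part_file() is an illustrative helper, not a sparklyr function):

```r
# Illustrative sketch (not sparklyr internals): pick the metadata part
# file by its "part-00000" prefix instead of requiring an exact name.
find_part_file <- function(files) {
  hits <- grep("^part-00000", files, value = TRUE)
  if (length(hits) == 0) stop("no part-00000* file found in metadata dir")
  hits[[1]]
}

# Directory listing copied from dbfs:/path/model/metadata/ above
files <- c(
  "_SUCCESS",
  "_committed_6424082946698911525",
  "_started_6424082946698911525",
  "part-00000-tid-6424082946698911525-1effc116-480f-4584-9255-768eb599e04d-63-1-c000.txt"
)

find_part_file(files)
# "part-00000-tid-6424082946698911525-1effc116-480f-4584-9255-768eb599e04d-63-1-c000.txt"
```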
It seems like the ml_load() function is hardcoded to look specifically for "/metadata/part-00000", although perhaps I am grossly misunderstanding that function (screenshot taken from the main branch of this repo):
I have looked at this: https://github.com/sparklyr/sparklyr/issues/1088
Folks there seemed to have similar issues, but it sounds like that was resolved and cluster-save/cluster-load should be feasible. Am I missing something here?
Hi! Can you make sure you're running the latest version of sparklyr? I fixed that in the 1.9.0 release: https://github.com/sparklyr/sparklyr/commit/bf80bb9991b617c1445d57f674a226f9c9d2df1a
@edgararuiz - Thanks for getting back to me. Despite using the latest version of sparklyr (1.9.0), I am still getting an error. The error is slightly different this time:
Looking at the dbfs:/path/model/metadata/ directory, I see this:
_SUCCESS
_committed_6299138540998062560
_started_6299138540998062560
part-00000-tid-6299138540998062560-ea7d1d08-e660-4c71-831e-97a6c5222068-119-1-c000.txt
The same code reproduces the issue.
Just for confirmation, here is a screen shot of my sparklyr version:
%r
install.packages("sparklyr")
library(sparklyr)
library(dplyr)
library(arrow)
packageVersion("sparklyr")
## 1.9.0
sc <- spark_connect(method = "databricks")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
model <- iris_tbl %>%
  ml_linear_regression(Sepal_Length ~ Sepal_Width + Petal_Length + Petal_Width)
ml_save(model, path = "dbfs:/path/model", overwrite = TRUE)
loaded_model <- ml_load(sc, path = "dbfs:/path/model")
Ok, what does this return in your R session?
list.files("dbfs:/path/model/metadata/")
@edgararuiz - hm interesting. Running list.files("dbfs:/path/model/metadata/") returns this:
It seems to think that it is not a directory at all:
But if I use the Databricks CLI I am able to list all of the contents of that metadata directory:
Let me know if I can provide anything else here.
Ok, thank you. What does simply running list.files() return?
@edgararuiz - Running list.files() returns all of the notebooks/files/folders in the Workspace directory of the notebook that I ran list.files() from.
I called list.files() (in an R cell) from the test_notebook at this path: Workspace/Users/myuser/Adhoc/test_notebook.ipynb
And it listed all of the notebooks/files/folders in the Adhoc folder:
Ok, is path listed? And if so, is path/model also listed?
No, neither "path" nor "path/model" is listed. It is simply whatever folders, notebooks, and files I have in my Workspace under the Adhoc folder.
Found them: they are in the /dbfs/ subfolder, which is an issue because Spark will not let me read that relative path. Working on a solution now.
You should see it under /dbfs/path/model/
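For context, and assuming standard Databricks behavior rather than anything specific to this thread: DBFS is exposed to local processes through a FUSE mount at the filesystem root, /dbfs, so plain R file functions like list.files() need the POSIX form of the path, not the dbfs:/ URI. A minimal sketch of translating between the two (dbfs_to_local() is an illustrative helper, not part of sparklyr):

```r
# Illustrative sketch: convert a dbfs:/ URI to the /dbfs FUSE mount path
# so that local R file functions (list.files, file.exists) can see it.
# dbfs_to_local() is not a sparklyr function.
dbfs_to_local <- function(path) {
  sub("^dbfs:/+", "/dbfs/", path)
}

dbfs_to_local("dbfs:/path/model/metadata/")
# "/dbfs/path/model/metadata/"
```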
So you are saying I should see a /dbfs/ subfolder in my Adhoc directory? I don't see that folder anywhere, not in the UI or the list.files() command.
I can only see the path/model if I use the databricks CLI and do something like this:
Is that going to be an issue?
Yeah, I went ahead and revamped how ml_load() finds and reads the metadata. Before, it used regular R file reads to pull the info from the metadata, and then it read the pipeline via Spark. Now it reads both the metadata and the pipeline using the Spark context.
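A rough sketch of the idea described above, not the actual sparklyr implementation: derive the metadata location from the model path, then read it through the Spark session (for example with sparklyr's spark_read_text()) so the dbfs:/ scheme is resolved by Spark rather than by the driver's local filesystem. The metadata_path() helper is illustrative:

```r
# Illustrative helper (not sparklyr internals): derive the metadata
# directory from the model path.
metadata_path <- function(model_path) {
  file.path(model_path, "metadata")
}

metadata_path("dbfs:/path/model")
# "dbfs:/path/model/metadata"

# On a Databricks cluster, the metadata could then be read through the
# Spark session itself rather than via local R file I/O, e.g.:
#   sc <- spark_connect(method = "databricks")
#   meta <- spark_read_text(sc, path = metadata_path("dbfs:/path/model"))
```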
Would you mind trying the solution by installing the updates branch:
devtools::install_github("sparklyr/sparklyr@updates")
suppressMessages(library(sparklyr))
sc <- spark_connect(method = "databricks")
ml_load(sc, "dbfs:/path/model")
@edgararuiz - That works! Everything looks great now.
Hi! Just a quick follow up. CRAN just accepted the new version of sparklyr, 1.9.1. This version contains the fix for this issue.
@edgararuiz - thank you for all of the help here. Confirmed, all looks good on our end.