master-replica-0 Failed building wheel for cloudml
I've been following the tutorial from here: https://tensorflow.rstudio.com/tools/cloudml/articles/getting_started.html
Submitting a job to Google Cloud that runs fine on my local machine produces the following:
> cloudml_train("R/BuildingNetwork.R")
Submitting training job to CloudML...
Job 'cloudml_2018_02_27_220654435' successfully submitted.
View job in the Cloud Console at:
https://console.cloud.google.com/ml/jobs/cloudml_2018_02_27_220654435?project=dogvcat-196520
View logs at:
https://console.cloud.google.com/logs?resource=ml.googleapis.com%2Fjob_id%2Fcloudml_2018_02_27_220654435&project=dogvcat-196520
Check job status with: job_status("cloudml_2018_02_27_220654435")
Collect job output with: job_collect("cloudml_2018_02_27_220654435")
After collect, view with: view_run("runs/cloudml_2018_02_27_220654435")
> job_status("cloudml_2018_02_27_220654435")
$ createTime : chr "2018-02-27T22:09:40Z"
$ endTime : chr "2018-02-27T22:17:11Z"
$ errorMessage : chr "The replica master 0 exited with a non-zero status of 1."
$ jobId : chr "cloudml_2018_02_27_220654435"
$ startTime : chr "2018-02-27T22:10:04Z"
$ state : chr "FAILED"
$ trainingInput :List of 3
..$ jobDir : chr "gs://dogvcat-196520/r-cloudml/staging"
..$ region : chr "us-central1"
..$ runtimeVersion: chr "1.4"
$ trainingOutput:List of 1
..$ consumedMLUnits: num 0.09
View job in the Cloud Console at:
https://console.cloud.google.com/ml/jobs/cloudml_2018_02_27_220654435?project=dogvcat-196520
View logs at:
https://console.cloud.google.com/logs?resource=ml.googleapis.com%2Fjob_id%2Fcloudml_2018_02_27_220654435&project=dogvcat-196520
The logs show a few errors. The first is:
2018-02-27 14:10:43.490 PST
master-replica-0 Failed building wheel for cloudml
The next error is:
master-replica-0 Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-0yV9J_-build/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-VBGpOM-record/install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in /tmp/pip-0yV9J_-build/
Followed by:
master-replica-0 Command '['pip', 'install', '--user', '--upgrade', '--force-reinstall', '--no-deps', u'cloudml-1.0.0.0.tar.gz']' returned non-zero exit status 1
Followed by:
The replica master 0 exited with a non-zero status of 1.
I'm also getting errors related to copying files throughout the logs, such as:
error: can't copy 'cloudml-model/datasmall': doesn't exist or not a regular file
I tried simply removing the directory named in the first error (it wasn't a necessary directory), but then the same error showed up for a different directory.
Any ideas?
@rtjohn I would first try training the simple MNIST example by running:
dir.create("mnist-train")
file.copy(system.file("examples/mnist/train.R", package = "cloudml"), "mnist-train")
setwd("mnist-train")
cloudml::cloudml_train()
Does the above script train correctly? If it does, I would next move the data you want to copy into "mnist-train" and rerun, to make sure the data can also be copied. Only after that would I switch back to the original training script.
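For the second step, here is a minimal sketch of copying a data directory into "mnist-train" before resubmitting. The helper name `stage_data` is hypothetical (it is not part of cloudml), and the directory name "datasmall" is only assumed from the error message above; adjust both to your project.

```r
# Hypothetical helper (not part of the cloudml package): copy a data
# directory into the training directory and report what was copied.
stage_data <- function(data_dir, train_dir = "mnist-train") {
  stopifnot(dir.exists(data_dir))
  dir.create(train_dir, showWarnings = FALSE, recursive = TRUE)

  # With recursive = TRUE, file.copy() copies the directory and its
  # contents into the existing directory train_dir.
  ok <- file.copy(data_dir, train_dir, recursive = TRUE)
  if (!all(ok)) {
    stop("Could not copy ", data_dir, " into ", train_dir)
  }

  # Return the copied files, relative to the new data directory.
  list.files(file.path(train_dir, basename(data_dir)), recursive = TRUE)
}
```

Usage would look roughly like `stage_data("datasmall")`, then `setwd("mnist-train")` and `cloudml::cloudml_train()` as above. If that job fails with the same "can't copy" error, the problem is likely in how the data directory gets packaged rather than in your training script.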