master-replica-0 Failed building wheel for cloudml

Open rtjohn opened this issue 7 years ago • 1 comments

I've been following the tutorial from here: https://tensorflow.rstudio.com/tools/cloudml/articles/getting_started.html

Submitting a job that runs fine on my local machine to Google Cloud produces the following:

> cloudml_train("R/BuildingNetwork.R")
Submitting training job to CloudML...
Job 'cloudml_2018_02_27_220654435' successfully submitted.

View job in the Cloud Console at:
https://console.cloud.google.com/ml/jobs/cloudml_2018_02_27_220654435?project=dogvcat-196520

View logs at:
https://console.cloud.google.com/logs?resource=ml.googleapis.com%2Fjob_id%2Fcloudml_2018_02_27_220654435&project=dogvcat-196520

Check job status with:     job_status("cloudml_2018_02_27_220654435")

Collect job output with:   job_collect("cloudml_2018_02_27_220654435")

After collect, view with:  view_run("runs/cloudml_2018_02_27_220654435")

> job_status("cloudml_2018_02_27_220654435")
 $ createTime    : chr "2018-02-27T22:09:40Z"
 $ endTime       : chr "2018-02-27T22:17:11Z"
 $ errorMessage  : chr "The replica master 0 exited with a non-zero status of 1."
 $ jobId         : chr "cloudml_2018_02_27_220654435"
 $ startTime     : chr "2018-02-27T22:10:04Z"
 $ state         : chr "FAILED"
 $ trainingInput :List of 3
  ..$ jobDir        : chr "gs://dogvcat-196520/r-cloudml/staging"
  ..$ region        : chr "us-central1"
  ..$ runtimeVersion: chr "1.4"
 $ trainingOutput:List of 1
  ..$ consumedMLUnits: num 0.09

View job in the Cloud Console at:
https://console.cloud.google.com/ml/jobs/cloudml_2018_02_27_220654435?project=dogvcat-196520

View logs at:
https://console.cloud.google.com/logs?resource=ml.googleapis.com%2Fjob_id%2Fcloudml_2018_02_27_220654435&project=dogvcat-196520

The logs show a few errors. The first is:

2018-02-27 14:10:43.490 PST
master-replica-0 Failed building wheel for cloudml

The next error is:

master-replica-0 Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-0yV9J_-build/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-VBGpOM-record/install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in /tmp/pip-0yV9J_-build/

Followed by:

master-replica-0  Command '['pip', 'install', '--user', '--upgrade', '--force-reinstall', '--no-deps', u'cloudml-1.0.0.0.tar.gz']' returned non-zero exit status 1

Followed by:

The replica master 0 exited with a non-zero status of 1.

I'm also getting errors related to copying files throughout the logs, such as:

error: can't copy 'cloudml-model/datasmall': doesn't exist or not a regular file

I tried removing the directory listed the first time (it wasn't a necessary directory), but then the same error showed up for a different directory.

Any ideas?

Feb 27 '18 22:02 rtjohn

@rtjohn I would first try training the simple MNIST example by running:

dir.create("mnist-train")
file.copy(system.file("examples/mnist/train.R", package = "cloudml"), "mnist-train")
setwd("mnist-train")
cloudml::cloudml_train()

Does the above script train correctly? If it does, I would next move the data you want to copy into "mnist-train" and rerun to confirm that the data can also be copied. Once that works, I would switch to the original training script.
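
The data-copying step could be sketched roughly as follows. This is only an illustration: `stage_data` is a hypothetical helper, not a function from the cloudml package, and "datasmall" is assumed to be the data directory named in the copy error above. It copies the directory into "mnist-train" using base R and lists what was staged, so the contents can be checked before resubmitting.

```r
# Hypothetical helper (NOT part of the cloudml package): copy a data
# directory into the training directory and report what was staged.
stage_data <- function(data_dir, train_dir = "mnist-train") {
  stopifnot(dir.exists(data_dir))
  dir.create(train_dir, showWarnings = FALSE, recursive = TRUE)

  # With recursive = TRUE, file.copy() places data_dir itself
  # (and its contents) inside the existing directory train_dir.
  ok <- file.copy(data_dir, train_dir, recursive = TRUE, overwrite = TRUE)
  if (!all(ok)) {
    stop("Could not copy ", data_dir, " into ", train_dir)
  }

  # Return the relative paths of the staged files for inspection.
  staged <- file.path(train_dir, basename(data_dir))
  list.files(staged, recursive = TRUE)
}

# Example usage (assumes a local "datasmall" directory; left commented
# out because the last call submits a real CloudML job):
# stage_data("datasmall")
# setwd("mnist-train")
# cloudml::cloudml_train()
```

If the staged file list looks right and the MNIST job still trains with the extra data in place, the copy errors are more likely tied to something specific in the original project directory.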

Mar 28 '18 22:03 javierluraschi