
Waymo model not running on cloud

Open · bilalsattar opened this issue on Apr 12, 2020 · 22 comments

I am getting the following error when running the Waymo vehicle model on Google Cloud.

I am using model=params.waymo.StarNetVehicle

The error occurs in the decoder part.

2020-04-12T10:09:46.622604Z Imported params.waymo
I 2020-04-12T10:09:46.622740Z Known model: waymo.StarNetBase
I 2020-04-12T10:09:46.622778Z Known model: waymo.StarNetPed
I 2020-04-12T10:09:46.622818Z Known model: waymo.StarNetPedFused
I 2020-04-12T10:09:46.622855Z Known model: waymo.StarNetVehicle
I 2020-04-12T10:09:46.626307696Z Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1854, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1846, in main
    RunnerManager(FLAGS.model).Start()
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1842, in Start
    self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1589, in CreateRunners
    trial)
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1555, in _CreateRunner
    cfg = self.GetParamsForDataset('decoder', dataset_name)
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1401, in GetParamsForDataset
    cfg = self.model_registry.GetParams(self._model_name, dataset_name)
  File "/usr/local/lib/python3.6/dist-packages/lingvo/model_registry.py", line 266, in GetParams
    return _ModelRegistryHelper.GetParams(class_key, dataset_name)
  File "/usr/local/lib/python3.6/dist-packages/lingvo/model_registry.py", line 222, in GetParams
    model_params_cls = cls.GetClass(class_key)
  File "/usr/local/lib/python3.6/dist-packages/lingvo/model_registry.py", line 205, in GetClass
    class_key)
LookupError: Model params.waymo.StarNetVehicle not found from list of above known models.

bilalsattar avatar Apr 12 '20 10:04 bilalsattar

I think the problem arises due to this code in _ModelParamsClassKey:

    if 'params.' in path: path = path.replace('params.', '')

but when we read a specific class, it is not removing this string.

Please update the code.
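
For illustration, a minimal self-contained sketch of the suspected mismatch (the helpers below are simplified stand-ins for lingvo's model_registry code, not the actual implementation):

    _MODEL_REGISTRY = {}

    def _model_params_class_key(path, class_name):
      # Registration strips the "params." prefix: "params.waymo" -> "waymo".
      if 'params.' in path:
        path = path.replace('params.', '')
      return '%s.%s' % (path, class_name)

    def register(path, class_name, params_cls):
      _MODEL_REGISTRY[_model_params_class_key(path, class_name)] = params_cls

    def get_params(class_key):
      # A lookup with the raw flag value "params.waymo.StarNetVehicle" misses,
      # because only "waymo.StarNetVehicle" was registered.
      if class_key not in _MODEL_REGISTRY:
        raise LookupError('Model %s not found.' % class_key)
      return _MODEL_REGISTRY[class_key]

    register('params.waymo', 'StarNetVehicle', object)
    print(get_params('waymo.StarNetVehicle'))        # found
    # get_params('params.waymo.StarNetVehicle')      # raises LookupError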

bilalsattar avatar Apr 12 '20 12:04 bilalsattar

Please use model=waymo.StarNetVehicle without the leading params.

Internally we have always used the model name without the leading "params.". The initial open-source version had a bug which required using "params.model_name", but that CL fixed it.

jonathanasdf avatar Apr 13 '20 04:04 jonathanasdf

Thanks! But when I use TPU type v3-8 for training instead of v3-32 (which is really expensive and only available for evaluation), I get this error:

2020-04-12T21:48:19.580902Z params.train.max_steps: 742317, enqueue_max_steps: -1
I 2020-04-12T21:48:20.041276Z Current global_enqueue_steps: 0, local_enqueue_steps: 0, global_step: 0
E 2020-04-12T22:13:22.503419380Z 2020-04-12 22:13:22.503093: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1586729602.502056257","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC

E 2020-04-12T22:13:22.503899242Z 2020-04-12 22:13:22.503612: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1586729602.501794292","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC

I 2020-04-12T22:13:48.555083Z Retrying as expected trainer/enqueue_op/group_deps exception: Session e092935cbb6ae33d is not found.
E 2020-04-12T22:13:48.555451056Z Exception in thread SessionCloseThread:

E 2020-04-12T22:13:48.555488697Z Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 757, in close
    tf_session.TF_CloseSession(self._session)
tensorflow.python.framework.errors_impl.AbortedError: Session e092935cbb6ae33d is not found. Possibly, this master has restarted.

E 2020-04-12T22:13:48.555536843Z

E 2020-04-12T22:13:48.555549956Z

I 2020-04-12T22:13:48.557193Z Retrying as expected trainer exception: Session 336cabb6c6663617 is not found.
E 2020-04-12T22:13:48.557403494Z Exception in thread SessionCloseThread:

E 2020-04-12T22:13:48.557444442Z Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 757, in close
    tf_session.TF_CloseSession(self._session)
tensorflow.python.framework.errors_impl.AbortedError: Session 336cabb6c6663617 is not found. Possibly, this master has restarted.

E 2020-04-12T22:13:48.557492977Z

E 2020-04-12T22:13:48.557500654Z

I 2020-04-12T22:13:49.558657Z Retry: caught exception: _RunLoop while running tensorflow.python.framework.errors_impl.AbortedError: Session 336cabb6c6663617 is not found. E 2020-04-12T22:13:49.558990885Z . Call failed at (most recent call last):

E 2020-04-12T22:13:49.558995265Z File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap

E 2020-04-12T22:13:49.558999463Z self._bootstrap_inner()

E 2020-04-12T22:13:49.559002914Z File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner

E 2020-04-12T22:13:49.559026395Z self.run()

E 2020-04-12T22:13:49.559031655Z File "/usr/lib/python3.6/threading.py", line 864, in run

E 2020-04-12T22:13:49.559035566Z self._target(*self._args, **self._kwargs)

E 2020-04-12T22:13:49.559039028Z Traceback for above exception (most recent call last):

E 2020-04-12T22:13:49.559042430Z File "/usr/local/lib/python3.6/dist-packages/lingvo/core/retry.py", line 53, in Wrapper

E 2020-04-12T22:13:49.559046353Z return func(*args, **kwargs)

E 2020-04-12T22:13:49.559049793Z File "/usr/local/lib/python3.6/dist-packages/lingvo/base_runner.py", line 192, in _RunLoop

E 2020-04-12T22:13:49.559053341Z loop_func(*loop_args)

E 2020-04-12T22:13:49.559056558Z File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 854, in _Loop

E 2020-04-12T22:13:49.559060050Z values, outfeeds = sess.run(self._tpu_train_ops)

E 2020-04-12T22:13:49.559063518Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 960, in run

E 2020-04-12T22:13:49.559067034Z run_metadata_ptr)

E 2020-04-12T22:13:49.559070273Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1183, in _run

E 2020-04-12T22:13:49.559073761Z feed_dict_tensor, options, run_metadata)

E 2020-04-12T22:13:49.559077143Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run

E 2020-04-12T22:13:49.559080644Z run_metadata)

E 2020-04-12T22:13:49.559083879Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call

E 2020-04-12T22:13:49.559087367Z raise type(e)(node_def, op, message)

E 2020-04-12T22:13:49.559090815Z Waiting for 1.54 seconds before retrying.

I 2020-04-12T22:13:49.559430Z Retry: caught exception: _RunLoop while running tensorflow.python.framework.errors_impl.AbortedError: Session e092935cbb6ae33d is not found. I 2020-04-12T22:13:49.559524Z trainer started. E 2020-04-12T22:13:49.563681964Z . Call failed at (most recent call last):

E 2020-04-12T22:13:49.563686876Z File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap

E 2020-04-12T22:13:49.563690661Z self._bootstrap_inner()

E 2020-04-12T22:13:49.563694032Z File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner

E 2020-04-12T22:13:49.563702797Z self.run()

E 2020-04-12T22:13:49.563706199Z File "/usr/lib/python3.6/threading.py", line 864, in run

E 2020-04-12T22:13:49.563709645Z self._target(*self._args, **self._kwargs)

E 2020-04-12T22:13:49.563712935Z File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1619, in

E 2020-04-12T22:13:49.563716959Z return lambda: runner.StartEnqueueOp(op)

E 2020-04-12T22:13:49.563720382Z Traceback for above exception (most recent call last):

E 2020-04-12T22:13:49.563724547Z File "/usr/local/lib/python3.6/dist-packages/lingvo/core/retry.py", line 53, in Wrapper

E 2020-04-12T22:13:49.563728099Z return func(*args, **kwargs)

E 2020-04-12T22:13:49.563731373Z File "/usr/local/lib/python3.6/dist-packages/lingvo/base_runner.py", line 192, in _RunLoop

E 2020-04-12T22:13:49.563734944Z loop_func(*loop_args)

E 2020-04-12T22:13:49.563738194Z File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 800, in _LoopEnqueue

E 2020-04-12T22:13:49.563741667Z return super(TrainerTpu, self)._LoopEnqueue(op, sess)

E 2020-04-12T22:13:49.563745069Z File "/usr/local/lib/python3.6/dist-packages/lingvo/base_runner.py", line 323, in _LoopEnqueue

E 2020-04-12T22:13:49.563760591Z sess.run([op])

E 2020-04-12T22:13:49.563764488Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 960, in run

E 2020-04-12T22:13:49.563768090Z run_metadata_ptr)

E 2020-04-12T22:13:49.563771442Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1183, in _run

E 2020-04-12T22:13:49.563775101Z feed_dict_tensor, options, run_metadata)

E 2020-04-12T22:13:49.563778399Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run

E 2020-04-12T22:13:49.563781923Z run_metadata)

E 2020-04-12T22:13:49.563785127Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call

E 2020-04-12T22:13:49.563788675Z raise type(e)(node_def, op, message)

E 2020-04-12T22:13:49.563791948Z Waiting for 1.51 seconds before retrying.

I 2020-04-12T22:13:49.563888Z trainer/enqueue_op/group_deps started. I 2020-04-12T22:13:51.734431287Z Error detected in gke_instances

bilalsattar avatar Apr 13 '20 12:04 bilalsattar

The errors "Unavailable: Socket closed" and "AbortedError: Session 336cabb6c6663617 is not found" mean that one of your machines has died or is unreachable. You should check the status of your machines and restart the job.
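
If it helps, a quick way to sanity-check that the TPU endpoint still resolves before restarting (a minimal sketch; the TPU name, zone, and project below are placeholders):

    import tensorflow as tf

    # Placeholders: replace with your TPU name, zone, and GCP project.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
        tpu='my-tpu-name', zone='us-central1-a', project='my-gcp-project')
    print(resolver.master())                    # e.g. grpc://10.0.0.2:8470
    print(resolver.cluster_spec().as_dict())    # worker addresses, if reachable

If this raises or hangs, the TPU worker itself is likely gone and needs to be recreated.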

jonathanasdf avatar Apr 13 '20 12:04 jonathanasdf

Is it possible to train the Waymo model on a cloud GPU? If so, which one would be best to use?

bilalsattar avatar Apr 13 '20 12:04 bilalsattar

I have tried again and I am getting the same error. I have checked that all the machines are running.

However, I have noticed that in gke_launch.py we use the following command:

args: ["-m", "lingvo.trainer", "--mode=sync", "--alsologtostderr", "--model={model}", "--logdir={logdir}", "--tpu=$(KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS)"]

Is KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS a default environment variable, or do we have to set it somewhere? It is responsible for providing the grpc:// address and also the master in trainer.py:

========================================

cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu=FLAGS.tpu,
    project=FLAGS.gcp_project,
    zone=FLAGS.tpu_zone,
    job_name=FLAGS.job)
cluster_spec_dict = cluster_resolver.cluster_spec().as_dict()

FLAGS.mode = 'sync'
FLAGS.tf_master = cluster_resolver.master()

========================================
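
A quick way to check inside the running container whether that variable is set at all (a minimal sketch; the note about the pod spec comes from GKE's Cloud TPU setup, not from this repo):

    import os

    # GKE injects KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS into pods whose spec requests
    # Cloud TPU resources; it is not something you normally export by hand.
    # gke_launch.py just forwards it to the trainer as --tpu.
    endpoint = os.environ.get('KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS')
    if endpoint:
      print('TPU endpoint(s) from GKE:', endpoint)
    else:
      print('KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS is not set; check that the pod '
            'spec requests Cloud TPU resources, or pass --tpu explicitly.')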

bilalsattar avatar Apr 13 '20 14:04 bilalsattar

I ran it on a cloud GPU and I got a read error on the Waymo bucket, although I can open the link in the browser and download the files. What should I do?

E 2020-04-13T21:07:34.446727260Z 	 when reading gs://waymo_open_dataset_v_1_0_0_tf_example_lingvo/v.1.0.0
 
E 2020-04-13T21:07:34.446731254Z 2020-04-13 21:07:34.365021: W lingvo/core/ops/record_batcher.cc:307] Permission denied: Error executing an HTTP request: HTTP response code 403 with body '{
 
E 2020-04-13T21:07:34.446734947Z   "error": {
 
E 2020-04-13T21:07:34.446738269Z     "code": 403,
 
E 2020-04-13T21:07:34.446741586Z     "message": "*********[email protected] does not have storage.objects.list access to waymo_open_dataset_v_1_0_0_tf_example_lingvo.",
 

bilalsattar avatar Apr 13 '20 21:04 bilalsattar

@bilalsattar Try again -- I asked the Waymo Open Dataset team to ensure access to registered users, so your job should hopefully have access now.

Please keep in mind that the data is now stale since version v.1.1.0 has been released. I'll try to get an updated version of the dataset uploaded to a bucket soon, so it's compatible with our code at HEAD.

vrv avatar Apr 13 '20 22:04 vrv

I am still having the same error.

bilalsattar avatar Apr 13 '20 23:04 bilalsattar

Hm, the only thing I can think is to verify that the account *********[email protected] has been correctly registered on the Waymo Open Dataset website; the account used to access the data must be the same as the one registered on their website.

vrv avatar Apr 13 '20 23:04 vrv

It's the same account, as I can open the bucket in a web browser. I can also use gsutil cp on it.

bilalsattar avatar Apr 13 '20 23:04 bilalsattar

Okay, at this point it probably makes sense to file an issue at https://github.com/waymo-research/waymo-open-dataset pointing at this issue for more directed help (we don't control that bucket or its registration, so we have limited ability to debug this).

vrv avatar Apr 13 '20 23:04 vrv

They have given access to user accounts, but cloud VMs use service accounts (*********[email protected]) to communicate. Is there any workaround to override this?

bilalsattar avatar Apr 14 '20 00:04 bilalsattar

The only thing I can think of is to use your user account to copy the data into a bucket that you control, and use that bucket instead (since you can then also give access to your service accounts). It's a big dataset, though, so I'm hoping they can figure out a solution to host one copy for those who can't copy it.
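
A minimal sketch of that copy using the google-cloud-storage client (the project and destination bucket names are placeholders; in practice gsutil -m cp -r under your user credentials will be much faster for a dataset this size):

    from google.cloud import storage  # runs under your *user* credentials

    client = storage.Client(project='my-gcp-project')   # placeholder project
    src = client.bucket('waymo_open_dataset_v_1_0_0_tf_example_lingvo')
    dst = client.bucket('my-own-lingvo-bucket')          # a bucket you control

    # Copy object by object; afterwards grant your service account
    # roles/storage.objectViewer on the destination bucket.
    for blob in client.list_blobs(src, prefix='v.1.0.0'):
      src.copy_blob(blob, dst, new_name=blob.name)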

vrv avatar Apr 14 '20 00:04 vrv

It's really expensive to copy from one bucket to another.

bilalsattar avatar Apr 14 '20 00:04 bilalsattar

@vrv Can you tell me how much time it takes to train the Waymo vehicle model on a Cloud TPU?

bilalsattar avatar Apr 15 '20 16:04 bilalsattar

We used a Cloud TPU v3-32 and it trained for about 15 hours to get to the numbers reported in the paper, but we haven't trained it recently on GCP and I'm not sure if there are any bottlenecks. In particular, the model is quite input CPU bound without optimizations, so you'd need to apply the 'Fused' input layer to the model like we have done for the pedestrian example (StarNetPedFused).

vrv avatar Apr 15 '20 17:04 vrv

@vrv - Do you recall how many steps you trained for in those 15 hours? I've trained on GPU for >24 hrs but performance is below yours, so I'm trying to figure out why there is a performance gap. Thanks

JWHennessey avatar May 24 '20 14:05 JWHennessey

The training settings in waymo.py should describe what we train for: 150 epochs at effective batch size 128 (split across 32 TPU cores): I believe this corresponds to around 92500 steps/updates. If you use a different configuration, the learning rate settings may need to be tuned to get similar performance — we don’t have any GPU-based baselines to compare against.

What kind of accuracy are you getting with what GPU training configuration and after how many steps? Might be helpful to know for others who choose a similar config.
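
For scale, a back-of-the-envelope sketch (taking the ~92,500-step figure above as given; the implied frames-per-epoch is derived arithmetic, not a documented number) of how the step count moves with effective batch size:

    tpu_steps, tpu_batch, epochs = 92_500, 128, 150

    # Implied frames per epoch for this recipe: steps * batch / epochs.
    frames_per_epoch = tpu_steps * tpu_batch / epochs    # ~78,900 (derived)

    # Same 150 epochs at a different effective batch size, e.g. 8 GPUs x batch 1.
    gpu_batch = 8
    gpu_steps = epochs * frames_per_epoch / gpu_batch    # ~1.48M steps
    print(round(frames_per_epoch), round(gpu_steps))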

vrv avatar May 24 '20 18:05 vrv

Thanks for responding.

I'm using 8 x V100 GPUs, a batch size of 1, and num_cell_centers=512 (training only). So far I have trained for 22K steps.

mAP/L1 for Vehicles is 0.4135 on validation.

I'll experiment with configs and learning rates some more. Sounds like I mainly just need to train for a lot longer.

JWHennessey avatar May 24 '20 18:05 JWHennessey

Yeah definitely train for longer. Also remember that using more centers at evaluation time should bump up the mAP. I don’t have numbers for vehicles with fewer centers, but you can look at the original paper for how the number of centers (and points per center) gives different results for Pedestrians — I’d assume similar curves for vehicles.

vrv avatar May 24 '20 20:05 vrv

@bilalsattar Hi, did you run StarNet with GKE?

yinjunbo avatar May 30 '20 15:05 yinjunbo