Waymo model not running on Google Cloud
I am getting the following error when running the Waymo vehicle model on Google Cloud.
I am using model=params.waymo.StarNetVehicle.
The error occurs in the decoder part.
2020-04-12T10:09:46.622604Z Imported params.waymo I
2020-04-12T10:09:46.622740Z Known model: waymo.StarNetBase I
2020-04-12T10:09:46.622778Z Known model: waymo.StarNetPed I
2020-04-12T10:09:46.622818Z Known model: waymo.StarNetPedFused I
2020-04-12T10:09:46.622855Z Known model: waymo.StarNetVehicle I
2020-04-12T10:09:46.626307696Z Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1854, in
I think the problem arises due to the following in _ModelParamsClassKey:
========================================
if 'params.' in path:
  path = path.replace('params.', '')
========================================
But when we read a specific class, it is not removing this string.
Please update the code.
Please use model=waymo.StarNetVehicle without the leading params.
Internally we have always used the model name without the leading "params.". The initial open-source version had a bug which required using "params.model_name", but that CL fixed it.
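For reference, a minimal sketch of that kind of prefix normalization; the function name and shape here are illustrative, not the exact code in trainer.py:
========================================
# Hypothetical sketch of normalizing a model name before registry lookup,
# so both 'waymo.StarNetVehicle' and 'params.waymo.StarNetVehicle' resolve
# to the same key. Names are illustrative; see _ModelParamsClassKey in
# lingvo/trainer.py for the real logic.
def NormalizeModelName(path):
  prefix = 'params.'
  if path.startswith(prefix):
    path = path[len(prefix):]
  return path

assert NormalizeModelName('params.waymo.StarNetVehicle') == 'waymo.StarNetVehicle'
assert NormalizeModelName('waymo.StarNetVehicle') == 'waymo.StarNetVehicle'
========================================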
Thanks! But when I use TPU type v3-8 for training instead of v3-32 (which is really expensive and only available to me for evaluation), I get this error:
2020-04-12T21:48:19.580902Z params.train.max_steps: 742317, enqueue_max_steps: -1 I 2020-04-12T21:48:20.041276Z Current global_enqueue_steps: 0, local_enqueue_steps: 0, global_step: 0 I E 2020-04-12T22:13:22.503419380Z 2020-04-12 22:13:22.503093: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1586729602.502056257","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
E 2020-04-12T22:13:22.503899242Z 2020-04-12 22:13:22.503612: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1586729602.501794292","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
I 2020-04-12T22:13:48.555083Z Retrying as expected trainer/enqueue_op/group_deps exception: Session e092935cbb6ae33d is not found. E 2020-04-12T22:13:48.555451056Z Exception in thread SessionCloseThread:
E 2020-04-12T22:13:48.555488697Z Traceback (most recent call last): File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner self.run() File "/usr/lib/python3.6/threading.py", line 864, in run self._target(*self._args, **self._kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 757, in close tf_session.TF_CloseSession(self._session) tensorflow.python.framework.errors_impl.AbortedError: Session e092935cbb6ae33d is not found. Possibly, this master has restarted.
E 2020-04-12T22:13:48.555536843Z
E 2020-04-12T22:13:48.555549956Z
I 2020-04-12T22:13:48.557193Z Retrying as expected trainer exception: Session 336cabb6c6663617 is not found. E 2020-04-12T22:13:48.557403494Z Exception in thread SessionCloseThread:
E 2020-04-12T22:13:48.557444442Z Traceback (most recent call last): File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner self.run() File "/usr/lib/python3.6/threading.py", line 864, in run self._target(*self._args, **self._kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 757, in close tf_session.TF_CloseSession(self._session) tensorflow.python.framework.errors_impl.AbortedError: Session 336cabb6c6663617 is not found. Possibly, this master has restarted.
E 2020-04-12T22:13:48.557492977Z
E 2020-04-12T22:13:48.557500654Z
I 2020-04-12T22:13:49.558657Z Retry: caught exception: _RunLoop while running tensorflow.python.framework.errors_impl.AbortedError: Session 336cabb6c6663617 is not found. E 2020-04-12T22:13:49.558990885Z . Call failed at (most recent call last):
E 2020-04-12T22:13:49.558995265Z File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
E 2020-04-12T22:13:49.558999463Z self._bootstrap_inner()
E 2020-04-12T22:13:49.559002914Z File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
E 2020-04-12T22:13:49.559026395Z self.run()
E 2020-04-12T22:13:49.559031655Z File "/usr/lib/python3.6/threading.py", line 864, in run
E 2020-04-12T22:13:49.559035566Z self._target(*self._args, **self._kwargs)
E 2020-04-12T22:13:49.559039028Z Traceback for above exception (most recent call last):
E 2020-04-12T22:13:49.559042430Z File "/usr/local/lib/python3.6/dist-packages/lingvo/core/retry.py", line 53, in Wrapper
E 2020-04-12T22:13:49.559046353Z return func(*args, **kwargs)
E 2020-04-12T22:13:49.559049793Z File "/usr/local/lib/python3.6/dist-packages/lingvo/base_runner.py", line 192, in _RunLoop
E 2020-04-12T22:13:49.559053341Z loop_func(*loop_args)
E 2020-04-12T22:13:49.559056558Z File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 854, in _Loop
E 2020-04-12T22:13:49.559060050Z values, outfeeds = sess.run(self._tpu_train_ops)
E 2020-04-12T22:13:49.559063518Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 960, in run
E 2020-04-12T22:13:49.559067034Z run_metadata_ptr)
E 2020-04-12T22:13:49.559070273Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1183, in _run
E 2020-04-12T22:13:49.559073761Z feed_dict_tensor, options, run_metadata)
E 2020-04-12T22:13:49.559077143Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
E 2020-04-12T22:13:49.559080644Z run_metadata)
E 2020-04-12T22:13:49.559083879Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
E 2020-04-12T22:13:49.559087367Z raise type(e)(node_def, op, message)
E 2020-04-12T22:13:49.559090815Z Waiting for 1.54 seconds before retrying.
I 2020-04-12T22:13:49.559430Z Retry: caught exception: _RunLoop while running tensorflow.python.framework.errors_impl.AbortedError: Session e092935cbb6ae33d is not found. I 2020-04-12T22:13:49.559524Z trainer started. E 2020-04-12T22:13:49.563681964Z . Call failed at (most recent call last):
E 2020-04-12T22:13:49.563686876Z File "/usr/lib/python3.6/threading.py", line 884, in _bootstrap
E 2020-04-12T22:13:49.563690661Z self._bootstrap_inner()
E 2020-04-12T22:13:49.563694032Z File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
E 2020-04-12T22:13:49.563702797Z self.run()
E 2020-04-12T22:13:49.563706199Z File "/usr/lib/python3.6/threading.py", line 864, in run
E 2020-04-12T22:13:49.563709645Z self._target(*self._args, **self._kwargs)
E 2020-04-12T22:13:49.563712935Z File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1619, in
E 2020-04-12T22:13:49.563716959Z return lambda: runner.StartEnqueueOp(op)
E 2020-04-12T22:13:49.563720382Z Traceback for above exception (most recent call last):
E 2020-04-12T22:13:49.563724547Z File "/usr/local/lib/python3.6/dist-packages/lingvo/core/retry.py", line 53, in Wrapper
E 2020-04-12T22:13:49.563728099Z return func(*args, **kwargs)
E 2020-04-12T22:13:49.563731373Z File "/usr/local/lib/python3.6/dist-packages/lingvo/base_runner.py", line 192, in _RunLoop
E 2020-04-12T22:13:49.563734944Z loop_func(*loop_args)
E 2020-04-12T22:13:49.563738194Z File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 800, in _LoopEnqueue
E 2020-04-12T22:13:49.563741667Z return super(TrainerTpu, self)._LoopEnqueue(op, sess)
E 2020-04-12T22:13:49.563745069Z File "/usr/local/lib/python3.6/dist-packages/lingvo/base_runner.py", line 323, in _LoopEnqueue
E 2020-04-12T22:13:49.563760591Z sess.run([op])
E 2020-04-12T22:13:49.563764488Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 960, in run
E 2020-04-12T22:13:49.563768090Z run_metadata_ptr)
E 2020-04-12T22:13:49.563771442Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1183, in _run
E 2020-04-12T22:13:49.563775101Z feed_dict_tensor, options, run_metadata)
E 2020-04-12T22:13:49.563778399Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
E 2020-04-12T22:13:49.563781923Z run_metadata)
E 2020-04-12T22:13:49.563785127Z File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
E 2020-04-12T22:13:49.563788675Z raise type(e)(node_def, op, message)
E 2020-04-12T22:13:49.563791948Z Waiting for 1.51 seconds before retrying.
I 2020-04-12T22:13:49.563888Z trainer/enqueue_op/group_deps started. I 2020-04-12T22:13:51.734431287Z Error detected in gke_instances
"Unavailable: Socket closed" and "AbortedError: Session 336cabb6c6663617 is not found" mean that one of your machines has died or is unreachable. You should check the status of your machines and restart the job.
Is it possible to train the Waymo model on a Cloud GPU? If yes, which one would be best to use?
I have tried again and I am getting the same error. I have checked that all the machines are running.
I have also noted that in gke_launch.py we use the following command:
args: ["-m", "lingvo.trainer", "--mode=sync", "--alsologtostderr", "--model={model}", "--logdir={logdir}", "--tpu=$(KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS)"]
Is KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS a default environment variable, or do we have to set it somewhere? It is responsible for providing the grpc:// address and also the master in trainer.py:
========================================
cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu=FLAGS.tpu,
    project=FLAGS.gcp_project,
    zone=FLAGS.tpu_zone,
    job_name=FLAGS.job)
cluster_spec_dict = cluster_resolver.cluster_spec().as_dict()
FLAGS.mode = 'sync'
FLAGS.tf_master = cluster_resolver.master()
========================================
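A minimal sketch of how such an endpoint reaches the resolver, assuming (as the gke_launch.py args suggest) that GKE injects KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS into pods that request Cloud TPU resources; outside GKE you would pass a TPU name or endpoint yourself:
========================================
# Sketch: feed a grpc:// TPU endpoint from the environment into
# TPUClusterResolver, as gke_launch.py does via --tpu.
# Assumption: GKE sets KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS for pods that
# request Cloud TPU resources.
import os
import tensorflow as tf

tpu_endpoint = os.environ.get('KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS', '')
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_endpoint)
print(resolver.master())  # e.g. grpc://10.0.0.2:8470
========================================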
I ran it on a Cloud GPU and got a read error on the Waymo bucket, although I can open the link in the browser and download the files. What should I do?
E 2020-04-13T21:07:34.446727260Z when reading gs://waymo_open_dataset_v_1_0_0_tf_example_lingvo/v.1.0.0
E 2020-04-13T21:07:34.446731254Z 2020-04-13 21:07:34.365021: W lingvo/core/ops/record_batcher.cc:307] Permission denied: Error executing an HTTP request: HTTP response code 403 with body '{
E 2020-04-13T21:07:34.446734947Z "error": {
E 2020-04-13T21:07:34.446738269Z "code": 403,
E 2020-04-13T21:07:34.446741586Z "message": "*********[email protected] does not have storage.objects.list access to waymo_open_dataset_v_1_0_0_tf_example_lingvo.",
@bilalsattar Try again -- I asked the Waymo Open Dataset team to ensure access to registered users, so your job should hopefully have access now.
Please keep in mind that the data is now stale since version v.1.1.0 has been released. I'll try to get an updated version of the dataset uploaded to a bucket soon, so it's compatible with our code at HEAD.
I am still having the same error.
Hm, the only thing I can think is to verify that the account *********[email protected] has been correctly registered on the Waymo Open Dataset website; the account used to access the data must be the same as the one registered on their website.
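If it helps, a minimal way to check which identity a job actually runs as (a sketch using the google-auth package, not part of lingvo):
========================================
# Sketch: print the default GCP credentials a job will use. Requires
# the google-auth package; illustrative, not part of lingvo.
import google.auth

credentials, project = google.auth.default()
# Service-account credentials expose the account email; this must match
# an identity that has been granted access to the Waymo bucket.
print(getattr(credentials, 'service_account_email', credentials))
print('project:', project)
========================================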
It's the same account; I can open the bucket in a web browser and can also use gsutil cp on it.
Okay, at this point it probably makes sense to file an issue at https://github.com/waymo-research/waymo-open-dataset pointing at this issue for more directed help (we don't control that bucket or registration, so we have limited ability to debug this).
They have given access to user accounts, but Cloud VMs use service accounts (*********[email protected]) to communicate. Is there any workaround to override this?
Only thing I can think of is to use your user account to copy the data into a bucket that you control, and use that bucket instead (since you can then also give access to your service accounts). It's a big dataset though, so I'm hoping they can figure out a solution to have one hosted copy for those who can't copy it.
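For what it's worth, a sketch of that copy using the google-cloud-storage client; the destination bucket name is hypothetical, and copy_blob performs a server-side copy, so the data doesn't flow through your machine:
========================================
# Sketch: copy the dataset into a bucket you control, running under your
# user credentials (the ones registered with Waymo). The destination
# bucket name is hypothetical.
from google.cloud import storage

client = storage.Client()
src = client.bucket('waymo_open_dataset_v_1_0_0_tf_example_lingvo')
dst = client.bucket('my-own-dataset-bucket')  # hypothetical

for blob in client.list_blobs(src, prefix='v.1.0.0'):
  src.copy_blob(blob, dst, blob.name)
========================================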
It's really expensive to copy from one bucket to another.
@vrv can you tell how much time it will take to train the Waymo vehicle model on a Cloud TPU?
We used a Cloud TPU v3-32 and trained for about 15 hours to reach the numbers reported in the paper, but we haven't trained it recently on GCP and I'm not sure if there are any bottlenecks. In particular, the model is quite CPU-bound on the input pipeline without optimizations, so you'd need to apply the 'Fused' input layer to the model as we have done for the pedestrian example (StarNetPedFused).
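A hedged sketch of what a fused vehicle variant might look like; the import path and the subclassing idea are assumptions, so mirror whatever StarNetPedFused actually changes relative to StarNetPed in params/waymo.py:
========================================
# Hypothetical: a fused-input vehicle config following the
# StarNetPedFused pattern. The import path below is an assumption;
# the real recipe (typically input extractor/preprocessor changes)
# is whatever StarNetPedFused does in params/waymo.py.
from lingvo import model_registry
from lingvo.tasks.car.params import waymo  # path is an assumption

@model_registry.RegisterSingleTaskModel
class StarNetVehicleFused(waymo.StarNetVehicle):
  """Vehicle model with the fused input pipeline (sketch only)."""
========================================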
@vrv - Do you recall how many steps you trained for in those 15 hours? I've trained on GPU for >24 hours, but performance is below yours, so I'm trying to understand why there is a performance gap. Thanks
The training settings in waymo.py should describe what we train for: 150 epochs at an effective batch size of 128 (split across 32 TPU cores); I believe this corresponds to around 92500 steps/updates. If you use a different configuration, the learning rate settings may need to be tuned to get similar performance; we don't have any GPU-based baselines to compare against.
What kind of accuracy are you getting with what GPU training configuration and after how many steps? Might be helpful to know for others who choose a similar config.
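As a sanity check on that step count, the arithmetic (the example count below is a placeholder; substitute your dataset's actual size):
========================================
# Back-of-the-envelope: converting epochs to trainer steps.
# num_train_examples is a placeholder; use your dataset's real size.
num_train_examples = 79000  # hypothetical; chosen to land near 92.5k steps
num_epochs = 150
effective_batch_size = 128  # split across 32 TPU cores in the paper setup

steps = num_epochs * num_train_examples // effective_batch_size
print(steps)  # 92578 with the placeholder above
========================================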
Thanks for responding.
I'm using 8 x V100 GPUs, a batch size of 1, and num_cell_centers=512 (training only). So far I have trained for 22K steps.
mAP/L1 for Vehicles is 0.4135 on validation.
I'll experiment with configs and learning rates some more. Sounds like I mainly just need to train for a lot longer.
Yeah, definitely train for longer. Also remember that using more centers at evaluation time should bump up the mAP. I don't have numbers for vehicles with fewer centers, but you can look at the original paper for how the number of centers (and points per center) gives different results for pedestrians; I'd assume similar curves for vehicles.
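A minimal sketch of such an evaluation-time override; where exactly num_cell_centers lives in the StarNet params is an assumption, so check waymo.py for the real path:
========================================
# Illustrative only: bump the number of sampled cell centers for
# evaluation. The parameter path is a guess; consult the StarNet
# params in waymo.py for where num_cell_centers actually lives.
def EvalParamsWithMoreCenters(base_params, num_centers=1024):
  p = base_params.Copy()            # lingvo Params support Copy()
  p.num_cell_centers = num_centers  # hypothetical path
  return p
========================================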
@bilalsattar Hi, did you run StarNet with GKE?