profiler
"No trace event is collected" when using tensorboard / capture_tpu_profile
I was trying to profile BERT on a Google Cloud TPU VM (v3-8 | tpu-vm-tf-2.7.0), so I followed the guide while fine-tuning BERT.
But when I press Capture, it says "No trace event is collected", so I thought the problem might be specific to TPU and posted a question on Stack Overflow.
* Full log:
Starting to trace for 1000 ms. Remaining attempt(s): 3
No trace event is collected. Automatically retrying.
Starting to trace for 1000 ms. Remaining attempt(s): 2
No trace event is collected. Automatically retrying.
Starting to trace for 1000 ms. Remaining attempt(s): 1
No trace event is collected. Automatically retrying.
Starting to trace for 1000 ms. Remaining attempt(s): 0
No trace event is collected after 4 attempt(s). Perhaps, you want to try again (with more attempts?).
Tip: increase number of attempts with --num_tracing_attempts.
After that, I thought TensorBoard itself might be the problem, so I followed the Tensorflow Serving Readme on my personal machines (macOS 10.15 / Ubuntu 18.04) using CPU, but both got stuck with the same error: "No trace event is collected. Automatically retrying."
Original issue filed at Tensorboard Issue 5517
The output from diagnose_tensorboard.py is pasted in the original issue.
cf. The TensorBoard web UI toasts "Capture profile successfully. Please refresh.", but the toast disappears after 0.5 sec and nothing changes after a refresh.
Have you tried increasing the number of tracing attempts as suggested in the log? Similarly, you can try increasing the profile duration. The potential issues section of the guide has some suggestions for what could be going wrong here and some steps to try. In particular, make sure the TPU is running before capturing the trace.
@dmmolitor Yes, I did. Since the error persisted, I changed the TPU architecture to Node and everything worked fine. So I guess there might be a bug with the non-Node architecture, since TensorBoard itself cannot even profile the CPU on my laptop, as I mentioned above. Thank you.
You are welcome. If your issue is resolved, could you please close the issue?
@dmmolitor I don't think the issue is resolved, since profiling only works on a specific architecture. I'll leave it open.
I also find myself unable to replicate https://cloud.google.com/tpu/docs/profile-tpu-vm#profile_tab in order to capture profiles on TPU VMs (TPU nodes work fine as @lackhole noted).
In my case, the TensorBoard web UI says "Failed to capture profile: empty trace result." and the tensorboard server records the following errors:
I tensorflow/core/profiler/rpc/client/profiler_client.cc:113] Asynchronous gRPC Profile() to localhost:6000
I tensorflow/core/profiler/rpc/client/remote_profiler_session_manager.cc:96] Issued Profile gRPC to 1 clients
I tensorflow/core/profiler/rpc/client/profiler_client.cc:131] Waiting for completion.
E tensorflow/core/profiler/rpc/client/profiler_client.cc:154] Unavailable: failed to connect to all addresses
W tensorflow/core/profiler/rpc/client/capture_profile.cc:133] No trace event is collected from localhost:6000
W tensorflow/core/profiler/rpc/client/capture_profile.cc:145] localhost:6000 returned Unavailable: failed to connect to all addresses
It doesn't look like this will get resolved by increasing either the number of retries or the profiling duration 🤔
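Since the log says "failed to connect to all addresses", one quick sanity check is whether anything is listening on the profiler port at all. Here is a minimal sketch using only the standard library; the helper name profiler_port_open is made up for illustration, and the host and port are simply the localhost:6000 from the log above.

```python
import socket


def profiler_port_open(host: str = "localhost", port: int = 6000,
                       timeout: float = 1.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds.

    Hypothetical helper: it only tells us whether something is listening,
    not whether that something is a TensorFlow profiler server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


if __name__ == "__main__":
    print("profiler port reachable:", profiler_port_open())
```

If this reports False while the training job is running, more retries or a longer trace duration are unlikely to help, because there is no server to answer the Profile() request.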
I also tried the command line tool capture_tpu_profile, to no avail (I think it only works with TPU nodes).
And here's my TF setup for reference -
$ python3 -m pip list | grep -E 'tensor|cloud-tpu'
cloud-tpu-client 0.10
cloud-tpu-profiler 2.4.0
tensorboard 2.6.0
tensorboard-data-server 0.6.1
tensorboard-plugin-profile 2.11.1
tensorboard-plugin-wit 1.8.1
tensorflow 2.6.5
tensorflow-addons 0.16.1
tensorflow-datasets 4.8.2
tensorflow-estimator 2.6.0
tensorflow-hub 0.12.0
tensorflow-io 0.30.0
tensorflow-io-gcs-filesystem 0.30.0
tensorflow-metadata 1.12.0
tensorflow-model-optimization 0.7.3
tensorflow-text 2.6.0
As it turns out, the "localhost:6000 returned Unavailable: failed to connect to all addresses" error above was due to me forgetting to start the TF profiler server, which is easily fixed by adding tf.profiler.experimental.server.start(6000) to the training script.
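For anyone landing here, this is roughly where the call goes. It is a sketch, not my exact script: the wrapper function start_profiler_server and the ImportError fallback are made up for illustration, and the only real API involved is tf.profiler.experimental.server.start.

```python
def start_profiler_server(port: int = 6000) -> bool:
    """Start the TF profiler gRPC server on `port`, if TensorFlow is available.

    Hypothetical wrapper: returns False instead of raising when TensorFlow
    cannot be imported, so the surrounding script still runs elsewhere.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return False
    # The port must match the one TensorBoard's capture dialog points at
    # (localhost:6000 in the logs above).
    tf.profiler.experimental.server.start(port)
    return True


if __name__ == "__main__":
    # Call this once, before the training loop starts.
    started = start_profiler_server(6000)
    print("profiler server started:", started)
    # ... build the model and run training as usual ...
```

The server has to be up while training is running, since the capture request from TensorBoard is answered by the training process itself.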
I was then able to see the following output from the training session, signalling a successful profile capture ✌
I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
W tensorflow/core/profiler/lib/profiler_session.cc:137] Profiling is late by 25154051 nanoseconds and will start immediately.
I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
I tensorflow/core/profiler/rpc/profiler_service_impl.cc:67] Collecting XSpace to repository: gs://.../plugins/profile/2023_02_03_20_17_09/localhost_6000.xplane.pb
I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
On the tensorboard server side, though, there's a new error:
W tensorflow/core/profiler/convert/xplane_to_tools_data.cc:226] Can not find tool: tool_names. Please update to the latest version of Tensorflow.
which prevented the resulting xplane.pb from being correctly parsed and displayed.
Downgrading tensorboard-plugin-profile from 2.11.1 to 2.8.0, to bring it closer to the installed tensorboard (2.6.0), proved effective 🎉
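To spot this kind of drift earlier, something like the sketch below can compare the installed versions. Note that the "minor version gap" is only my own rough heuristic, not an official compatibility rule, and the helper names minor_gap and installed are made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def minor_gap(a: str, b: str) -> int:
    """Absolute difference between the minor parts of two 'X.Y.Z' versions."""
    return abs(int(a.split(".")[1]) - int(b.split(".")[1]))


def installed(name: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    tb = installed("tensorboard")
    plugin = installed("tensorboard-plugin-profile")
    print("tensorboard:", tb)
    print("tensorboard-plugin-profile:", plugin)
    if tb and plugin:
        # Heuristic only: a large gap is a hint to look closer, nothing more.
        print("minor version gap:", minor_gap(tb, plugin))
```

With the versions from my pip list, the gap was 5 before the downgrade (2.6.0 vs 2.11.1) and 2 after it (2.6.0 vs 2.8.0).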