TensorFlowOnSpark

TensorBoard files get deleted, Profiler returns 0 ms for communication time!

orwa-te opened this issue 3 years ago • 2 comments

Environment:

  • Python version [3.7.7]
  • Spark version [3.0.0]
  • TensorFlow version [2.3.0]
  • TensorFlowOnSpark version [2.2.2]
  • Cluster version [Standalone]

Describe the bug: I have two issues regarding TensorBoard when running a training job for my model on 2 worker nodes:

  1. After the training process completes, the TensorBoard files are deleted immediately on worker 1, while they are kept on worker 0, even though I can use TensorBoard to check details while training is still running.
  2. I am trying to profile my model on the Profiler page to see where time is spent for batches 3 to 5, but I get 0 ms for communication time, specifically Device Collective Communication Time and Device to Device Time. However, Average Step Time shows reasonable values like 19368.9 ms. Also, the Hosts drop-down list shows only one detected host in the cluster, not 2. Why does this happen?

[screenshot of the TensorBoard Profiler page]


Spark Submit Command Line: spark-submit --master spark://master:7077 train_file.py --cluster_size 2 --epochs 1

orwa-te · Mar 11 '21 11:03

  1. When using the "built-in" TensorBoard server in TFoS (triggered by supplying tensorboard=True), the TB server is hosted on the "chief" worker, so it has the same lifecycle as the "chief" worker; that is, it is killed when the Spark job completes. If you want visibility after the job completes, you can write the TB events to a shared/distributed filesystem and then spawn your own TB process pointing to that location (see the sketch after this list).
  2. This sounds more like a question for the TensorFlow team, since TFoS has nothing to do with these metrics. Regardless, I'm assuming that your environment somehow isn't set up to capture this information. For example, I'm guessing that "Device Collective Communication Time" refers to something like NCCL, which you may not have installed (or enabled) in your setup.
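
To illustrate the first point, here is a minimal sketch of a TFoS map function that writes TensorBoard event files to a distributed filesystem rather than to local executor storage, so the events outlive the executors. The HDFS path, toy model, and data below are hypothetical placeholders (not taken from this issue), and it assumes your TF build can write to HDFS:

```python
import numpy as np
import tensorflow as tf

def main_fun(args, ctx):
    # Hypothetical toy model and data, stand-ins for the real training code.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="adam", loss="mse")
    x, y = np.random.rand(256, 10), np.random.rand(256, 1)

    # Write events to a shared/distributed location (hypothetical path) so the
    # files survive after the executors exit; one subdirectory per worker.
    log_dir = "hdfs://namenode:9000/tb_logs/worker_{}".format(ctx.worker_num)
    tb_cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir,
                                           profile_batch=(3, 5))  # profile batches 3-5
    model.fit(x, y, epochs=1, callbacks=[tb_cb])
```

After the Spark job completes, you can then start your own TensorBoard process against that location, e.g. tensorboard --logdir hdfs://namenode:9000/tb_logs, with a lifecycle independent of the job.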

leewyang · Mar 11 '21 16:03

There are no GPUs in the cluster, so the worker nodes rely only on CPUs to process the data. As I understood from your answer, the Device Collective Communication Time value is limited to GPUs and NCCL. Is there any way to capture this value while using only CPUs?

orwa-te · Mar 11 '21 20:03