TensorFlowOnSpark
TensorBoard files get deleted; Profiler returns 0 ms for communication time
Environment:
- Python version [3.7.7]
- Spark version [3.0.0]
- TensorFlow version [2.3.0]
- TensorFlowOnSpark version [2.2.2]
- Cluster version [Standalone]
Describe the bug: I have 2 issues regarding TensorBoard when training my model on 2 worker nodes:
1- After the training process completes, the TensorBoard files are deleted immediately on worker 1 while they are kept on worker 0, even though I can use TensorBoard to check details while training is running.
2- I am trying to profile my model to check the time consumed by batches 3 to 5 during training on the Profiler page, but I get 0 ms for communication time, specifically for Device Collective Communication Time and Device to Device Time. However, the Average Step Time shows reasonable values like 19368.9 ms! Also, the Hosts drop-down list shows only one detected host in the cluster, not 2. Why does this happen?
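For context, below is a minimal sketch of how the Profiler can be enabled for batches 3 to 5 through the Keras TensorBoard callback inside the TFoS worker function; the model, dataset, and the HDFS log path are placeholders for illustration, not my exact training code:

```python
import tensorflow as tf

def main_fun(args, ctx):
    # Illustrative worker function; model and dataset are placeholders.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
            tf.keras.layers.Dense(1)
        ])
        model.compile(optimizer='adam', loss='mse')

    # TensorBoard callback with the Profiler sampling batches 3 to 5.
    # log_dir should point to a shared/distributed filesystem so the event
    # files from all workers survive after the Spark executors exit.
    tb_callback = tf.keras.callbacks.TensorBoard(
        log_dir='hdfs://namenode:8020/tmp/tb_logs',  # placeholder path
        profile_batch=(3, 5))

    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.random.uniform([256, 10]), tf.random.uniform([256, 1]))).batch(32)
    model.fit(dataset, epochs=args.epochs, callbacks=[tb_callback])
```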
Logs: If applicable, add logs to help explain your problem. Note: errors may not be fully described in the driver/console logs. Make sure to check the executor logs for possible root causes.
Spark Submit Command Line: spark-submit --master spark://master:7077 train_file.py --cluster_size 2 --epochs 1
- When using the "built-in" TensorBoard server in TFoS (triggered by supplying tensorboard=True), the TB server is hosted on the "chief" worker, so it has the same lifecycle as the "chief" worker. That is, it will be killed when the Spark job completes. If you want visibility after the job completes, you can write the TB events to the shared/distributed filesystem and then spawn your own TB process pointing to this location.
- This sounds like more of a question for the TensorFlow team, since TFoS has nothing to do with these metrics. Regardless, I'm assuming that your environment somehow isn't set up to capture this information. For example, I'm guessing that "Device Collective Communication Time" is referring to something like NCCL, which you may not have (enabled) in your setup.
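To illustrate the suggestion above, here is a rough driver-side sketch, loosely following the TFoS Keras examples: the workers write their event files to a shared location (the HDFS path is a placeholder), the built-in TB server is disabled, and a standalone TensorBoard is launched after the job. The main_fun reference and the argparse flags are assumptions for illustration.

```python
import argparse
from pyspark import SparkConf, SparkContext
from tensorflowonspark import TFCluster

# main_fun is the worker/training function (e.g. a function like the sketch
# above), assumed to be defined in or imported into this script.
parser = argparse.ArgumentParser()
parser.add_argument('--cluster_size', type=int, default=2)
parser.add_argument('--epochs', type=int, default=1)
args = parser.parse_args()

sc = SparkContext(conf=SparkConf().setAppName("train_file"))

cluster = TFCluster.run(sc, main_fun, args, args.cluster_size,
                        num_ps=0,
                        tensorboard=False,   # skip the built-in TB server tied to the chief's lifecycle
                        input_mode=TFCluster.InputMode.TENSORFLOW,
                        master_node='chief')
cluster.shutdown()

# After the Spark job finishes, point a standalone TensorBoard at the shared
# log directory the workers wrote to (placeholder path):
#   tensorboard --logdir hdfs://namenode:8020/tmp/tb_logs
```

Note that reading event files directly from HDFS requires TensorFlow's HDFS filesystem support in the environment where TensorBoard runs; otherwise, copy the logs to local disk first.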
There are no GPUs in the cluster, so the worker nodes rely only on CPUs to process the data. As I understand from your answer, the Device Collective Communication time is limited to GPUs and NCCL. Is there any way to capture this value while using only CPUs?