
Tensorboard callback is blocking process

Open • chenmoneygithub opened this issue 2 years ago • 1 comment

As one of our users pointed out, the TensorBoard callback performs blocking I/O, so training is halted until each write finishes. This creates a performance bottleneck, especially when writing to cloud storage.

chenmoneygithub, Aug 01 '22 21:08
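
To make the bottleneck described above concrete, here is a minimal sketch using only the Python standard library. It is not the Keras callback: `slow_write`, `train_blocking` and `train_background` are hypothetical names, and `time.sleep` stands in for a slow flush to remote storage. It compares a loop that waits for every write with one that hands writes to a worker thread.

```python
import queue
import threading
import time


def slow_write(record, delay):
    """Hypothetical stand-in for a blocking flush to slow storage."""
    # Assumption: time.sleep releases the GIL, so a worker thread can overlap
    # with the main loop. A native flush that holds the GIL would not.
    time.sleep(delay)


def train_blocking(steps, delay):
    """Each step waits for its own write; returns the loop's wall time."""
    start = time.perf_counter()
    for step in range(steps):
        slow_write(step, delay)
    return time.perf_counter() - start


def train_background(steps, delay):
    """Each step only enqueues; a worker thread performs the writes."""
    pending = queue.Queue()

    def worker():
        while True:
            record = pending.get()
            if record is None:  # sentinel: no more records
                break
            slow_write(record, delay)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    start = time.perf_counter()
    for step in range(steps):
        pending.put(step)
    loop_time = time.perf_counter() - start

    pending.put(None)
    thread.join()  # drain remaining writes before returning
    return loop_time


if __name__ == "__main__":
    print("blocking loop:  ", train_blocking(5, 0.02))
    print("background loop:", train_background(5, 0.02))
```

The background variant only helps when the write itself releases the GIL, which is exactly the point questioned in the reply below.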

It's been a while since I've looked at this code, but I think this is what happens.

tf.summary.create_file_writer creates a _ResourceSummaryWriter https://github.com/tensorflow/tensorflow/blob/v2.9.1/tensorflow/python/ops/summary_ops_v2.py#L559

The summary context manager always calls writer.flush() https://github.com/tensorflow/tensorflow/blob/v2.9.1/tensorflow/python/ops/summary_ops_v2.py#L91

writer.flush() calls flush_summary_writer https://github.com/tensorflow/tensorflow/blob/v2.9.1/tensorflow/python/ops/summary_ops_v2.py#L347

That in turn calls the flush op https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/summary_kernels.cc#L114

^ I think this op is somehow not releasing the GIL, but I haven't specifically checked that. I just noticed long delays during training even when running the callback in another thread.

The op then calls flush and InternalFlush in C++, which do the actual flush. https://github.com/tensorflow/tensorflow/blob/v2.9.1/tensorflow/core/summary/summary_file_writer.cc#L69

mgraczyk, Aug 01 '22 21:08
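
The call chain traced above can be exercised directly with the public `tf.summary` API. The sketch below assumes TensorFlow 2.x in eager mode; `time_summary_writes` is a hypothetical helper name, and the log directory is a local temporary folder, so the measured times will be far smaller than with cloud storage. It only times how long each `as_default()` block takes, including the flush on exit, and does not confirm or refute the GIL hypothesis.

```python
import tempfile
import time

import tensorflow as tf


def time_summary_writes(logdir, steps=3):
    """Time each write, including the flush triggered when the context exits."""
    writer = tf.summary.create_file_writer(logdir)
    durations = []
    for step in range(steps):
        start = time.perf_counter()
        # Leaving this block flushes the writer, as described in the thread.
        with writer.as_default():
            tf.summary.scalar("loss", 0.5, step=step)
        durations.append(time.perf_counter() - start)
    writer.close()
    return durations


if __name__ == "__main__":
    for i, seconds in enumerate(time_summary_writes(tempfile.mkdtemp())):
        print(f"step {i}: {seconds:.6f} s")
```

Pointing `logdir` at a remote filesystem instead of the temporary folder is the case the issue is about; the main thread stays inside the `with` block until the flush returns.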