TensorBoardLogger.jl
Summaries require names of format `name/tag`
After updating to ProtoBuf 1.0.0 (#124) I found that summaries are not logged correctly to TensorBoard. Some of them do get logged but some don't. I suspect that some summaries are fine, but after one of them is logged incorrectly, the file or TensorBoard stops registering the ones following it.
I prepared a minimal reproduction by revising the Flux example to the new Flux API (the existing example uses a deprecated API): https://github.com/nomadbl/TensorBoardLogger.jl/commit/fc9ba3e0ee6c88eb96d7317fa46ce24f2e94d11d
During logging I observe this warning:
[2023-07-01T23:12:43Z WARN rustboard_core::run] Read error in ./content/log/events.out.tfevents.1.68825314069885e9.lior-HP-Pavilion-Laptop-15-cs3xxx: ReadRecordError(BadLengthCrc(ChecksumError { got: MaskedCrc(0x85987b32), want: MaskedCrc(0x00000000) }))
After some googling I can only speculate that this has something to do with multiprocessing, with the file being written by multiple instances of the logger in different threads.
So far I tried (without success) to fix it under that assumption by specifying the logger should lock the file:
src/TBLogger.jl, 119:
file = open(fpath, "w"; lock=true)
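For context on what the warning is checking, here is a minimal sketch of the TFRecord framing that event files use, as far as I understand it: each record is an 8-byte little-endian length, a masked CRC32C of those length bytes, the payload, and a masked CRC32C of the payload. I'm assuming `BadLengthCrc` refers to the first of those two checksums. The helper names (`masked_crc`, `frame_record`, `length_crc_ok`) are mine and are not part of TensorBoardLogger.jl.

```julia
using CRC32c  # Julia standard library

# Masked CRC as used by the TFRecord framing:
# rotate the CRC32C right by 15 bits, then add a constant (wrapping in UInt32).
function masked_crc(data::Vector{UInt8})
    crc = crc32c(data)
    return ((crc >> 15) | (crc << 17)) + 0xa282ead8
end

# Frame one payload: length (UInt64, little-endian), masked CRC of the
# length bytes, the payload itself, masked CRC of the payload.
function frame_record(payload::Vector{UInt8})
    io = IOBuffer()
    len_bytes = collect(reinterpret(UInt8, [htol(UInt64(length(payload)))]))
    write(io, len_bytes)
    write(io, htol(masked_crc(len_bytes)))
    write(io, payload)
    write(io, htol(masked_crc(payload)))
    return take!(io)
end

# Check only the length header of a framed record (hypothetical helper).
function length_crc_ok(record::Vector{UInt8})
    len_bytes = record[1:8]
    stored = ltoh(reinterpret(UInt32, record[9:12])[1])
    return stored == masked_crc(len_bytes)
end

rec = frame_record(Vector{UInt8}("hello"))
println(length_crc_ok(rec))
```

If a reader hits bytes that are not the start of a valid record (for example because an earlier record was written with the wrong size), this header check fails, which would be one way to get such a warning without any concurrent writers. That is only a guess at this point.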
Any other ideas or insights are welcome. I'll try to isolate the issue using the above-mentioned reproduction.
I succeeded in altering the Flux example such that the bug does not occur: https://github.com/nomadbl/TensorBoardLogger.jl/commit/e0f2245c8b2bcf319cc987653d39f6e6444a2a39
The trick was to change lines like
@info "train" loss=loss_fn(pred, y) acc=accuracy(pred, y)
into
@info "train/vals" loss=loss_fn(pred, y) acc=accuracy(pred, y)
That is, the bug is somehow related to tag names: messages named in the `name/tag` format are logged correctly, while a bare name like "train" is not.
Since the workaround above avoids the problem, I'm leaving this for now.
I suspect that this has to be fixed by setting node_name or tag correctly in Summary_Values (i.e. var"Summary.Value")
I wasn't able to determine how to do this by reading the TensorBoard/TensorFlow documentation. It looks like a pretty in-depth understanding is required there.