TensorBoardLogger.jl

Summaries require names of format `name/tag`

Open nomadbl opened this issue 2 years ago • 2 comments

After updating to ProtoBuf 1.0.0 (#124) I found that summaries are not logged correctly to TensorBoard. Some of them get logged and some don't. I suspect that some summaries are fine, but once one is logged incorrectly, the file or TensorBoard stops registering the ones that follow.

I prepared a minimal reproducing example by porting the Flux example to the new Flux API (the existing example uses a deprecated API): https://github.com/nomadbl/TensorBoardLogger.jl/commit/fc9ba3e0ee6c88eb96d7317fa46ce24f2e94d11d

During logging I observe an error message

[2023-07-01T23:12:43Z WARN  rustboard_core::run] Read error in ./content/log/events.out.tfevents.1.68825314069885e9.lior-HP-Pavilion-Laptop-15-cs3xxx: ReadRecordError(BadLengthCrc(ChecksumError { got: MaskedCrc(0x85987b32), want: MaskedCrc(0x00000000) }))

After some googling, I can only speculate that this has something to do with multiprocessing, with the file being written by multiple instances of the logger in different threads. So far I have tried (without success) to fix it under that assumption by making the logger lock the file, in src/TBLogger.jl, line 119: `file = open(fpath, "w"; lock=true)`
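For anyone debugging the same warning, here is a minimal sketch of a checker for the event file's record framing. It assumes the standard TFRecord layout (8-byte little-endian length, masked CRC-32C of the length bytes, payload, masked CRC-32C of the payload) and that `BadLengthCrc` refers to the first of those two checksums. `masked_crc`, `le_bytes`, `write_record` and `first_bad_record` are hypothetical helper names, not part of TensorBoardLogger.jl.

```julia
using CRC32c  # Julia standard library

# TFRecord masks each CRC-32C value; UInt32 arithmetic wraps on overflow.
function masked_crc(bytes::Vector{UInt8})
    c = crc32c(bytes)
    return ((c >> 15) | (c << 17)) + 0xa282ead8
end

# Little-endian byte encoding of an unsigned integer.
le_bytes(x::Unsigned) = collect(reinterpret(UInt8, [htol(x)]))

# Hypothetical helper: frame one payload as length, length CRC, data, data CRC.
function write_record(io::IO, payload::Vector{UInt8})
    len = le_bytes(UInt64(length(payload)))
    write(io, len, le_bytes(masked_crc(len)), payload, le_bytes(masked_crc(payload)))
end

# Hypothetical helper: index of the first record that fails a check, or 0 if none do.
function first_bad_record(io::IO)
    n = 0
    while !eof(io)
        n += 1
        len = read(io, 8)
        len_crc = read(io, 4)
        (length(len) == 8 && length(len_crc) == 4) || return n   # truncated header
        le_bytes(masked_crc(len)) == len_crc || return n         # length checksum mismatch
        nbytes = Int(ltoh(reinterpret(UInt64, len)[1]))
        data = read(io, nbytes)
        data_crc = read(io, 4)
        (length(data) == nbytes && length(data_crc) == 4) || return n  # truncated payload
        le_bytes(masked_crc(data)) == data_crc || return n       # data checksum mismatch
    end
    return 0
end
```

Calling `open(first_bad_record, path)` on an event file would then report which record first breaks the framing. That can help tell a partially written or interleaved file apart from one whose records are framed correctly but carry unexpected contents.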

Any other ideas or insights are welcome. I'll try to isolate the issue using the above mentioned reproducing code.

nomadbl, Jul 01 '23 23:07

I succeeded in altering the Flux example so that the bug does not occur: https://github.com/nomadbl/TensorBoardLogger.jl/commit/e0f2245c8b2bcf319cc987653d39f6e6444a2a39

The trick was to change lines like `@info "train" loss=loss_fn(pred, y) acc=accuracy(pred, y)` into `@info "train/vals" loss=loss_fn(pred, y) acc=accuracy(pred, y)`

That is, the bug is somehow related to tag names.
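A minimal sketch of that workaround as a reusable helper, so that every message ends up in `name/tag` form. `ensure_tag` and `log_metrics` are hypothetical names and the `"vals"` default is only an illustration. The sketch uses the standard `Logging` library alone, so it runs with any logger.

```julia
using Logging

# Hypothetical helper: append a default tag when the message has no `/`.
ensure_tag(name::AbstractString; default="vals") =
    occursin('/', name) ? String(name) : string(name, '/', default)

# Log one step of metrics under a message of the form `name/tag`.
function log_metrics(logger::AbstractLogger, name, loss, acc)
    with_logger(logger) do
        @info ensure_tag(name) loss=loss acc=acc
    end
end
```

With TensorBoardLogger.jl loaded, something like `log_metrics(TBLogger("content/log"), "train", loss, acc)` should then log under `train/vals`, matching the modified example above.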

nomadbl, Jul 02 '23 14:07

Since the workaround above seems to work, I'm leaving this for now. I suspect the proper fix is to set `node_name` or `tag` correctly in `Summary_Values` (i.e. `var"Summary.Value"`). I wasn't able to determine how to do this from the TensorBoard/TensorFlow documentation; it looks like it requires a fairly in-depth understanding of those formats.

nomadbl, Jul 02 '23 18:07