tensorboardX
[WIP] SummaryReader
The missing support for reading tensorboard files was raised in #318. This PR adds support for iterating over tensorboard files. It's currently work-in-progress and I want to use this PR to discuss further development.
Currently, SummaryReader reads a single tfevents file and yields the parsed Event protobuf objects, similar to the summary_iterator function from tensorflow.python.summary.summary_iterator. Under the hood, I use a refactored version of PyRecordReader_New from tensorboard.compat.tensorflow_stub.pywrap_tensorflow to iterate over the records, and SummaryReader only parses the protobuf Events.
How should we continue from here? One thing I wasn't sure about is whether we want to keep the current interface or convert the Event-objects into more pythonic objects, e.g. dicts.
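To make the record iteration described above concrete, here is a minimal sketch of reading the TFRecord framing that tfevents files use: each record is a little-endian uint64 length, a uint32 masked CRC of the length, the payload, and a uint32 masked CRC of the payload. The function name `iter_records` is hypothetical and is not the PR's implementation; the sketch also skips CRC verification, which a real reader would probably want.

```python
import struct


def iter_records(path):
    """Yield the raw payload bytes of each record in a TFRecord-framed file.

    Assumption: CRC fields are read but not verified in this sketch.
    """
    with open(path, "rb") as f:
        while True:
            # Header: uint64 payload length + uint32 masked CRC of the length.
            header = f.read(12)
            if len(header) == 0:
                return  # clean end of file
            if len(header) < 12:
                raise EOFError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            payload = f.read(length)
            # Footer: uint32 masked CRC of the payload (ignored here).
            footer = f.read(4)
            if len(payload) < length or len(footer) < 4:
                raise EOFError("truncated record body")
            yield payload
```

Each payload is a serialized Event message, so it could then be parsed with the protobuf classmethod `FromString` on the generated Event class (assuming the `event_pb2` module shipped with tensorboardX or TensorBoard is importable).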
Thanks for your contribution! Three issues come to my mind:
- Whether to dump sequentially or support random access:
- A typical file size would be 1 GB. Since the global_step is saved in the event proto, we need to decode each event before we can find the event with a specific global_step. So an additional data structure is needed for fast random access.
- What is the format of the extracted data:
- The image is saved in encoded format, and tensorboardX supports the GIF format. It is trivial to save them as files (with the magic bytes?), but a better target would be a NumPy array (because it's a reader). How about the histogram or the audio plugin?
- Usefulness of dumping different data types (scalar, image, ...):
The scalars can be downloaded as a JSON or CSV file from the TensorBoard web page. Images can be downloaded as well. But with a reader, users can get the scalar values without handling JSON or CSV files (a merit).
Personally, I want to save all images of an experiment, although that may be too much data.
Then I would like to pass an additional parameter tag to the SummaryReader to filter the images I want. Here's how I would design the interface (image plugin only):
reader = SummaryReader(filename, build_index=True)
encoded_images = reader.read_image('my_tag')
encoded_image = reader.read_image('my_tag', global_step=5)
encoded_image = reader.read_image('my_tag', global_step=7)
image = reader.read_image_as_numpy('my_tag', global_step=8)
reader = SummaryReader(filename, build_index=False)
filenames = reader.read_image('my_tag', dump=True)
filenames = reader.dump_images('my_tag')  # I think it's better
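One possible shape for the extra data structure behind build_index=True is a dictionary from (tag, global_step) to the byte offsets of the matching records, filled in during a single sequential pass. This is only a sketch: the function name `build_index` and the `(offset, step, tags)` input tuples are hypothetical, and how the offsets are collected while decoding is left open.

```python
from collections import defaultdict


def build_index(records):
    """Map (tag, global_step) to the byte offsets of the matching records.

    `records` is an iterable of (offset, step, tags) tuples. This input
    shape is an assumption: it stands for what one sequential pass that
    decodes every event could emit.
    """
    index = defaultdict(list)
    for offset, step, tags in records:
        for tag in tags:
            index[(tag, step)].append(offset)
    return dict(index)
```

With such an index, a call like read_image('my_tag', global_step=5) would only need to seek to the stored offsets and decode those records, instead of decoding the whole file again.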
What are your use cases?
Thanks for your feedback. My main goal so far was to replace the summary_iterator from TensorFlow, which does what the current implementation does. We use it mainly for parsing the results from a tfevents file into a DataFrame for further processing or visualization.
Regarding your questions:
- Sequential access is definitely easier; I'd have to read the TF source code to see how they deal with random access.
- Agreed, it would be nice if we didn't just return the raw protobuf objects, but converted them into something more pythonic. I think it's easy for scalars and lists. Histograms could be converted to TF's histogram data structure, represented by two NumPy arrays, which would be cheap too. For images, I would check whether we can handle the image decoding lazily, e.g. through Pillow.
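As an illustration of the "more pythonic" conversion for scalars, the sketch below flattens events into plain dicts. It assumes each event exposes `wall_time`, `step`, and `summary.value` entries with `tag` and `simple_value`, as in the Event and Summary protos; the function names `scalar_rows` and `_is_scalar` are hypothetical.

```python
def _is_scalar(value):
    """Return True if a summary value carries a scalar.

    Protobuf messages offer HasField for oneof members; the hasattr
    fallback only exists so that simple stand-in objects work too.
    """
    has_field = getattr(value, "HasField", None)
    if has_field is not None:
        return has_field("simple_value")
    return hasattr(value, "simple_value")


def scalar_rows(events):
    """Flatten scalar summaries into a list of dicts, one per value."""
    rows = []
    for event in events:
        for value in event.summary.value:
            if _is_scalar(value):
                rows.append({
                    "wall_time": event.wall_time,
                    "step": event.step,
                    "tag": value.tag,
                    "value": value.simple_value,
                })
    return rows
```

A list of dicts like this can be passed directly to `pandas.DataFrame(rows)` for further processing or visualization.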
Hey @dsuess, thanks for the PR! Our team is also very interested in this feature. Are you still working on it, and is there an ETA?
Hi @kaiwenw, I'd love to keep working on this. What are your use cases? Currently, it's a bit rough and limited, but it does what I need.
Hi @dsuess, we usually need to retrieve the end of the log, mostly for debugging purposes. For example, we have hard cutoffs in integration tests, and it would be nice to retrieve the end of the log programmatically in a notebook as well.
As for data types, we probably just need a list of scalars, histograms and maybe embeddings (no images or audio needed).
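For the "end of the log" use case, a sequential reader is already enough if memory is bounded with a fixed-size buffer. The helper below is a minimal sketch; the name `tail` is hypothetical, and it works on any iterable of events, such as the one a SummaryReader would yield.

```python
from collections import deque


def tail(events, n=10):
    """Return the last `n` items of an iterable, reading it only once."""
    # A deque with maxlen discards the oldest item on each append,
    # so at most `n` events are held in memory at any time.
    return list(deque(events, maxlen=n))
```

This still decodes every event in the file once, so it avoids memory pressure but not the cost of a full pass; skipping straight to the end would need an index or random access.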