tensorboardX

[WIP] SummaryReader

Open dsuess opened this issue 4 years ago • 5 comments

The missing support for reading tensorboard files was raised in #318. This PR adds support for iterating over tensorboard files. It's currently work-in-progress and I want to use this PR to discuss further development.

Currently, SummaryReader reads a single tfevents file and yields the parsed Event protobuf objects, similar to the summary_iterator function from tensorflow.python.summary.summary_iterator. Under the hood, I use a refactored version of PyRecordReader_New from tensorboard.compat.tensorflow_stub.pywrap_tensorflow to iterate over the records; SummaryReader itself only parses the protobuf Events.
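For readers unfamiliar with the file layout, here is a minimal sketch of the record framing that such a reader iterates over. It is not the code in this PR: `iter_records` and `frame` are hypothetical names, the CRC fields are skipped rather than verified, and the demo payloads are plain bytes rather than serialized Events.

```python
import io
import struct


def iter_records(stream):
    """Yield the raw payload of each record in a TFRecord-framed stream.

    Each record is laid out as:
      uint64 length | uint32 CRC of length | payload | uint32 CRC of payload
    The CRC fields are skipped here, not verified.
    """
    while True:
        header = stream.read(12)  # 8-byte length + 4-byte length CRC
        if len(header) < 12:
            return  # clean EOF or truncated header: stop iterating
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length)
        footer = stream.read(4)  # payload CRC, ignored in this sketch
        if len(payload) < length or len(footer) < 4:
            return  # truncated record, e.g. a file that is still being written
        yield payload


def frame(payload):
    """Demo helper: frame a payload with zeroed (invalid) CRC fields."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


demo = io.BytesIO(frame(b"first") + frame(b"second"))
records = list(iter_records(demo))
```

In a real reader, each payload would then be parsed into an Event protobuf message instead of being returned as bytes.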

How should we continue from here? One thing I wasn't sure about is whether to keep the current interface or to convert the Event objects into more Pythonic objects, e.g. dicts.

dsuess avatar Apr 28 '20 00:04 dsuess

Thanks for your contribution! Three issues come to my mind:

  1. Whether to dump sequentially or support random access:
  • A typical file size would be 1 GB. Since the global_step is saved in the event proto, we need to decode each event before we can find the event with a specific global_step. So an additional data structure is needed for fast random access.
  2. What the format of the extracted data should be:
  • Images are saved in encoded format, and tensorboardX supports the GIF format. It is trivial to save them as files (with the magic bytes?), but a better target would be a NumPy array (because it's a reader). How about the histogram or the audio plugin?
  3. How useful it is to dump different data types (scalar, image, ...):
  • Scalars can be downloaded as a JSON or CSV file from the TensorBoard web page, and images can be downloaded as well. But with a reader, users can get the scalar values without handling JSON or CSV files (a merit).
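The "additional data structure" for random access could be as simple as a map from global_step to byte offsets, built in one sequential pass. The sketch below only illustrates that idea: `build_step_index`, `read_at`, and the `step_of` callable are hypothetical, CRCs are skipped, and the demo uses a toy payload whose first 8 bytes hold the step instead of a real Event proto.

```python
import io
import struct


def build_step_index(stream, step_of):
    """Scan a record-framed stream once and map step -> list of byte offsets.

    `step_of` is a hypothetical callable that decodes a payload and returns
    its global step. Several records may share one step (different tags).
    """
    index = {}
    while True:
        offset = stream.tell()
        header = stream.read(12)  # uint64 length + uint32 length CRC
        if len(header) < 12:
            break
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length)
        if len(payload) < length or len(stream.read(4)) < 4:
            break  # truncated record
        index.setdefault(step_of(payload), []).append(offset)
    return index


def read_at(stream, offset):
    """Read the payload of the record that starts at `offset`."""
    stream.seek(offset)
    (length,) = struct.unpack("<Q", stream.read(8))
    stream.seek(4, io.SEEK_CUR)  # skip the length CRC
    return stream.read(length)


# Toy stand-ins for the demo: the step is the first 8 bytes of the payload.
def frame(payload):
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


def toy_payload(step, body):
    return struct.pack("<q", step) + body


def toy_step(payload):
    return struct.unpack("<q", payload[:8])[0]


stream = io.BytesIO(
    b"".join(frame(toy_payload(s, b"img%d" % s)) for s in (5, 7, 8))
)
index = build_step_index(stream, toy_step)
payload_step_7 = read_at(stream, index[7][0])
```

Building the index still costs one full decode pass, so it only pays off when the same file is queried more than once.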

Personally, I want to save all images of an experiment (though that may be too much data), so I would like to pass an additional tag parameter to SummaryReader to filter the images I want. Here's how I would design the interface (image plugin only):

reader = SummaryReader(filename, build_index=True)
encoded_images = reader.read_image('my_tag')
encoded_image = reader.read_image('my_tag', global_step=5)
encoded_image = reader.read_image('my_tag', global_step=7)
image = reader.read_image_as_numpy('my_tag', global_step=8)

reader = SummaryReader(filename, build_index=False)
filenames = reader.read_image('my_tag', dump=True)
filenames = reader.dump_images('my_tag')  # I think it's better

What are your use cases?

lanpa avatar May 04 '20 18:05 lanpa

Thanks for your feedback. My main goal so far was to replace the summary_iterator from TensorFlow, which does what the current implementation does. We use it mainly for parsing the results from a tfevent file into a DataFrame for further processing or visualization.
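To make the DataFrame use case concrete, here is a rough sketch of flattening scalar events into rows. It assumes scalars are stored in `simple_value` and relies only on the Event fields `step`, `wall_time`, `summary.value`, and `tag`; `scalar_rows`, `FakeValue`, and `fake_event` are hypothetical names, and the fake objects stand in for real protobuf messages so the example runs without any dependency.

```python
from types import SimpleNamespace


def scalar_rows(events):
    """Flatten Event-like objects into one dict per scalar value."""
    rows = []
    for event in events:
        # Events without a summary (e.g. the file-version header) have no values.
        for value in event.summary.value:
            if value.HasField("simple_value"):
                rows.append(
                    {
                        "step": event.step,
                        "wall_time": event.wall_time,
                        "tag": value.tag,
                        "value": value.simple_value,
                    }
                )
    return rows


class FakeValue:
    """Stand-in for a summary value; mimics the protobuf HasField method."""

    def __init__(self, tag, **fields):
        self.tag = tag
        self._fields = fields
        self.__dict__.update(fields)

    def HasField(self, name):
        return name in self._fields


def fake_event(step, wall_time, values):
    return SimpleNamespace(
        step=step, wall_time=wall_time, summary=SimpleNamespace(value=values)
    )


events = [
    fake_event(0, 100.0, []),  # header-like event with no values
    fake_event(1, 101.0, [FakeValue("loss", simple_value=0.5)]),
    fake_event(2, 102.0, [FakeValue("loss", simple_value=0.25),
                          FakeValue("img", image=b"...")]),
]
rows = scalar_rows(events)
```

A list of dicts like `rows` can be passed directly to `pandas.DataFrame(rows)` for further processing.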

Regarding your questions:

  1. Sequential reading is definitely easier. I'd have to read the TF source code to see how they deal with random access.

  2. Agreed, it would be nice if we didn't just return the raw protobuf objects but converted them into something more Pythonic. I think that's easy for scalars and lists. Histograms could be converted to TF's histogram data structure represented by two NumPy arrays, which would be cheap too. For images, I would check whether we can handle the image decoding lazily, e.g. through Pillow.

dsuess avatar May 05 '20 22:05 dsuess

Hey @dsuess, thanks for the PR! Our team is also very interested in this feature. Are you still working on it, and is there an ETA?

kaiwenw avatar Jul 31 '20 06:07 kaiwenw

Hi @kaiwenw, I'd love to keep working on this. What are your use cases? Currently, it's a bit rough and limited, but it does what I need.

dsuess avatar Jul 31 '20 06:07 dsuess

Hi @dsuess, we usually need to retrieve the end of the log, mostly for debugging purposes. For example, we have hard cutoffs in integration tests, and it would be nice to retrieve the end of the log programmatically in a notebook as well.
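With a purely sequential reader, one way to get the end of the log is to keep only the last few items of the iterator in a bounded buffer. This is just a sketch of that idea: `last_events` is a hypothetical helper, and any iterable of events (here plain integers) can stand in for the reader.

```python
from collections import deque


def last_events(events, n=10):
    """Return the final `n` items of an iterable, holding at most `n` in memory."""
    return list(deque(events, maxlen=n))


# Demo with integers standing in for parsed events.
tail = last_events(iter(range(100)), n=3)
```

This still decodes every event once, but memory use stays bounded no matter how large the file is.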

As for data types, we probably just need a list of scalars, histograms, and maybe embeddings (no images or audio needed).

kaiwenw avatar Jul 31 '20 17:07 kaiwenw