Facing issue while converting tensorboard logs to Aim

Open Sharathmk99 opened this issue 3 years ago • 6 comments

🐛 Bug

I'm trying to convert a TensorBoard event log file to an Aim Run, but I get the error below.

One more question: is there a way to sync TensorBoard logs in real time, i.e. in parallel while training is still in progress? Currently the only option is a CLI command that syncs once the TensorBoard logs are already in place.

Many thanks!

The lock file /mnt/c/sharath_mk/ubuntu/aim/.aim/.repo_lock is on a filesystem of type `drvfs` (device id: 14). Using soft file locks to avoid potential data corruption.
2022-07-25 15:21:57.067693: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-07-25 15:21:57.067771: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Converting TensorBoard logs:   0%|                                                                                                                         | 0/1 [00:00<?, ?it/sWARNING:tensorflow:From /home/miniconda3/lib/python3.8/site-packages/tensorflow/python/summary/summary_iterator.py:27: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
Parsing logs in /mnt/c/sharath_mk/ubuntu/aim/tensorboard/run_tb_sync/test_tb:   0%|                                                              | 0/2 [00:00<?, ?it/s]
Converting TensorBoard logs:   0%|                                                                                                                         | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/miniconda3/bin/aim", line 8, in <module>
    sys.exit(cli_entry_point())
  File "/home/miniconda3/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/miniconda3/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/miniconda3/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/miniconda3/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/miniconda3/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/miniconda3/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/miniconda3/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/miniconda3/lib/python3.8/site-packages/aim/cli/convert/commands.py", line 39, in convert_tensorboard
    parse_tb_logs(logdir, repo_inst, flat, no_cache)
  File "/home/miniconda3/lib/python3.8/site-packages/aim/cli/convert/processors/tensorboard.py", line 220, in parse_tb_logs
    track_val = value.tensor.float_val[0]
IndexError: list index (0) out of range

To reproduce

Log a TensorBoard event log file.

Expected behavior

Environment

  • Aim Version (e.g., 3.0.1)
  • Python version
  • pip version
  • OS (e.g., Linux)
  • Any other relevant information

Additional context

Sharathmk99 avatar Jul 25 '22 14:07 Sharathmk99

Hi @Sharathmk99! Thanks for the report. Regarding real-time conversion, unfortunately that's not possible currently. Could you describe your use case a bit? Isn't it possible to track directly with Aim instead and avoid the conversion? And regarding the issue, could you please provide an example tfevent file? That would help a lot with debugging.

mihran113 avatar Jul 25 '22 14:07 mihran113

Hi @mihran113, thank you for the response. Unfortunately, in the short term it's not possible to use Aim directly: our code base is deeply integrated with TensorBoard, and integrating directly with Aim would take a lot of time.

Our use case: we start the training process in the main thread, and data is loaded in multiple threads in parallel. The main (training) process logs metrics to TensorBoard, which in turn creates event log files. We need to visualize the metrics in real time and compare them with other runs, so the TensorBoard logs need to be visible in the UI in real time.

I'll try to create a sample tfevent file without any company data in it and share it with you asap. Many thanks.

Sharathmk99 avatar Jul 25 '22 14:07 Sharathmk99

@mihran113 I was able to fix the issue with a code change to how tensors are handled. Do you think I should open a PR with the changes? It could help others who log metrics as tensors.

Thanks

Sharathmk99 avatar Jul 25 '22 22:07 Sharathmk99
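
The actual code change is not shown in this thread. As a rough illustration of the failure in the traceback, `value.tensor.float_val[0]` raises `IndexError` when `float_val` is empty, which can happen when the writer stores the values as raw bytes in the tensor's `tensor_content` field instead. The sketch below is a minimal, dependency-free version of that fallback, assuming a float32 tensor stored in little-endian byte order. `FakeTensor` and `scalar_from_tensor` are hypothetical names used only for illustration; they are not part of Aim or TensorFlow.

```python
import struct
from dataclasses import dataclass, field
from typing import List, Optional

DT_FLOAT = 1  # value of DT_FLOAT (float32) in TensorFlow's DataType enum


@dataclass
class FakeTensor:
    """Hypothetical stand-in exposing only the TensorProto fields used below."""
    dtype: int = DT_FLOAT
    float_val: List[float] = field(default_factory=list)
    tensor_content: bytes = b""


def scalar_from_tensor(tensor) -> Optional[float]:
    """Return the first float32 value of the tensor, or None if it has none."""
    # Case 1: the value is stored in the repeated float_val field.
    if len(tensor.float_val) > 0:
        return float(tensor.float_val[0])
    # Case 2: the value is stored as raw bytes (assumed little-endian float32).
    if tensor.dtype == DT_FLOAT and len(tensor.tensor_content) >= 4:
        return struct.unpack("<f", tensor.tensor_content[:4])[0]
    # Nothing usable: let the caller decide whether to skip this record.
    return None
```

In real conversion code, a helper such as `tensorboard.util.tensor_util.make_ndarray` (or `tf.make_ndarray`) would likely be a better choice, since it decodes other dtypes and shapes as well. The hand-rolled version above only shows why indexing `float_val` directly is not enough.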

@Sharathmk99 would appreciate that a lot. Regarding the real-time convert, I guess setting up a cron job, that calls the convert command every 5 minutes for example, would do the trick for a short term, would that help you? I'll check that out myself and will let you know if it works as expected.

mihran113 avatar Jul 26 '22 10:07 mihran113
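
The periodic approach suggested above can also be sketched without cron, as a small Python loop that re-runs the convert command at a fixed interval. This is only a sketch: it assumes the CLI form `aim convert tensorboard --logdir <path>`, and `build_convert_command` and `sync_periodically` are hypothetical helper names, not part of Aim. The `run` and `sleep` parameters exist only so the loop can be exercised without invoking the real CLI.

```python
import subprocess
import time
from typing import Callable, List, Optional


def build_convert_command(logdir: str) -> List[str]:
    # Assumed CLI form: aim convert tensorboard --logdir <path>
    return ["aim", "convert", "tensorboard", "--logdir", logdir]


def sync_periodically(
    logdir: str,
    interval_seconds: float = 300.0,  # every 5 minutes, as suggested above
    iterations: Optional[int] = None,  # None means run until interrupted
    run: Callable[..., object] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the convert command repeatedly; return how many times it ran."""
    count = 0
    while iterations is None or count < iterations:
        # check=False: one failed conversion should not stop later attempts.
        run(build_convert_command(logdir), check=False)
        count += 1
        # Sleep only if another iteration will follow.
        if iterations is None or count < iterations:
            sleep(interval_seconds)
    return count
```

Calling `sync_periodically("path/to/tensorboard/logs")` alongside the training job would then refresh the converted runs roughly every five minutes. The traceback above shows a `no_cache` argument, which suggests the converter keeps some cache of processed logs, but this thread does not confirm how repeated conversions of a growing event file behave.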

It would be better for the Run class to accept a new parameter called sync_tensorboard_dir that takes a TensorBoard event log directory and starts a separate thread to monitor file events and sync any changes. Every subfolder inside the TensorBoard event log directory becomes an entity. What do you think, @mihran113?

Sharathmk99 avatar Jul 31 '22 08:07 Sharathmk99
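
To make the proposal above more concrete, here is a minimal sketch of the monitoring part only: a background thread that polls a directory tree and reports new or modified files to a callback. It is not Aim's implementation, and `sync_tensorboard_dir` is the parameter name proposed in this thread, not an existing one. `scan_changes` and `DirectoryWatcher` are hypothetical names, and the sketch polls modification times rather than subscribing to file system events.

```python
import os
import threading
from typing import Callable, Dict, List


def scan_changes(directory: str, seen: Dict[str, float]) -> List[str]:
    """Return files under directory that are new or modified since the last scan."""
    changed = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file disappeared between listing and stat
            if seen.get(path) != mtime:
                seen[path] = mtime  # remember the latest modification time
                changed.append(path)
    return sorted(changed)


class DirectoryWatcher(threading.Thread):
    """Polls a directory and passes changed file paths to on_change."""

    def __init__(self, directory: str,
                 on_change: Callable[[List[str]], None],
                 poll_seconds: float = 5.0) -> None:
        super().__init__(daemon=True)
        self.directory = directory
        self.on_change = on_change
        self.poll_seconds = poll_seconds
        self._seen: Dict[str, float] = {}
        self._stop_event = threading.Event()

    def run(self) -> None:
        while not self._stop_event.is_set():
            changed = scan_changes(self.directory, self._seen)
            if changed:
                self.on_change(changed)
            # wait() returns early if stop() is called during the pause
            self._stop_event.wait(self.poll_seconds)

    def stop(self) -> None:
        self._stop_event.set()
```

In such a design, the `on_change` callback would be where the changed event files are parsed and their values tracked into a run, with each subfolder mapped to its own entity as proposed. Polling keeps the sketch free of third-party dependencies; an event-based file watcher would be the natural alternative if lower latency matters.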

I'll open a separate issue to track this request.

Sharathmk99 avatar Jul 31 '22 08:07 Sharathmk99