Facing issue while converting tensorboard logs to Aim
🐛 Bug
I'm trying to convert a TensorBoard event log file to an Aim Run, but I get the error below.
One more question: is there a way to sync TensorBoard logs in real time, i.e. while training is still in progress? Currently the only option is a CLI command that syncs once the TensorBoard logs are already in place.
Many thanks!
The lock file /mnt/c/sharath_mk/ubuntu/aim/.aim/.repo_lock is on a filesystem of type `drvfs` (device id: 14). Using soft file locks to avoid potential data corruption.
2022-07-25 15:21:57.067693: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-07-25 15:21:57.067771: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Converting TensorBoard logs: 0%| | 0/1 [00:00<?, ?it/sWARNING:tensorflow:From /home/miniconda3/lib/python3.8/site-packages/tensorflow/python/summary/summary_iterator.py:27: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
`tf.data.TFRecordDataset(path)`
Parsing logs in /mnt/c/sharath_mk/ubuntu/aim/tensorboard/run_tb_sync/test_tb: 0%| | 0/2 [00:00<?, ?it/s]
Converting TensorBoard logs: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/miniconda3/bin/aim", line 8, in <module>
sys.exit(cli_entry_point())
File "/home/miniconda3/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/miniconda3/lib/python3.8/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/miniconda3/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/miniconda3/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/miniconda3/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/miniconda3/lib/python3.8/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/miniconda3/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/miniconda3/lib/python3.8/site-packages/aim/cli/convert/commands.py", line 39, in convert_tensorboard
parse_tb_logs(logdir, repo_inst, flat, no_cache)
File "/home/miniconda3/lib/python3.8/site-packages/aim/cli/convert/processors/tensorboard.py", line 220, in parse_tb_logs
track_val = value.tensor.float_val[0]
IndexError: list index (0) out of range
To reproduce
Log a TensorBoard event log file
Expected behavior
Environment
- Aim Version (e.g., 3.0.1)
- Python version
- pip version
- OS (e.g., Linux)
- Any other relevant information
Additional context
Hi @Sharathmk99! Thanks for the report.
Regarding real-time conversion, unfortunately that's not possible currently. Could you describe your use case in a bit more detail? Would it be possible to track directly with Aim instead and avoid the conversion?
And regarding the issue, could you please provide an example tfevent file? That would help a lot for debugging the issue.
Hi @mihran113 Thank you for the response.
Unfortunately, in the short term we can't use Aim directly: our code base is deeply integrated with TensorBoard, and integrating directly with Aim would take a lot of time.
Our use case: we start the training process in the main thread, and data is loaded in multiple threads in parallel. The main (training) process logs metrics to TensorBoard, which in turn creates event log files. We need to visualize the metrics in real time and compare them with other runs, so we need the TensorBoard logs to be visible in the UI in real time.
I'll try to create a sample tfevent file without any company data in it and share it with you ASAP. Many thanks.
@mihran113 I was able to fix the issue with a code change to how tensors are handled. Do you think I should open a PR with the changes? It could help others who log metrics as tensors.
Thanks
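The actual fix is not shown in this thread. As a rough sketch of the kind of change involved: the traceback fails at `value.tensor.float_val[0]`, which happens when the tensor's `float_val` field is empty, for instance because the value is packed into `tensor_content` as raw bytes instead. The helper below is hypothetical (`scalar_from_tensor` is not an Aim function) and assumes float32 data in little-endian byte order. With TensorFlow installed, `tf.make_ndarray` is the usual way to decode either layout.

```python
import struct
from types import SimpleNamespace

DT_FLOAT = 1  # value of DT_FLOAT in TensorFlow's DataType enum


def scalar_from_tensor(tensor):
    """Return the first float held by a TensorProto-like object, or None.

    Hypothetical helper for illustration; not part of Aim or TensorFlow.
    """
    # Layout 1: the value sits in the repeated `float_val` field.
    if len(tensor.float_val) > 0:
        return float(tensor.float_val[0])
    # Layout 2: the value is packed as raw bytes in `tensor_content`.
    # Assumption: float32 data, little-endian byte order.
    if tensor.dtype == DT_FLOAT and len(tensor.tensor_content) >= 4:
        return struct.unpack_from("<f", tensor.tensor_content)[0]
    # Anything else (empty tensors, non-float data) is skipped, not indexed.
    return None


# Stand-in objects mimicking the two layouts; real code would receive
# `value.tensor` from the event file instead.
inline = SimpleNamespace(dtype=DT_FLOAT, float_val=[0.5], tensor_content=b"")
packed = SimpleNamespace(
    dtype=DT_FLOAT, float_val=[], tensor_content=struct.pack("<f", 0.25)
)
empty = SimpleNamespace(dtype=DT_FLOAT, float_val=[], tensor_content=b"")
```

Returning `None` for the empty case lets the caller skip the value instead of raising `IndexError` partway through a conversion.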
@Sharathmk99 would appreciate that a lot. Regarding the real-time conversion, I guess setting up a cron job that calls the convert command every 5 minutes, for example, would do the trick in the short term. Would that help you?
I'll check that out myself and will let you know if it works as expected.
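For illustration, the same idea can be sketched as a small Python loop in place of a cron entry. The command form `aim convert tensorboard --logdir <dir>` is an assumption here and should be checked against `aim convert tensorboard --help` for the installed version; the function names are hypothetical. Whether repeated conversions pick up only new events is not confirmed in this thread.

```python
import subprocess
import time


def build_convert_command(logdir):
    # Assumed CLI form; verify against the installed Aim version.
    return ["aim", "convert", "tensorboard", "--logdir", logdir]


def convert_periodically(logdir, interval_seconds=300, max_runs=None,
                         run=subprocess.run, sleep=time.sleep):
    """Call the convert command every `interval_seconds` seconds.

    `max_runs=None` loops forever. `run` and `sleep` are injectable so the
    loop can be exercised without Aim installed. Returns the number of calls.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        # check=False: a non-zero exit code from one conversion
        # should not stop the loop.
        run(build_convert_command(logdir), check=False)
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```

Calling `convert_periodically("path/to/tb_logs")` alongside training would re-run the conversion every 5 minutes, matching the interval suggested above.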
It would be better for the Run class to accept a new parameter, sync_tensorboard_dir, that takes the TensorBoard event log directory and starts a separate thread to monitor file events and sync any changes. Every subfolder inside the TensorBoard event log directory would become its own entity.
What do you think @mihran113 ?
I'll open a separate issue to track this request.
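To make the proposal concrete, here is a minimal sketch of the monitoring thread, assuming simple polling of file modification times and sizes. `sync_tensorboard_dir` is only the parameter proposed in this comment, not something this thread shows Aim providing, and `LogdirWatcher` and its `on_change` callback are hypothetical names. What the callback does with the changed files (for example, converting them) is left open.

```python
import os
import threading


def snapshot(logdir):
    """Map every file under `logdir` to its (mtime_ns, size) signature."""
    state = {}
    for root, _dirs, files in os.walk(logdir):
        for name in files:
            path = os.path.join(root, name)
            try:
                stat = os.stat(path)
            except FileNotFoundError:
                continue  # file vanished between walk and stat
            state[path] = (stat.st_mtime_ns, stat.st_size)
    return state


def changed_paths(old, new):
    """Return paths that are new or whose signature differs from `old`."""
    return sorted(p for p, sig in new.items() if old.get(p) != sig)


class LogdirWatcher(threading.Thread):
    """Poll a log directory and report changed files to a callback."""

    def __init__(self, logdir, on_change, poll_seconds=5.0):
        super().__init__(daemon=True)
        self.logdir = logdir
        self.on_change = on_change
        self.poll_seconds = poll_seconds
        self._stop_event = threading.Event()
        self._seen = {}

    def poll_once(self):
        current = snapshot(self.logdir)
        changed = changed_paths(self._seen, current)
        self._seen = current
        if changed:
            self.on_change(changed)
        return changed

    def run(self):
        # Event.wait doubles as an interruptible sleep between polls.
        while not self._stop_event.is_set():
            self.poll_once()
            self._stop_event.wait(self.poll_seconds)

    def stop(self):
        self._stop_event.set()
```

Polling is used here only because it needs nothing beyond the standard library; an OS-level file event API would avoid the polling delay at the cost of an extra dependency.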