botocore.errorfactory.NoSuchKey when old TF Events got deleted

Open shaowei-su opened this issue 1 year ago • 3 comments

Consider Stack Overflow for getting support using TensorBoard—they have a larger community with better searchability:

https://stackoverflow.com/questions/tagged/tensorboard

Do not use this template for setup, installation, or configuration issues. Instead, use the “installation problem” issue template:

https://github.com/tensorflow/tensorboard/issues/new?template=installation_problem.md

To report a problem with TensorBoard itself, please fill out the remainder of this template.

Environment information (required)

Please run diagnose_tensorboard.py (link below) in the same environment from which you normally run TensorFlow/TensorBoard, and paste the output here:

tensorboard==2.9.1

https://raw.githubusercontent.com/tensorflow/tensorboard/master/tensorboard/tools/diagnose_tensorboard.py

For browser-related issues, please additionally specify:

  • Browser type and version (e.g., Chrome 64.0.3282.140):
  • Screenshot, if it’s a visual issue:

Issue description

Please describe the bug as clearly as possible. How can we reproduce the problem without additional resources (including external data files and proprietary Python modules)?

When TensorBoard reads TFEvents files from S3 and older TFEvents files are deleted from the same logdir, the reload thread raises the following exception from event_file_loader:

Exception in thread Reloader 15:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorboard/backend/event_processing/plugin_event_multiplexer.py", line 239, in Worker
    accumulator.Reload()
  File "/usr/local/lib/python3.10/dist-packages/tensorboard/backend/event_processing/plugin_event_accumulator.py", line 183, in Reload
    for event in self._generator.Load():
  File "/usr/local/lib/python3.10/dist-packages/tensorboard/backend/event_processing/directory_watcher.py", line 88, in Load
    for event in self._LoadInternal():
  File "/usr/local/lib/python3.10/dist-packages/tensorboard/backend/event_processing/directory_watcher.py", line 118, in _LoadInternal
    for event in self._loader.Load():
  File "/usr/local/lib/python3.10/dist-packages/tensorboard/backend/event_processing/event_file_loader.py", line 270, in Load
    for event in super(EventFileLoader, self).Load():
  File "/usr/local/lib/python3.10/dist-packages/tensorboard/backend/event_processing/event_file_loader.py", line 244, in Load
    for record in super(LegacyEventFileLoader, self).Load():
  File "/usr/local/lib/python3.10/dist-packages/tensorboard/backend/event_processing/event_file_loader.py", line 178, in Load
    yield next(self._iterator)
  File "/usr/local/lib/python3.10/dist-packages/tensorboard/backend/event_processing/event_file_loader.py", line 109, in __next__
    self._reader.GetNext()
  File "/usr/local/lib/python3.10/dist-packages/tensorboard/compat/tensorflow_stub/pywrap_tensorflow.py", line 207, in GetNext
    header_str = self._read(8)
  File "/usr/local/lib/python3.10/dist-packages/tensorboard/compat/tensorflow_stub/pywrap_tensorflow.py", line 273, in _read
    new_data = self.file_handle.read(n)
  File "/usr/local/lib/python3.10/dist-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 727, in read
    (self.buff, self.continuation_token) = self.fs.read(
  File "/usr/local/lib/python3.10/dist-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 287, in read
    stream = s3.Object(bucket, path).get(**args)["Body"].read()
  File "/usr/local/lib/python3.10/dist-packages/boto3/resources/factory.py", line 520, in do_action
    response = action(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/usr/local/lib/python3.10/dist-packages/botocore/client.py", line 391, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/botocore/client.py", line 719, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.

This exception blocks any new events from being processed. A similar issue is: https://github.com/tensorflow/tensorboard/issues/2634

shaowei-su avatar Jan 02 '24 21:01 shaowei-su

To clarify, is this precisely the same issue as #2634, i.e., deleted events cause a crash instead of being ignored or handled gracefully? And is this just how that issue manifests with the S3 filesystem?

Just to set expectations, support for S3 filesystem is best-effort, so I doubt we'll prioritize this, but I'll check with the team.

arcra avatar Jan 04 '24 05:01 arcra

Ah, and can you clarify if this is also when TensorFlow is not installed, like in #2634? Does installing TensorFlow work around the issue?

arcra avatar Jan 04 '24 05:01 arcra

> To clarify, is this precisely the same issue as https://github.com/tensorflow/tensorboard/issues/2634, i.e., deleted events cause a crash instead of being ignored or handled gracefully? And is this just how that issue manifests with the S3 filesystem?

Yes, this is exactly the same issue, occurring with the S3 filesystem.

> Ah, and can you clarify if this is also when TensorFlow is not installed, like in https://github.com/tensorflow/tensorboard/issues/2634? Does installing TensorFlow work around the issue?

Native TF is not installed in this case, so TensorBoard is using the stub version for I/O operations. Let me try it out with a compatible TF installed. Thanks for the suggestions!
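A quick way to confirm whether native TensorFlow is importable in the environment that runs TensorBoard is sketched below. It uses only the standard library. `native_tensorflow_available` is a hypothetical helper name, and the check only reports importability, not which I/O path TensorBoard ends up using.

```python
import importlib.util


def native_tensorflow_available():
    """Return True if the `tensorflow` package can be found for import."""
    return importlib.util.find_spec("tensorflow") is not None


print("native TensorFlow importable:", native_tensorflow_available())
```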

shaowei-su avatar Jan 04 '24 20:01 shaowei-su