
#tf-data-service Improve error handling for SnapshotManager.

Open copybara-service[bot] opened this issue 2 years ago • 0 comments

If the snapshot manager receives an error from a worker:

  1. It writes the error as a StatusProto to an ERROR file, so the error status can be recovered if the dispatcher restarts.
  2. It propagates the status to the other workers by returning it in response to their RPCs. Those workers then cancel the snapshot.
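
The two steps above can be sketched roughly as follows. This is a minimal illustrative model in Python, not the TensorFlow implementation: the `Status` dataclass stands in for a StatusProto, JSON stands in for protobuf serialization, and `SnapshotManagerSketch`, `worker_heartbeat`, and `demo` are hypothetical names chosen for this example.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Status:
    """Stand-in for a StatusProto: an error code plus a message."""
    code: int
    message: str

    def is_ok(self) -> bool:
        return self.code == 0


OK = Status(0, "")


class SnapshotManagerSketch:
    """Hypothetical model of the error path described above."""

    ERROR_FILE = "ERROR"

    def __init__(self, snapshot_dir: Path) -> None:
        self._error_path = snapshot_dir / self.ERROR_FILE
        # On (re)start, recover a previously persisted error, if any.
        self._status = self._load_error()

    def _load_error(self) -> Status:
        if not self._error_path.exists():
            return OK
        data = json.loads(self._error_path.read_text())
        return Status(data["code"], data["message"])

    def worker_heartbeat(self, worker_status: Status = OK) -> Status:
        """Handles one worker RPC and returns the snapshot's status."""
        if self._status.is_ok() and not worker_status.is_ok():
            # Step 1: persist the first error so it survives a restart.
            payload = {"code": worker_status.code,
                       "message": worker_status.message}
            self._error_path.write_text(json.dumps(payload))
            self._status = worker_status
        # Step 2: every caller receives the stored status; a worker that
        # sees a non-OK status is expected to cancel its part of the snapshot.
        return self._status


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        snapshot_dir = Path(tmp)
        manager = SnapshotManagerSketch(snapshot_dir)
        # Worker A reports a failure (13 is an arbitrary non-zero code here).
        manager.worker_heartbeat(Status(13, "worker A failed to write a chunk"))
        # Worker B makes an ordinary RPC and gets the error back.
        seen_by_b = manager.worker_heartbeat()
        # A "restarted dispatcher" rebuilds its state from the ERROR file.
        restarted = SnapshotManagerSketch(snapshot_dir)
        seen_after_restart = restarted.worker_heartbeat()
        return seen_by_b, seen_after_restart
```

In this sketch only the first error is persisted; whether the real SnapshotManager keeps the first error or overwrites it with later ones is not stated in the description, so treat that choice as an assumption.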

copybara-service[bot] Mar 04 '23 02:03