xla
#tf-data-service Improve error handling for SnapshotManager.
If the snapshot manager receives an error from a worker:
- It writes a StatusProto to an ERROR file. The error status can be recovered if the dispatcher restarts.
- It propagates the status to the other workers by returning it in response to their RPCs. Those workers then cancel the snapshot.
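
The two behaviors above can be illustrated with a minimal Python sketch. This is not the actual SnapshotManager code. The names `Status`, `SnapshotManagerSketch`, `report_worker_error`, and `worker_heartbeat` are hypothetical, JSON stands in for the serialized StatusProto, and keeping only the first reported error and writing via a temporary file are assumptions made for the illustration.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Status:
    """Hypothetical stand-in for a StatusProto: an error code plus a message."""
    code: int
    message: str


class SnapshotManagerSketch:
    """Toy model of the error flow described above (not the real API)."""

    def __init__(self, snapshot_dir: str):
        self._error_path = os.path.join(snapshot_dir, "ERROR")
        # On (re)start, recover a previously persisted error, if any.
        self._error: Optional[Status] = self._load_error()

    def _load_error(self) -> Optional[Status]:
        if not os.path.exists(self._error_path):
            return None
        with open(self._error_path, "r", encoding="utf-8") as f:
            data = json.load(f)
        return Status(code=data["code"], message=data["message"])

    def report_worker_error(self, status: Status) -> None:
        """Persist an error reported by a worker to the ERROR file."""
        if self._error is not None:
            return  # Assumption: the first reported error wins.
        # Assumption: write to a temp file, then rename, so a crash
        # mid-write does not leave a partial ERROR file behind.
        tmp_path = self._error_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump({"code": status.code, "message": status.message}, f)
        os.replace(tmp_path, self._error_path)
        self._error = status

    def worker_heartbeat(self, worker_id: str) -> Optional[Status]:
        """Return the stored error, if any, as the reply to a worker RPC.

        A worker that receives a non-None status would cancel the snapshot.
        """
        return self._error


def demo() -> Optional[Status]:
    """Report an error, simulate a restart, and read the error back."""
    snapshot_dir = tempfile.mkdtemp()
    manager = SnapshotManagerSketch(snapshot_dir)
    manager.report_worker_error(Status(code=13, message="worker failed"))
    # A new instance over the same directory models a dispatcher restart.
    restarted = SnapshotManagerSketch(snapshot_dir)
    return restarted.worker_heartbeat("worker-2")
```

In this sketch, `demo()` returns the status that was persisted before the simulated restart, which is the recovery property the ERROR file is meant to provide.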