cassandra-medusa
gRPC server's BackupStatus cancels AsyncBackup's future
We recently made AsyncBackup truly asynchronous, and then went and "fixed" the BackupStatus endpoint.
However, this fix introduced a bug: calling register_backup in the status endpoint is not correct. Re-registering the backup removes the previously registered backup and cancels its pending future.
As a result, the running gRPC server instance (specifically its backupman) never gets to handle the completion of that future.
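To make the mechanism concrete, here is a minimal asyncio sketch. `BackupRegistry` and its `register_backup` are hypothetical stand-ins modelled on the behaviour described above, not Medusa's actual backupman code, and `record_backup_info` only mimics the callback named in the traceback. Registering the same backup name a second time cancels the first future, and the done callback then finds a cancelled future instead of a completed one.

```python
import asyncio


class BackupRegistry:
    """Hypothetical, simplified stand-in for backupman's bookkeeping."""

    def __init__(self):
        self._futures = {}  # backup name -> pending future

    def register_backup(self, name, future):
        # Registering a name again removes the old entry and cancels its future.
        previous = self._futures.pop(name, None)
        if previous is not None and not previous.done():
            previous.cancel(msg=f"Removal of backup requested. Cancelling backup Name: {name}")
        self._futures[name] = future


observed = []


def record_backup_info(future):
    # Mimics the failing callback: exception() raises CancelledError when the
    # future was cancelled, so we catch it here to record what happened.
    try:
        observed.append("failed" if future.exception() else "succeeded")
    except asyncio.CancelledError as exc:
        observed.append(f"CancelledError: {exc}")


async def main():
    loop = asyncio.get_running_loop()
    registry = BackupRegistry()
    backup_future = loop.create_future()
    backup_future.add_done_callback(record_backup_info)
    registry.register_backup("backup-1", backup_future)         # AsyncBackup registers the backup
    registry.register_backup("backup-1", loop.create_future())  # BackupStatus re-registers it
    await asyncio.sleep(0)  # give the scheduled done callback a chance to run


asyncio.run(main())
print(observed)
```

The callback never sees the backup's real outcome; it only sees the cancellation triggered by the second registration.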
The backup itself is not cancelled. I've observed this on a cluster, where an exception was thrown:
```
[2024-04-11 09:00:16,131] INFO: Recording async backup information.
[2024-04-11 09:00:16,131] ERROR: Exception in callback record_backup_info(<Future cancelled>) at /home/cassandra/medusa/service/grpc/server.py:346
handle: <Handle record_backup_info(<Future cancelled>) at /home/cassandra/medusa/service/grpc/server.py:346>
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/cassandra/medusa/service/grpc/server.py", line 349, in record_backup_info
    if future.exception():
asyncio.exceptions.CancelledError: Removal of backup requested. Cancelling backup Name: backup-2024-04-11t08-59-44-564z with done state: False
```
Nonetheless, the backup continued and completed successfully.
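One plausible explanation for why the backup still completed, assuming the backup work runs in a thread-pool executor behind an asyncio future (not confirmed against Medusa's code): cancelling the asyncio future only marks that wrapper as cancelled, and a `concurrent.futures` job that is already running cannot be cancelled, so the worker thread carries on. The following sketch demonstrates this with a hypothetical `run_backup` stand-in for the real backup work.

```python
import asyncio
import threading
import time

backup_started = threading.Event()
backup_finished = threading.Event()


def run_backup():
    # Hypothetical stand-in for the blocking backup work.
    backup_started.set()
    time.sleep(0.2)
    backup_finished.set()
    return "backup succeeded"


async def main():
    loop = asyncio.get_running_loop()
    # run_in_executor wraps a concurrent.futures.Future in an asyncio future.
    backup_future = loop.run_in_executor(None, run_backup)
    while not backup_started.is_set():  # wait until the worker thread is running
        await asyncio.sleep(0.01)
    backup_future.cancel()  # cancels the asyncio wrapper; the running job is not interrupted
    while not backup_finished.is_set():  # the thread keeps going regardless
        await asyncio.sleep(0.01)
    return backup_future.cancelled(), backup_finished.is_set()


cancelled, finished = asyncio.run(main())
print(f"future cancelled: {cancelled}, backup finished: {finished}")
```

Both flags end up true: the future reports cancellation while the work it was tracking finishes anyway, which matches the symptom in the log.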
We need to fix this, probably by not registering but "re-registering", which should probably be a new method that registers the backup with backupman without cancelling any futures.
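A minimal sketch of the proposed direction, under stated assumptions: `BackupRegistry` is again a simplified stand-in rather than backupman's real API, and `re_register_backup` is a hypothetical name for the new method. It records the backup but leaves any already-attached future untouched. Independently of that change, the callback here checks `future.cancelled()` before calling `future.exception()`, because `Future.exception()` raises `CancelledError` on a cancelled future, which is exactly what crashed `record_backup_info` in the log above.

```python
import asyncio


class BackupRegistry:
    """Hypothetical, simplified stand-in for backupman's bookkeeping."""

    def __init__(self):
        self._futures = {}  # backup name -> future (or None if unknown)

    def register_backup(self, name, future):
        # Existing behaviour: replacing an entry cancels the previous future.
        previous = self._futures.get(name)
        if previous is not None and previous is not future and not previous.done():
            previous.cancel()
        self._futures[name] = future

    def re_register_backup(self, name, future=None):
        # Hypothetical new method: make sure the backup is known, but never
        # replace or cancel a future that is already being tracked.
        self._futures.setdefault(name, future)

    def future_for(self, name):
        return self._futures.get(name)


outcomes = []


def record_backup_info(future):
    # Check cancelled() first so the callback never raises CancelledError.
    if future.cancelled():
        outcomes.append("cancelled")
    elif future.exception() is not None:
        outcomes.append("failed")
    else:
        outcomes.append("succeeded")


async def main():
    loop = asyncio.get_running_loop()
    registry = BackupRegistry()
    backup_future = loop.create_future()
    backup_future.add_done_callback(record_backup_info)
    registry.register_backup("backup-1", backup_future)  # AsyncBackup
    registry.re_register_backup("backup-1")              # BackupStatus
    backup_future.set_result("done")                     # the backup completes
    await asyncio.sleep(0)                               # let the done callback run
    return registry.future_for("backup-1") is backup_future


same_future = asyncio.run(main())
print(outcomes, same_future)
```

With re-registration no longer cancelling anything, the original future stays in the registry and its completion is handled normally.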