k8ssandra-operator
Doing a backup after a restore fails
What happened? While testing DSE support, I ran into an issue where a backup of an already restored cluster does not complete. I did the following:
- Used the most recent medusa (with the DSE snapshot recursion bug) (`DefaultMedusaVersion = "c8609c8-tmp"`)
- Configured a bigger minio volume (30G)
- Made a DSE cluster with `make single-up ; E2E_TEST="TestOperator/CreateSingleDseSearchDatacenterCluster" make e2e-test`, but killed it before it created any backups.
- Ran a backup with 1 node cluster.
- Scaled the cluster to 3 nodes.
- Started up an ubuntu pod, installed tlp-stress, loaded some data
- Ran a few more backups
- Created an index, tested a query.
- Did one more backup.
- Did a restore.
- Confirmed the data is back.
- Rebuilt the search index, verified the search works again.
- Did another backup, which failed. 1 node completed, 1 failed mid-way, 1 never started.
On the failing node, there was this in the medusa log:
# a lot of
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: minio-service.minio.svc.cluster.local
# then
ERROR:root:Error occurred during backup: Connection was closed before we received a valid response from endpoint URL: "http://minio-service.minio.svc.cluster.local:9000/k8ssandra-medusa/test/test-dc1-default-sts-2/data/tlp_stress/sensor_data-41ab3cd0b60f11ee8f9665415b594bf9/bb-244-bti-Data.db?uploadId=NmM2ZDlmOGYtYWZlYy00MzlhLThmMWMtYzE5NGNkMzAwMDBmLjU3MWFiMDg4LWFjMGMtNGRkZC1hOGJjLTc4YmFlMzdmMWNiMQ&partNumber=194".
[2024-01-18 15:22:03,162] ERROR: Error occurred during backup: Connection was closed before we received a valid response from endpoint URL: "http://minio-service.minio.svc.cluster.local:9000/k8ssandra-medusa/test/test-dc1-default-sts-2/data/tlp_stress/sensor_data-41ab3cd0b60f11ee8f9665415b594bf9/bb-244-bti-Data.db?uploadId=NmM2ZDlmOGYtYWZlYy00MzlhLThmMWMtYzE5NGNkMzAwMDBmLjU3MWFiMDg4LWFjMGMtNGRkZC1hOGJjLTc4YmFlMzdmMWNiMQ&partNumber=194".
Traceback (most recent call last):
File "/home/cassandra/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/home/cassandra/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 392, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 96, in request
rval = super().request(method, url, body, headers, *args, **kwargs)
File "/usr/lib/python3.10/http/client.py", line 1283, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 130, in _send_output
self._handle_expect_response(message_body)
File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
self._send_message_body(message_body)
File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
self.send(message_body)
File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 223, in send
return super().send(str)
File "/usr/lib/python3.10/http/client.py", line 995, in send
self.sock.sendall(datablock)
ConnectionResetError: [Errno 104] Connection reset by peer
So it seems like a closed connection, but it's unclear where in Medusa we retry to handle this.
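To make the retry question concrete, here is a minimal sketch of what retrying a single upload call on a reset connection could look like. This is not Medusa's code: `retry_on_reset` and `FlakyUpload` are hypothetical names invented for illustration, and the attempt count and delays are arbitrary. In the real stack the low-level `ConnectionResetError` seems to surface wrapped by botocore (the log message matches its `ConnectionClosedError`), so an actual fix would likely need to catch that exception as well.

```python
import random
import time

# Low-level errors that usually mean "the peer dropped the connection".
RETRYABLE = (ConnectionResetError, ConnectionAbortedError, BrokenPipeError, TimeoutError)


def retry_on_reset(operation, attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are exhausted.

    Waits with exponential backoff plus jitter between attempts and
    re-raises the last error once the final attempt has failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except RETRYABLE:
            if attempt == attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))


class FlakyUpload:
    """Hypothetical stand-in for uploading one part; fails a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionResetError(104, "Connection reset by peer")
        return "uploaded"


if __name__ == "__main__":
    upload = FlakyUpload(failures=2)
    # sleep is stubbed out so the demo does not actually wait.
    print(retry_on_reset(upload, sleep=lambda seconds: None), "after", upload.calls, "calls")
```

The sleep function is injectable only so the behaviour can be exercised without real waiting; whether a retry belongs at the per-part level or around the whole file upload is an open design question for this issue.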
Doing another backup after this does not work. The operator reports a started backup job, but medusa-status does not recognise the backup.
Did you expect to see something different?
How to reproduce it (as minimally and precisely as possible):
Environment
- K8ssandra Operator version:
Insert image tag or Git SHA here
- Kubernetes version information:
kubectl version
- Kubernetes cluster kind:
insert how you created your cluster: kops, bootkube, etc.
- Manifests:
insert manifests relevant to the issue
- K8ssandra Operator Logs:
insert K8ssandra Operator logs relevant to the issue here
Anything else we need to know?: