k8ssandra-operator Doing a backup after a restore fails

Doing a backup after a restore fails

Open rzvoncek opened this issue 1 year ago • 0 comments

What happened? While testing the DSE support, I ran into an issue where a backup of an already restored cluster fails. I did the following:

Used the most recent medusa (with the DSE snapshot recursion bug) (DefaultMedusaVersion = "c8609c8-tmp")
Configured a bigger minio volume (30G)
Made a DSE cluster with make single-up ; E2E_TEST="TestOperator/CreateSingleDseSearchDatacenterCluster" make e2e-test, but killed the test before it created any backups.
Ran a backup with a 1-node cluster.
Scaled the cluster to 3 nodes.
Started up an ubuntu pod, installed tlp-stress, loaded some data
Ran a few more backups
Created an index, tested a query.
Did one more backup.
Did a restore.
Confirmed the data is back.
Rebuilt the search index, verified the search works again.
Did another backup, which failed. 1 node completed, 1 failed mid-way, 1 never started.

On the failing node, there was this in the medusa log:

# a lot of
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: minio-service.minio.svc.cluster.local
# then
ERROR:root:Error occurred during backup: Connection was closed before we received a valid response from endpoint URL: "http://minio-service.minio.svc.cluster.local:9000/k8ssandra-medusa/test/test-dc1-default-sts-2/data/tlp_stress/sensor_data-41ab3cd0b60f11ee8f9665415b594bf9/bb-244-bti-Data.db?uploadId=NmM2ZDlmOGYtYWZlYy00MzlhLThmMWMtYzE5NGNkMzAwMDBmLjU3MWFiMDg4LWFjMGMtNGRkZC1hOGJjLTc4YmFlMzdmMWNiMQ&partNumber=194".
[2024-01-18 15:22:03,162] ERROR: Error occurred during backup: Connection was closed before we received a valid response from endpoint URL: "http://minio-service.minio.svc.cluster.local:9000/k8ssandra-medusa/test/test-dc1-default-sts-2/data/tlp_stress/sensor_data-41ab3cd0b60f11ee8f9665415b594bf9/bb-244-bti-Data.db?uploadId=NmM2ZDlmOGYtYWZlYy00MzlhLThmMWMtYzE5NGNkMzAwMDBmLjU3MWFiMDg4LWFjMGMtNGRkZC1hOGJjLTc4YmFlMzdmMWNiMQ&partNumber=194".
Traceback (most recent call last):
  File "/home/cassandra/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/home/cassandra/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 96, in request
    rval = super().request(method, url, body, headers, *args, **kwargs)
  File "/usr/lib/python3.10/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 130, in _send_output
    self._handle_expect_response(message_body)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
    self._send_message_body(message_body)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
    self.send(message_body)
  File "/home/cassandra/.local/lib/python3.10/site-packages/botocore/awsrequest.py", line 223, in send
    return super().send(str)
  File "/usr/lib/python3.10/http/client.py", line 995, in send
    self.sock.sendall(datablock)
ConnectionResetError: [Errno 104] Connection reset by peer

So it looks like the connection was closed by the peer mid-upload, but it's unclear where in Medusa we retry to handle this.
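To make the question about retries concrete, here is a minimal sketch of retrying a failed upload with exponential backoff. It is not Medusa's code: retry_on_connection_error and make_flaky_upload are hypothetical names, and the fake upload only simulates the "Connection reset by peer" error from the traceback. In the log above the reset is surfaced through botocore with its own "Connection was closed..." error, so real code would also have to catch that exception type, not just the built-in ConnectionError used here.

```python
import random
import time


def retry_on_connection_error(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation(); retry with exponential backoff on connection failures.

    Hypothetical helper for illustration only; not part of Medusa.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:  # ConnectionResetError is a subclass
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            # Double the delay on each attempt and add a little jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))


def make_flaky_upload(failures):
    """Build a stand-in for a part upload that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def upload_part():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionResetError(104, "Connection reset by peer")
        return "part uploaded after %d attempts" % state["calls"]

    return upload_part
```

Calling retry_on_connection_error(make_flaky_upload(2)) would ride out two resets and return on the third attempt. If the failures outlast max_attempts, the last ConnectionResetError propagates to the caller, which is roughly what the failing node reported.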

Doing another backup after this does not work. The operator reports a started backup job, but medusa-status does not recognise the backup.

Jan 18 '24 15:01 rzvoncek
