
Loader node unavailable with Error: EC2RoleRequestError: no EC2 instance role found [Gemini OOM]

cezarmoise opened this issue 10 months ago

Packages

Scylla version: 2024.2.4-20250124.2bc4ec25a8db with build-id 600f7eab617a1f6b1919ae67f4164635887a00ee

Kernel Version: 5.15.0-1076-aws

Issue description

Multiple nemeses fail because the loader node is unavailable: "Failed to connect in 60 seconds, last error: (ConnectError)Error connecting to host '10.12.3.73:22' - timed out".

The loader log ends at 2025-01-24T08:21:53.642+00:00; the errors start later.
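For reference, the gap between the last loader log entry and the first DisruptionEvent error below can be computed directly from the two timestamps in this report (a small standalone sketch, not part of SCT):

```python
from datetime import datetime

# Timestamps taken from this issue: last loader log line vs. first nemesis error.
last_loader_log = datetime.fromisoformat("2025-01-24T08:21:53.642+00:00")
first_error = datetime.fromisoformat("2025-01-24T09:38:07.150+00:00")

gap = first_error - last_loader_log
print(gap)  # about 1 hour 16 minutes between the node going silent and the error
```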

2025-01-24 09:38:07.150: (DisruptionEvent Severity.ERROR) period_type=end event_id=594d99c3-2fc6-47a6-9045-52a8b03a71cc duration=7m0s: nemesis_name=Truncate target_node=Node gemini-with-nemesis-3h-normal-2024--db-node-bdf78932-1 [18.207.106.123 | 10.12.0.68] errors=Failed to run a command due to exception!
Command: 'touch $HOME/cs-hdr-write-l1-c0-k1-728e46c6-9f93-4eb1-a51b-01f7e03e6f22.hdr'
Stdout:
Stderr:
Exception:  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 593, in run
self.connect()
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 529, in connect
raise ConnectTimeout(ex_msg) from exc
Failed to connect in 60 seconds, last error: (ConnectError)Error connecting to host '10.12.3.73:22' - timed out
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 417, in _init_socket
self.sock.connect((host, port))
TimeoutError: timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 520, in connect
self._connect()
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 535, in _connect
self._init_socket(self.host, self.port)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 422, in _init_socket
raise ConnectError("Error connecting to host '%s:%s' - %s" % (host, port, str(error_type))) from ex
sdcm.remote.libssh2_client.exceptions.ConnectError: Error connecting to host '10.12.3.73:22' - timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 593, in run
self.connect()
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 529, in connect
raise ConnectTimeout(ex_msg) from exc
sdcm.remote.libssh2_client.exceptions.ConnectTimeout: Failed to connect in 60 seconds, last error: (ConnectError)Error connecting to host '10.12.3.73:22' - timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 605, in _run
return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 538, in _run_execute
result = connection.run(**command_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 596, in run
return self._complete_run(
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 655, in _complete_run
raise exception
sdcm.remote.libssh2_client.exceptions.FailedToRunCommand: Failed to run a command due to exception!
Command: 'touch $HOME/cs-hdr-write-l1-c0-k1-728e46c6-9f93-4eb1-a51b-01f7e03e6f22.hdr'
Stdout:
Stderr:
Exception:  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 593, in run
self.connect()
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 529, in connect
raise ConnectTimeout(ex_msg) from exc
Failed to connect in 60 seconds, last error: (ConnectError)Error connecting to host '10.12.3.73:22' - timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5309, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2047, in disrupt_truncate
self._prepare_test_table(ks=keyspace_truncate)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2014, in _prepare_test_table
cs_thread.verify_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 482, in verify_results
results = super().get_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress/base.py", line 94, in get_results
results.append(future.result())
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 348, in _run_cs_stress
loader.remoter.run(f"touch $HOME/{remote_hdr_file_name}", ignore_status=False, verbose=False)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 614, in run
result = _run()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 72, in inner
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 607, in _run
if self._run_on_retryable_exception(exc, new_session):
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_libssh_cmd_runner.py", line 78, in _run_on_retryable_exception
raise RetryableNetworkException(str(exc), original=exc)
sdcm.remote.base.RetryableNetworkException: Failed to run a command due to exception!
Command: 'touch $HOME/cs-hdr-write-l1-c0-k1-728e46c6-9f93-4eb1-a51b-01f7e03e6f22.hdr'
Stdout:
Stderr:
Exception:  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 593, in run
self.connect()
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 529, in connect
raise ConnectTimeout(ex_msg) from exc
Failed to connect in 60 seconds, last error: (ConnectError)Error connecting to host '10.12.3.73:22' - timed out
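The traceback bottoms out in a plain TCP connect timeout to port 22, i.e. the loader was unreachable at the network level, not an SSH authentication problem. A minimal standalone sketch of the same reachability check (hypothetical helper, not part of SCT):

```python
import socket

def ssh_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This mirrors what libssh2_client's _init_socket does before any SSH
    handshake: if this fails, the problem is at the network/instance level.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers both 'timed out' and 'connection refused'
        return False
```

An unreachable-but-routable host yields the "timed out" variant seen above; a host that is up but with sshd down yields "connection refused" instead.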

Loader log

2025-01-24T07:51:36.865+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]: 2025-01-24 07:51:36 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. Error: EC2RoleRequestError: no EC2 instance role found
2025-01-24T07:51:36.865+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]: caused by: EC2MetadataError: failed to make EC2Metadata request
2025-01-24T07:51:36.865+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]: <?xml version="1.0" encoding="iso-8859-1"?>
2025-01-24T07:51:36.865+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
2025-01-24T07:51:36.865+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]: 		 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
2025-01-24T07:51:36.869+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]: <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
2025-01-24T07:51:36.869+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]:  <head>
2025-01-24T07:51:36.869+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]:   <title>404 - Not Found</title>
2025-01-24T07:51:36.869+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]:  </head>
2025-01-24T07:51:36.869+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]:  <body>
2025-01-24T07:51:36.869+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]:   <h1>404 - Not Found</h1>
2025-01-24T07:51:36.869+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]:  </body>
2025-01-24T07:51:36.869+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]: </html>
2025-01-24T07:51:36.869+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]: 	status code: 404, request id:
2025-01-24T07:51:36.869+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]: 2025-01-24 07:51:36 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 797456418907
2025-01-24T07:51:36.869+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | amazon-ssm-agent.amazon-ssm-agent[486]: 	status code: 400, request id: 58e4f863-d2b6-46a8-8372-85eb10bd7c0f
2025-01-24T07:55:29.865+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | sshd[6019]: Timeout, client not responding from user ubuntu 10.12.0.59 port 52472
2025-01-24T07:55:29.865+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | sshd[5942]: pam_unix(sshd:session): session closed for user ubuntu
2025-01-24T07:55:29.865+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | systemd[1]: session-7.scope: Deactivated successfully.
2025-01-24T07:55:29.865+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | systemd-logind[490]: Session 7 logged out. Waiting for processes to exit.
2025-01-24T07:55:29.866+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | systemd-logind[490]: Removed session 7.
2025-01-24T08:21:53.642+00:00 gemini-with-nemesis-3h-normal-2024--loader-node-bdf78932-1     !INFO | -- MARK --
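The amazon-ssm-agent EC2RoleRequestError block above appears to be agent noise (the instance has no SSM instance role attached) and predates the node going silent, so when triaging such logs it can help to filter it out first. A minimal sketch using abridged lines from this log:

```python
# Abridged sample resembling the loader log above.
sample = """\
2025-01-24T07:51:36.865+00:00 loader-node-1 !INFO | amazon-ssm-agent.amazon-ssm-agent[486]: Error: EC2RoleRequestError: no EC2 instance role found
2025-01-24T07:55:29.865+00:00 loader-node-1 !INFO | sshd[6019]: Timeout, client not responding from user ubuntu 10.12.0.59 port 52472
"""

def drop_ssm_noise(text: str) -> list[str]:
    """Return log lines that do not come from amazon-ssm-agent."""
    return [line for line in text.splitlines() if "amazon-ssm-agent" not in line]

for line in drop_ssm_noise(sample):
    print(line)  # only the sshd line survives
```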

Installation details

Cluster size: 4 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • gemini-with-nemesis-3h-normal-2024--oracle-db-node-bdf78932-1 (44.203.10.239 | 10.12.1.228) (shards: 30)
  • gemini-with-nemesis-3h-normal-2024--db-node-bdf78932-5 (44.203.226.195 | 10.12.1.232) (shards: 7)
  • gemini-with-nemesis-3h-normal-2024--db-node-bdf78932-4 (3.236.190.217 | 10.12.0.161) (shards: 7)
  • gemini-with-nemesis-3h-normal-2024--db-node-bdf78932-3 (44.202.185.17 | 10.12.0.229) (shards: 7)
  • gemini-with-nemesis-3h-normal-2024--db-node-bdf78932-2 (44.193.199.109 | 10.12.2.53) (shards: 7)
  • gemini-with-nemesis-3h-normal-2024--db-node-bdf78932-1 (18.207.106.123 | 10.12.0.68) (shards: 7)

OS / Image: ami-056edd672f7577fac (NO RUNNER: NO RUNNER)

Test: gemini-3h-with-nemesis-test
  • Test id: bdf78932-10bb-43ca-8446-a5785a24e888
  • Test name: enterprise-2024.2/gemini/gemini-3h-with-nemesis-test
  • Test method: gemini_test.GeminiTest.test_load_random_with_nemesis
  • Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor bdf78932-10bb-43ca-8446-a5785a24e888
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs bdf78932-10bb-43ca-8446-a5785a24e888

Logs:

  • Jenkins job URL
  • Argus

cezarmoise avatar Jan 27 '25 16:01 cezarmoise

This happened due to a memory leak, probably in gemini.

[image attachment]

soyacz avatar Jan 28 '25 08:01 soyacz
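A leak like the one diagnosed above could be caught before sshd stops responding by watching MemAvailable on the loader. A minimal, hypothetical sketch (not part of SCT) reading /proc/meminfo:

```python
def meminfo_kib(field: str, path: str = "/proc/meminfo") -> int:
    """Return a /proc/meminfo field (e.g. 'MemAvailable') in KiB."""
    with open(path) as f:
        for line in f:
            if line.startswith(field + ":"):
                # Lines look like: "MemAvailable:    8192 kB"
                return int(line.split()[1])
    raise KeyError(field)
```

Something like this could run in a loop on the loader and raise a warning event once available memory drops below a threshold, instead of the node silently becoming unreachable.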

It's gemini for sure, but we are not planning to backport the new gemini version with the fixes to that branch.

fruch avatar Feb 24 '25 08:02 fruch

@CodeLieutenant, putting this into the queue as a reminder of the release backports we need.

fruch avatar Mar 13 '25 10:03 fruch

@cezarmoise @fruch @soyacz I'm closing this one, as gemini has been backported to all relevant branches and no leak is seen. If one shows up, we can open a new issue or reopen this one.

CodeLieutenant avatar Jun 11 '25 13:06 CodeLieutenant