[K8S][Nemesis] SCT considers an old DB log line as a new one and creates false ERROR events based on it later running other nemesis
Prerequisites
- [ ] Are you rebased to master?
- [ ] Is it reproducible?
- [ ] Did you perform a cursory search to check whether this issue is already opened?
Versions
- SCT: master
- scylla: 2023.1.1-20230906.f4633ec973b0
Logs
Description
If we run the disrupt_drain_kubernetes_node_then_replace_scylla_node nemesis, then we expect the following DB error log line to appear:
ERROR 2023-11-12 01:09:06,980 [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)
It is expected because the Scylla process gets interrupted during the init phase.
We ignore it with the following code in SCT:
self.log.info('Mark %s (uid=%s) to be replaced', node, old_uid)
node.wait_for_svc()
with DbEventsFilter(
        db_event=DatabaseLogEvent.DATABASE_ERROR,
        # NOTE: ignore following expected error messages:
        # 'init - Startup failed: seastar::sleep_aborted (Sleep is aborted)'
        # 'init - Startup failed: seastar::gate_closed_exception (gate closed)'
        line="init - Startup failed: seastar"):
    node.mark_to_be_replaced()
    self._kubernetes_wait_till_node_up_after_been_recreated(node, old_uid=old_uid)
But later, running the disrupt_terminate_kubernetes_host_then_decommission_and_add_scylla_node and the disrupt_drain_kubernetes_node_then_decommission_and_add_scylla_node nemeses, we get the following false SCT events:
2023-11-12 02:03:29.905: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=6ad456dd-b19d-4e5f-8f39-a30aed4b3ce6 during_nemesis=TerminateKubernetesHostThenDecommissionAndAddScyllaNode: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=4913 node=sct-cluster-2-eu-north-1-rack-1-2
ERROR 2023-11-12 01:09:06,980 [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)
...
2023-11-12 02:26:16.605: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=6ad456dd-b19d-4e5f-8f39-a30aed4b3ce6 during_nemesis=DrainKubernetesNodeThenDecommissionAndAddScyllaNode: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=4913 node=sct-cluster-2-eu-north-1-rack-1-2
ERROR 2023-11-12 01:09:06,980 [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)
In the above SCT events we can see that the old DB log line from 01:09:06 was used to create the false ERROR events.
Steps to Reproduce
- Run the disrupt_drain_kubernetes_node_then_replace_scylla_node nemesis somewhere in the middle of a test run
- Run other nemeses later, such as disrupt_terminate_kubernetes_host_then_decommission_and_add_scylla_node and disrupt_drain_kubernetes_node_then_decommission_and_add_scylla_node
- See false SCT ERROR events
Expected behavior: SCT must not consider an old DB log line as a new one and must not create ERROR events based on it.
Actual behavior: SCT creates false ERROR events based on old, expected DB log lines.
This is weird; from the SCT logs, it looks like we have two threads reading the same log...
We can see:
< t:2023-11-12 02:03:23,578 f:remote_logger.py l:329 c:K8sClientLogger p:INFO > Starting logger for pod sct-cluster-2-eu-north-1-rack-1-2
which probably comes from a logger re-initialization, and this caused self._last_log_timestamp to be reset to an empty value. In that case we re-read the whole log.
The strange thing is that it didn't append to the db log; it looks like it was overwritten (even though we open the log in append mode).
I don't know yet why the logger was re-initialized; I'll try to get back to it tomorrow.
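To make the suspected failure mode concrete, here is a minimal illustrative sketch (hypothetical names, not the actual K8sClientLogger/DbLogReader code) of how filtering "new" lines by a last-seen timestamp re-emits the whole log once that watermark is reset to an empty value:

# Illustrative only: a reader that treats every line newer than a stored
# timestamp watermark as "new".  Assumes (for the sketch) that each line
# starts with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp.
def read_new_lines(log_path, last_log_timestamp):
    """Return lines newer than last_log_timestamp plus the updated watermark."""
    new_lines = []
    with open(log_path) as log_file:
        for line in log_file:
            line_ts = line[:23]  # lexicographic comparison works for this format
            if not last_log_timestamp or line_ts > last_log_timestamp:
                new_lines.append(line)
                last_log_timestamp = line_ts
    return new_lines, last_log_timestamp

# If the logger gets re-initialized and the watermark is reset to "",
# the next call returns every line again -- including the old
# "init - Startup failed: seastar::sleep_aborted" error from 01:09:06.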
Ok, I understand the issue.
The problem is that we decommission a node (k8s) and the new one gets the same name. This causes a new sdcm.db_log_reader.DbLogReader to be created, which opens the already existing log and analyzes it for events.
It is not a recent regression; a previous job passed with the same doubled logs (but I don't know why we started failing on it now).
Currently I see 2 options to fix it:
- make new nodes in k8s get a different name (I don't know if this is under our control, @vponomaryov ?)
- make DbLogReader start reading from some pointer (see the sketch right after this comment)
I'm open to other options
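A minimal sketch of the second option, assuming a hypothetical reader class (this is not the real sdcm.db_log_reader.DbLogReader interface): when the reader is attached to a log file that already exists, it remembers a byte offset and starts from the current end of file instead of offset 0, so pre-existing lines are never re-analyzed.

import os

class OffsetAwareLogReader:
    """Hypothetical reader that never re-reads lines written before it attached."""

    def __init__(self, log_path, start_from_end=True):
        self._log_path = log_path
        # If the file already exists (same node/pod name reused), start at EOF.
        if start_from_end and os.path.exists(log_path):
            self._offset = os.path.getsize(log_path)
        else:
            self._offset = 0

    def read_new_lines(self):
        with open(self._log_path) as log_file:
            log_file.seek(self._offset)
            lines = log_file.readlines()
            self._offset = log_file.tell()
        return lines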
One more option is to move the file, so we won't lose it, but still start on a fresh file.
It's a bit more confusing, since it's not the same as with VMs, and we'll need to keep in mind that we have multiple logs under the same node name.
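A rough sketch of this "move the file aside" option, with a hypothetical helper name: before a recreated node reuses the same log path, rename the old file with a suffix so its history is kept but the new reader starts on a fresh, empty file.

import os
import time

def rotate_node_log(log_path):
    """Rename an existing log to <log_path>.<epoch> and return the rotated name."""
    if not os.path.exists(log_path):
        return None
    rotated = "{}.{}".format(log_path, int(time.time()))
    os.rename(log_path, rotated)
    return rotated

If this route is taken, the k8s log collector would also need to pick up the rotated files so they are not lost, as the follow-up below notes.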
On VMs, after decommission we get a new node, so we don't adhere to it anyway. Seems easy to implement, but we need to remember to also update the log collector for k8s.
@soyacz , @fruch this bug is one more reason to finally implement:
- https://github.com/scylladb/qa-tasks/issues/1148
It will solve the re-read problem as well as the log read lags which cause our log-dependent nemeses to fail on a regular basis.
It would solve it only if it won't be writing to the same file, i.e. just implementing it doesn't mean it would fix this issue. It's a consideration that we might want to add there, for when we implement it.
When DbLogReader reads a local file, it won't lose the stream session during a Scylla pod recreation event.
The K8S cluster will stream K8S Scylla node logs to the SCT runner even if Scylla pods get recreated/fail.
If a K8S Scylla node gets lost, we get a new one where there are no duplicated logs.
So, where is my understanding wrong? Why shouldn't the implementation of the mentioned task fix the mentioned problems?
The problem is with the decommission -> create new Scylla node scenario, where the new node gets the same name (as the pod does). So when the new node is created it runs start_db_log_reader_thread, which causes doubled log lines in sct.log (and duplicate events) because it re-reads system.log.
A new pod doesn't have logs from the previous one.
We can also get the logs of the previous pod using the kubectl logs pod/%name% -c=scylla --previous=true ... command.
This fact makes me think that the K8S node stores them separately.
So, do we really have a problem with the decommission scenario?
@vponomaryov you are forgetting that our log reading is split into two parts: one that writes into a file, and one that reads it. The file path stays the same even if we recreate a pod or decommission, since the pod name stays the same, and that's the name of the directory we are writing into.
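A simplified sketch of that split, with hypothetical function names (not the real SCT classes): one part streams pod logs into a local file, the other reads that file and raises events, and both derive the path from the pod name, which is why a recreated pod with the same name ends up reusing the same file.

def write_pod_log(pod_name, log_dir, stream):
    path = "{}/{}/system.log".format(log_dir, pod_name)   # path keyed by pod name
    with open(path, "a") as log_file:                     # append mode
        for line in stream:                               # lines streamed from the K8S API
            log_file.write(line)

def read_pod_log(pod_name, log_dir, handle_line):
    path = "{}/{}/system.log".format(log_dir, pod_name)   # same path as the writer
    with open(path) as log_file:                          # a fresh reader starts at offset 0,
        for line in log_file:                             # so old ERROR lines are handled again
            handle_line(line)                             # (tail-follow loop omitted for brevity)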
What will trigger the re-read if we read/write the local file without losing access to it?
So, the original pointer must always stay in place.
Not sure I understood your point, but if we get into a situation where we recreate the Node object, we start a new read thread on top of the same exact path.
The decommission flow is that we kill one instance and create a new one; maybe I'm missing something about the k8s implementation in that part, but on VMs that's what would happen.
So these are the options:
- not reuse the node names
- not reuse the same log names
- make it so that we don't create a new Node object again
Or am I missing a possibility that a failure to read might cause us to read again from the beginning? That's a whole different issue, where we'd need to keep track of how much we read in order to sync back into place if needed.
There's only a problem if we re-read from the system.log file; if a pod gets created we actually want to read from it (and we do anyway).
We also call stop_task_threads in the case of a K8S decommission.
Ok, now the problem is clear to me.
Issue description
- [ ] This issue is a regression.
- [ ] It is unknown if this issue is a regression.
Running the MultiDC 12h CI job, the disrupt_multiple_hard_reboot_node nemesis gets run multiple times.
It causes the following false error SCT events to appear much later:
2023-11-29 00:04:02.111: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=117f142d-5a7d-4ed9-a5eb-252d9ddb1747: type=RUNTIME_ERROR regex=std::runtime_error line_number=402357 node=sct-cluster-eu-north-1-rack-1-2
WARN 2023-11-28 18:45:49,861 [shard 0] commitlog_replayer - error replaying: std::runtime_error (unknown schema version e79ca8ba-6556-3f7d-925a-7f20cf57938c, table=)
...
2023-11-29 00:04:02.181: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=117f142d-5a7d-4ed9-a5eb-252d9ddb1747: type=RUNTIME_ERROR regex=std::runtime_error line_number=402412 node=sct-cluster-eu-north-1-rack-1-2
WARN 2023-11-28 18:45:49,861 [shard 0] commitlog_replayer - error replaying: std::runtime_error (unknown schema version d33236d4-9bdd-3c09-abf0-a0bc5edc2526, table=)
The gap between when the real event happened and when SCT handled it is about 5h20m.
So it looks like the disrupt_multiple_hard_reboot_node nemesis must be fixed the same way as it was done here: https://github.com/scylladb/scylla-cluster-tests/pull/6816
There may be other cases where we recreate Scylla pods non-gracefully.
Impact
False error SCT events
How frequently does it reproduce?
100% when running the 12h MultiDC job.
Installation details
Kernel Version: 5.10.198-187.748.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.2-20231001.646df23cc4b3 with build-id 367fcf1672d44f5cbddc88f946cf272e2551b85a
Operator Image: scylladb/scylla-operator:latest Operator Helm Version: v1.12.0-alpha.0-123-g24389ae Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest Cluster size: 4 nodes (i4i.4xlarge)
Scylla Nodes used in this run: No resources left at the end of the run
OS / Image: `` (k8s-eks: eu-north-1, eu-west-1 )
Test: longevity-scylla-operator-multidc-12h-eks
Test id: 61d76257-c01e-4e92-8908-682a75d4e7fb
Test name: scylla-operator/operator-master/eks/longevity-scylla-operator-multidc-12h-eks
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 61d76257-c01e-4e92-8908-682a75d4e7fb
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 61d76257-c01e-4e92-8908-682a75d4e7fb
Logs:
- kubernetes-61d76257.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/61d76257-c01e-4e92-8908-682a75d4e7fb/20231129_014613/kubernetes-61d76257.tar.gz
- kubernetes-must-gather-61d76257.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/61d76257-c01e-4e92-8908-682a75d4e7fb/20231129_014613/kubernetes-must-gather-61d76257.tar.gz
- db-cluster-61d76257.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/61d76257-c01e-4e92-8908-682a75d4e7fb/20231129_014613/db-cluster-61d76257.tar.gz
- sct-runner-events-61d76257.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/61d76257-c01e-4e92-8908-682a75d4e7fb/20231129_014613/sct-runner-events-61d76257.tar.gz
- sct-61d76257.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/61d76257-c01e-4e92-8908-682a75d4e7fb/20231129_014613/sct-61d76257.log.tar.gz
- loader-set-61d76257.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/61d76257-c01e-4e92-8908-682a75d4e7fb/20231129_014613/loader-set-61d76257.tar.gz
- monitor-set-61d76257.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/61d76257-c01e-4e92-8908-682a75d4e7fb/20231129_014613/monitor-set-61d76257.tar.gz
@vponomaryov can you estimate how long it would take? Looking through all of the places where we might kill a node in k8s, and applying this?
It shouldn't take long. But I don't really like multiplying the solution with manual renaming of the log file. I think it would be much better to do it using a K8S event watcher, like we do for IP changes. When a pod IP changes, we get a new pod, so we can handle it there without bothering ourselves about all the cases where we disrupt Scylla pods.
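A hedged sketch of the watcher idea: it uses the official kubernetes Python client, but the namespace, the local log path, and the rotate_node_log() helper are assumptions for illustration, not existing SCT code.

from kubernetes import client, config, watch

def watch_scylla_pod_recreation(namespace="scylla"):
    config.load_kube_config()
    core_v1 = client.CoreV1Api()
    known_uids = {}  # pod name -> last seen pod UID

    for event in watch.Watch().stream(core_v1.list_namespaced_pod, namespace=namespace):
        pod = event["object"]
        name, uid = pod.metadata.name, pod.metadata.uid
        if known_uids.get(name) not in (None, uid):
            # Same pod name but a new UID means the pod was recreated:
            # rotate the old local log so the reader starts on a fresh file.
            rotate_node_log("/sct/logs/{}/system.log".format(name))  # hypothetical helper and path
        known_uids[name] = uid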
OK, one way or the other, let's estimate how long it would take, and then we'll be able to weigh it against other solutions.
I don't think we should rotate logs when we reboot an instance. The root cause here is different from what was fixed in #6816; we need to find the root cause for this case.
@vponomaryov is correct that every time k8s regenerates the pod we'll see this, and we have a lot of actions that would cause it. His idea would cover all of those cases better, but it's a bit more complex to implement.
We need to understand the actual source of the issue before we can decide on a proper fix for it.