
[K8S][Nemesis] SCT considers an old DB log line as a new one and later creates false ERROR events based on it while running other nemeses

Open vponomaryov opened this issue 2 years ago • 22 comments

Prerequisites

  • [ ] Are you rebased to master?
  • [ ] Is it reproducible?
  • [ ] Did you perform a cursory search to check whether this issue is already open?

Versions

  • SCT: master
  • scylla: 2023.1.1-20230906.f4633ec973b0

Logs

  • test_id: b620982f-278f-4dd1-9e4e-fae3b6f22e9c
  • job log: Argus, CI

Description

If we run the disrupt_drain_kubernetes_node_then_replace_scylla_node nemesis, we expect the following DB error log line to appear:

ERROR 2023-11-12 01:09:06,980 [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)

It is expected because the Scylla process gets interrupted during the init phase.

We ignore it with the following code in SCT:

self.log.info('Mark %s (uid=%s) to be replaced', node, old_uid)
node.wait_for_svc()
with DbEventsFilter(                 
        db_event=DatabaseLogEvent.DATABASE_ERROR,
        # NOTE: ignore following expected error messages:
        #       'init - Startup failed: seastar::sleep_aborted (Sleep is aborted)'
        #       'init - Startup failed: seastar::gate_closed_exception (gate closed)'
        line="init - Startup failed: seastar"):
    node.mark_to_be_replaced()
    self._kubernetes_wait_till_node_up_after_been_recreated(node, old_uid=old_uid)
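
Note that DbEventsFilter suppresses matching events only while the with block is active. A minimal, self-contained sketch of that timing behavior (the events_filter helper below is illustrative only, not the real SCT event machinery):

import contextlib

active_filters = []

@contextlib.contextmanager
def events_filter(substring):
    # Suppress matching "events" only while the block is active.
    active_filters.append(substring)
    try:
        yield
    finally:
        active_filters.remove(substring)

def process_log_line(line):
    if any(substring in line for substring in active_filters):
        return "ignored"
    return "ERROR event" if "ERROR" in line else "ok"

line = "ERROR ... [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)"
with events_filter("init - Startup failed: seastar"):
    print(process_log_line(line))   # -> ignored, the filter is active
print(process_log_line(line))       # -> ERROR event, the same old line re-read later is not filtered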

But later, while running the disrupt_terminate_kubernetes_host_then_decommission_and_add_scylla_node and the disrupt_drain_kubernetes_node_then_decommission_and_add_scylla_node nemeses, we get the following false SCT events:

2023-11-12 02:03:29.905: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=6ad456dd-b19d-4e5f-8f39-a30aed4b3ce6 during_nemesis=TerminateKubernetesHostThenDecommissionAndAddScyllaNode: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=4913 node=sct-cluster-2-eu-north-1-rack-1-2
ERROR 2023-11-12 01:09:06,980 [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)
...
2023-11-12 02:26:16.605: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=6ad456dd-b19d-4e5f-8f39-a30aed4b3ce6 during_nemesis=DrainKubernetesNodeThenDecommissionAndAddScyllaNode: type=DATABASE_ERROR regex=(^ERROR|!\s*?ERR).*\[shard.*\] line_number=4913 node=sct-cluster-2-eu-north-1-rack-1-2
ERROR 2023-11-12 01:09:06,980 [shard 0] init - Startup failed: seastar::sleep_aborted (Sleep is aborted)

In the SCT events above, we can see that the old DB log line from 01:09:06 was used to create the false ERROR events.

Steps to Reproduce

  1. Run the disrupt_drain_kubernetes_node_then_replace_scylla_node nemesis somewhere in the middle of a test run
  2. Run other nemeses later, such as disrupt_terminate_kubernetes_host_then_decommission_and_add_scylla_node and disrupt_drain_kubernetes_node_then_decommission_and_add_scylla_node
  3. See false SCT ERROR events

Expected behavior: SCT must not consider an old DB log line as a new one and must not create ERROR events based on it.

Actual behavior: SCT creates false ERROR events based on old, expected DB log lines.

vponomaryov avatar Nov 14 '23 15:11 vponomaryov

this is weird; from the SCT logs, it looks like we have two threads reading the same log...

fruch avatar Nov 14 '23 15:11 fruch

we can see:

< t:2023-11-12 02:03:23,578 f:remote_logger.py l:329  c:K8sClientLogger      p:INFO  > Starting logger for pod sct-cluster-2-eu-north-1-rack-1-2

which probably happened after logger initialization, and this caused self._last_log_timestamp to be reset to an empty value. In that case we reread the whole log. The strange thing is that it didn't append to the db log but looks like it was overwritten (even though we open the log in append mode).

I don't know yet why the logger was reinitialized; I'll try to get back to it tomorrow.
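
A minimal sketch of the suspected mechanism, assuming the logger skips lines not newer than self._last_log_timestamp (the class below is illustrative, not the actual K8sClientLogger code):

class PodLogTailer:
    def __init__(self):
        # Reset to an empty value whenever the logger is (re)initialized.
        self._last_log_timestamp = ""

    def stream(self, lines):
        # lines: iterable of (iso_timestamp, text) pairs, oldest first.
        for timestamp, text in lines:
            # With an empty _last_log_timestamp nothing gets skipped,
            # so every already-seen line is written to the local db log again.
            if self._last_log_timestamp and timestamp <= self._last_log_timestamp:
                continue
            self._last_log_timestamp = timestamp
            yield text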

soyacz avatar Nov 15 '23 17:11 soyacz

Ok, I understand the issue. The problem is that we decommission a node (k8s) and the new one gets the same name. This causes a new sdcm.db_log_reader.DbLogReader to be created, which opens the already existing log and analyzes it for events. It is not a recent regression; the previous job passed with the same doubled logs (but I don't know why we started to fail on it now).

Currently I see 2 options to fix it:

  1. make new nodes in k8s get a different name (I don't know if this is under our control, @vponomaryov?)
  2. make DbLogReader start reading from some pointer (see the sketch below)

I'm open to other options.
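
A rough sketch of option 2, assuming DbLogReader could persist the byte offset it has already processed (the names below are hypothetical, not the current sdcm.db_log_reader API):

import os

class OffsetAwareLogReader:
    # Hypothetical reader that resumes from a stored byte offset instead of
    # re-scanning the whole system.log after the Node object is recreated.

    def __init__(self, log_path, start_offset=0):
        self._log_path = log_path
        self._offset = start_offset

    def read_new_lines(self):
        if not os.path.exists(self._log_path):
            return []
        with open(self._log_path, encoding="utf-8", errors="replace") as log_file:
            log_file.seek(self._offset)       # skip everything already analyzed
            new_lines = log_file.readlines()
            self._offset = log_file.tell()    # remember where we stopped
        return new_lines

The stored offset would have to survive Node object recreation (e.g. be kept next to the log file) for this to help in the decommission scenario.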

soyacz avatar Nov 16 '23 11:11 soyacz

one more option is to move the file, so we won't lose it, but still start on a fresh file.

it's a bit more confusing, since it's not the same as VMs, and we'll need to keep in mind that we have multiple logs under the same node name.
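
A rough sketch of that "move the file" idea (the path and hook point are assumptions, not the actual SCT code):

import os
import time

def rotate_node_system_log(log_path):
    # Move the existing log aside so the freshly created node/reader starts
    # from an empty file, while the old content is preserved on disk.
    if os.path.exists(log_path):
        rotated = "{}.{}".format(log_path, time.strftime("%Y%m%d_%H%M%S"))
        os.rename(log_path, rotated)

The rotated *.log.<timestamp> files would then also need to be picked up by the k8s log collector, as noted below.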

fruch avatar Nov 16 '23 12:11 fruch

in VMs, after decommission we get a new node, so we don't adhere to that anyway. Seems easy to implement, but we need to remember to also update the log collector for k8s.

soyacz avatar Nov 16 '23 12:11 soyacz

@soyacz, @fruch this bug is one more reason to finally implement:

  • https://github.com/scylladb/qa-tasks/issues/1148

It will solve the re-read problem as well as the log read lags which cause our log-dependent nemeses to fail on a regular basis.

vponomaryov avatar Nov 16 '23 12:11 vponomaryov

it would solve it only if it won't be writing to the same file, i.e. just implementing it doesn't mean it would fix this issue. It's a consideration that we might want to add there for when we do implement it.

fruch avatar Nov 16 '23 13:11 fruch

When DbLogReader reads a local file, it won't lose the stream session during a Scylla pod recreation event. The K8S cluster will stream K8S Scylla node logs to the SCT runner even if Scylla pods get recreated or fail. If a K8S Scylla node gets lost, we get a new one with no duplicated logs.

So, where is my understanding wrong? Why shouldn't the implementation of the mentioned task fix the mentioned problems?

vponomaryov avatar Nov 16 '23 13:11 vponomaryov

The problem is with the decommission -> create new scylla node scenario, where the new node gets the same name (as the pod does). So when the new node is created, it runs start_db_log_reader_thread, which causes a doubled log in sct.log (and duplicate events) because it rereads system.log.

soyacz avatar Nov 16 '23 13:11 soyacz

A new pod doesn't have logs from the previous one. We can also get the logs of the previous pod using the kubectl logs pod/%name% -c=scylla --previous=true ... cmd. This fact makes me think that the K8S node stores them separately.

So, do we really have a problem with the decommission scenario?

vponomaryov avatar Nov 16 '23 13:11 vponomaryov

@vponomaryov you are forgetting that our log reading is split into two parts: one that writes into a file, and one that reads it. The file path stays the same even if we recreate a pod or decommission, since the pod name stays the same, and that's the name of the directory we are writing into.

fruch avatar Nov 16 '23 13:11 fruch

What would trigger the re-read if we read/write the local file without losing access to it? The original read pointer should always stay in place.

vponomaryov avatar Nov 16 '23 14:11 vponomaryov

not sure I understood your point, but if we get to a situation where we recreate the Node object, we start a new read thread on top of the same exact path.

The decommission flow is that we kill one instance and create a new one; maybe I'm missing something in the k8s implementation in that part, but on VMs that's what would happen.

so there are the options:

  • not reuse the node names
  • not reuse the same log names (roughly sketched below)
  • make it so that we don't create a new Node object again

Or am I missing a possibility that a failure to read might cause us to read again from the beginning? That's a whole different issue where we'll need to keep track of how much we have read, in order to sync back into place if needed.
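
For the "not reuse the same log names" option, something along these lines could work (the pod_uid parameter is an assumption about what is available when the Node object is recreated):

import os

def node_system_log_path(logdir, node_name, pod_uid):
    # Build a per-incarnation log path so a recreated pod with the same name
    # does not append to (or re-read) the previous pod's log file.
    node_dir = os.path.join(logdir, node_name)
    os.makedirs(node_dir, exist_ok=True)
    return os.path.join(node_dir, "system-{}.log".format(pod_uid))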

fruch avatar Nov 16 '23 14:11 fruch

There's only a problem when rereading the system.log file; when a pod gets created we actually want to read from it (and we do anyway).

soyacz avatar Nov 16 '23 14:11 soyacz

not sure I understood your point, but if we get to a situation where we recreate the Node object, we start a new read thread on top of the same exact path.

The decommission flow is that we kill one instance and create a new one; maybe I'm missing something in the k8s implementation in that part, but on VMs that's what would happen.

We also call stop_task_threads in the case of a K8S decommission. Ok, now the problem is clear to me.

vponomaryov avatar Nov 16 '23 15:11 vponomaryov

Issue description

  • [ ] This issue is a regression.
  • [ ] It is unknown if this issue is a regression.

When running the MultiDC 12h CI job, the disrupt_multiple_hard_reboot_node nemesis gets run multiple times, and it causes the following false error SCT events to appear much later:

2023-11-29 00:04:02.111: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=117f142d-5a7d-4ed9-a5eb-252d9ddb1747: type=RUNTIME_ERROR regex=std::runtime_error line_number=402357 node=sct-cluster-eu-north-1-rack-1-2
WARN  2023-11-28 18:45:49,861 [shard  0] commitlog_replayer - error replaying: std::runtime_error (unknown schema version e79ca8ba-6556-3f7d-925a-7f20cf57938c, table=)
...
2023-11-29 00:04:02.181: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=117f142d-5a7d-4ed9-a5eb-252d9ddb1747: type=RUNTIME_ERROR regex=std::runtime_error line_number=402412 node=sct-cluster-eu-north-1-rack-1-2
WARN  2023-11-28 18:45:49,861 [shard  0] commitlog_replayer - error replaying: std::runtime_error (unknown schema version d33236d4-9bdd-3c09-abf0-a0bc5edc2526, table=)

The gap between when the real event happened and when SCT handled it is about 5h20m. So it looks like the disrupt_multiple_hard_reboot_node nemesis must be fixed the same way as it is done here: https://github.com/scylladb/scylla-cluster-tests/pull/6816

There may be other cases where we recreate Scylla pods non-gracefully.

Impact

False error SCT events

How frequently does it reproduce?

100% running 12h MultiDC job.

Installation details

Kernel Version: 5.10.198-187.748.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.2-20231001.646df23cc4b3 with build-id 367fcf1672d44f5cbddc88f946cf272e2551b85a

Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.12.0-alpha.0-123-g24389ae
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: `` (k8s-eks: eu-north-1, eu-west-1)

Test: longevity-scylla-operator-multidc-12h-eks
Test id: 61d76257-c01e-4e92-8908-682a75d4e7fb
Test name: scylla-operator/operator-master/eks/longevity-scylla-operator-multidc-12h-eks
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 61d76257-c01e-4e92-8908-682a75d4e7fb
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 61d76257-c01e-4e92-8908-682a75d4e7fb

Logs:

Jenkins job URL Argus

vponomaryov avatar Nov 30 '23 15:11 vponomaryov

@vponomaryov can you estimate how long it would take to find all of the places where we might kill a node in k8s and apply this?

fruch avatar Nov 30 '23 16:11 fruch

It shouldn't take long. But I don't really like multiplying this solution with manual renaming of the log file. I think it would be much better to do it using a K8S event watcher, like we do for IP changes. When a pod's IP changes, we have a new pod, so we can handle it there without bothering ourselves about all the cases where we disrupt Scylla pods.
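
A rough sketch of such a watcher, using the kubernetes Python client (rotate_node_system_log_for is a hypothetical SCT-side hook, not an existing function):

from kubernetes import client, config, watch


def rotate_node_system_log_for(pod_name):
    # Placeholder for the SCT-side action, e.g. renaming the node's local log file.
    print("pod {} was recreated -> rotate its local log file here".format(pod_name))


def watch_scylla_pods(namespace="scylla"):
    # React when a pod with a known name comes back with a new UID,
    # i.e. the pod was recreated and its log history starts from scratch.
    config.load_kube_config()
    core_v1 = client.CoreV1Api()
    known_uids = {}

    for event in watch.Watch().stream(core_v1.list_namespaced_pod, namespace=namespace):
        pod = event["object"]
        name, uid = pod.metadata.name, pod.metadata.uid
        if name in known_uids and known_uids[name] != uid:
            rotate_node_system_log_for(name)
        known_uids[name] = uid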

vponomaryov avatar Dec 02 '23 08:12 vponomaryov

o.k., one way or the other, let's estimate how long it would take, and then we'll be able to weigh it against other solutions.

fruch avatar Dec 03 '23 07:12 fruch

The gap between when the real event happened and when SCT handled it is about 5h20m. So it looks like the disrupt_multiple_hard_reboot_node nemesis must be fixed the same way as it is done here: #6816

There may be other cases where we recreate Scylla pods non-gracefully.

I don't think we should rotate logs when we reboot an instance - the root cause here is different from the one fixed in #6816. We need to find the root cause for this case.

soyacz avatar Dec 04 '23 06:12 soyacz

@vponomaryov is correct that every time k8s regenerates the pod, we'll see this, and we have a lot of actions that would cause it. His idea would be better at covering all of them, but it's a bit more complex to implement.

fruch avatar Dec 04 '23 07:12 fruch

We need to understand the actual source of the issue before we can decide on a proper fix for it.

fruch avatar Dec 04 '23 10:12 fruch