scylla-cluster-tests

The `This node was decommissioned and will not rejoin the ring` RAFT error is considered an `error` instead of `warning`

Open vponomaryov opened this issue 2 years ago • 3 comments

Prerequisites

  • [ ] Are you rebased to master?
  • [ ] Is it reproducible?
  • [ ] Did you perform a cursory search to check that this issue isn't already opened?

Versions

  • SCT: branch-5.2
  • scylla: branch-5.2

Logs

  • test_id: a7809cd7-8c52-4c96-9482-e786811330cf
  • job log: Argus, CI

Description

While running the disrupt_decommission_streaming_err nemesis (which passed) in the scylla-5.2/longevity-10gb-3h-azure-test#27 CI job, the following SCT error event appeared:

2023-09-20 23:46:40.391 <2023-09-20 23:46:26.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=00e27382-de30-4805-ac0b-cf08519c824b: type=RUNTIME_ERROR regex=std::runtime_error line_number=37939 node=longevity-10gb-3h-5-2-db-node-a7809cd7-eastus-4
2023-09-20T23:46:26+00:00 longevity-10gb-3h-5-2-db-node-eastus-4      !ERR | scylla[1145]:  [shard 0] init - Startup failed: std::runtime_error (This node was decommissioned and will not rejoin the ring unless override_decommission=true has been set,or all existing data is removed and the node is bootstrapped again)

But it is expected to be a warning, according to the following code in the sdcm/utils/raft/__init__.py module:

    @staticmethod                                                                                      
    def get_severity_change_filters_scylla_start_failed(timeout: int | None = None) -> tuple:
        return (                                                                                       
            EventsSeverityChangerFilter(new_severity=Severity.WARNING,                                 
                                        event_class=DatabaseLogEvent.DATABASE_ERROR,                   
                                        regex=".*storage_service - decommission.*Operation failed", 
                                        extra_time_to_expiration=timeout),                             
            EventsSeverityChangerFilter(new_severity=Severity.WARNING,                                 
                                        event_class=DatabaseLogEvent.DATABASE_ERROR,                   
                                        regex=".*This node was decommissioned and will not rejoin the ring",
                                        extra_time_to_expiration=timeout),                             
            EventsSeverityChangerFilter(new_severity=Severity.WARNING,                                 
                                        event_class=DatabaseLogEvent.RUNTIME_ERROR,                    
                                        regex=".*Startup failed: std::runtime_error.*is removed from the cluster",
                                        extra_time_to_expiration=timeout),                             
            EventsSeverityChangerFilter(new_severity=Severity.WARNING,                                 
                                        event_class=DatabaseLogEvent.DATABASE_ERROR,                   
                                        regex=".*gossip - is_safe_for_restart.*status=LEFT",           
                                        extra_time_to_expiration=timeout),                             
            EventsSeverityChangerFilter(new_severity=Severity.WARNING,                                 
                                        event_class=DatabaseLogEvent.RUNTIME_ERROR,                    
                                        regex=".*init - Startup failed: std::runtime_error.*already exists, cancelling join",
                                        extra_time_to_expiration=timeout)                              
        )
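
As a quick sanity check on the second filter above, the sketch below tests its regex against the logged message using plain Python `re`. It assumes only standard `re.search` semantics, not SCT's actual matching code, and the variable names are illustrative.

```python
import re

# Message part of the log line from the event above
# (timestamp and hostname prefix omitted).
log_line = (
    "scylla[1145]:  [shard 0] init - Startup failed: std::runtime_error "
    "(This node was decommissioned and will not rejoin the ring unless "
    "override_decommission=true has been set,or all existing data is removed "
    "and the node is bootstrapped again)"
)

# Regex copied from the second EventsSeverityChangerFilter above.
pattern = re.compile(r".*This node was decommissioned and will not rejoin the ring")

# True when the pattern is found anywhere in the line.
regex_matches = pattern.search(log_line) is not None
print(regex_matches)
```

The pattern covers the message here, so the regex text alone does not explain why the severity stayed at ERROR; the `event_class` argument of that filter is the other thing worth checking.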

Steps to Reproduce

  1. Run the mentioned CI job
  2. See error

Expected behavior: The SCT event is raised with WARNING severity.

Actual behavior: The SCT event is raised with ERROR severity.

vponomaryov, Sep 22 '23 17:09

@aleksbykov ^

vponomaryov, Sep 22 '23 17:09

The event we get is RUNTIME_ERROR, while we filter only DATABASE_ERROR for this message. The filter should cover both.

fruch, Sep 26 '23 22:09
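
The mismatch described in the comment above can be illustrated with a minimal, self-contained model. This is only a sketch: `Event`, `SeverityFilter`, and `effective_severity` are hypothetical stand-ins, not SCT classes, and the model assumes a filter applies only when both the event type and the regex match.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a database log event."""
    type: str            # e.g. "RUNTIME_ERROR" or "DATABASE_ERROR"
    line: str
    severity: str = "ERROR"


@dataclass(frozen=True)
class SeverityFilter:
    """Hypothetical stand-in for a severity-changing filter."""
    event_type: str
    regex: str
    new_severity: str = "WARNING"

    def applies_to(self, event: Event) -> bool:
        # Assumption: both the event type and the regex must match.
        return (event.type == self.event_type
                and re.search(self.regex, event.line) is not None)


def effective_severity(event: Event, filters: list) -> str:
    """Return the severity after applying the first matching filter."""
    for flt in filters:
        if flt.applies_to(event):
            return flt.new_severity
    return event.severity


MESSAGE = ".*This node was decommissioned and will not rejoin the ring"

# Shortened version of the reported line; it was classified as RUNTIME_ERROR.
event = Event(
    type="RUNTIME_ERROR",
    line="init - Startup failed: std::runtime_error "
         "(This node was decommissioned and will not rejoin the ring",
)

# Only a DATABASE_ERROR filter exists for this message, as in the quoted code.
current_filters = [SeverityFilter("DATABASE_ERROR", MESSAGE)]
# Covering both event types, as suggested in the comment above.
both_filters = current_filters + [SeverityFilter("RUNTIME_ERROR", MESSAGE)]

before = effective_severity(event, current_filters)
after = effective_severity(event, both_filters)
print(before, after)
```

In this model the event keeps its ERROR severity with the DATABASE_ERROR-only filter and drops to WARNING once a RUNTIME_ERROR filter with the same regex is added. Whether the actual fix takes this exact shape is not shown here.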

Should be fixed with https://github.com/scylladb/scylla-cluster-tests/pull/6360

aleksbykov, Sep 29 '23 08:09