scylla-cluster-tests
The `This node was decommissioned and will not rejoin the ring` RAFT error is reported as an `error` instead of a `warning`
Prerequisites
- [ ] Are you rebased to master ?
- [ ] Is it reproducible ?
- [ ] Did you perform a cursory search if this issue isn't opened ?
Versions
- SCT: branch-5.2
- scylla: branch-5.2
Logs
Description
While running the `disrupt_decommission_streaming_err` nemesis (which passed) in the scylla-5.2/longevity-10gb-3h-azure-test#27 CI job, the following SCT error event appeared:
2023-09-20 23:46:40.391 <2023-09-20 23:46:26.000>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=00e27382-de30-4805-ac0b-cf08519c824b: type=RUNTIME_ERROR regex=std::runtime_error line_number=37939 node=longevity-10gb-3h-5-2-db-node-a7809cd7-eastus-4
2023-09-20T23:46:26+00:00 longevity-10gb-3h-5-2-db-node-eastus-4 !ERR | scylla[1145]: [shard 0] init - Startup failed: std::runtime_error (This node was decommissioned and will not rejoin the ring unless override_decommission=true has been set,or all existing data is removed and the node is bootstrapped again)
However, it is expected to be a warning according to the following code in the `sdcm/utils/raft/__init__.py` module:
    @staticmethod
    def get_severity_change_filters_scylla_start_failed(timeout: int | None = None) -> tuple:
        return (
            EventsSeverityChangerFilter(new_severity=Severity.WARNING,
                                        event_class=DatabaseLogEvent.DATABASE_ERROR,
                                        regex=".*storage_service - decommission.*Operation failed",
                                        extra_time_to_expiration=timeout),
            EventsSeverityChangerFilter(new_severity=Severity.WARNING,
                                        event_class=DatabaseLogEvent.DATABASE_ERROR,
                                        regex=".*This node was decommissioned and will not rejoin the ring",
                                        extra_time_to_expiration=timeout),
            EventsSeverityChangerFilter(new_severity=Severity.WARNING,
                                        event_class=DatabaseLogEvent.RUNTIME_ERROR,
                                        regex=".*Startup failed: std::runtime_error.*is removed from the cluster",
                                        extra_time_to_expiration=timeout),
            EventsSeverityChangerFilter(new_severity=Severity.WARNING,
                                        event_class=DatabaseLogEvent.DATABASE_ERROR,
                                        regex=".*gossip - is_safe_for_restart.*status=LEFT",
                                        extra_time_to_expiration=timeout),
            EventsSeverityChangerFilter(new_severity=Severity.WARNING,
                                        event_class=DatabaseLogEvent.RUNTIME_ERROR,
                                        regex=".*init - Startup failed: std::runtime_error.*already exists, cancelling join",
                                        extra_time_to_expiration=timeout)
        )
Steps to Reproduce
- Run the mentioned CI job
- See error
Expected behavior: The SCT event is raised with warning severity.
Actual behavior: The SCT event is raised with error severity.
@aleksbykov ^
The event we get is RUNTIME_ERROR, while we filter on DATABASE_ERROR. The filter should cover both.
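To illustrate the mismatch, here is a minimal standalone sketch. It does not use the real `sdcm` classes: `LogEvent`, `SeverityFilter`, and `final_severity` are hypothetical stand-ins, and the assumption is that a filter only changes severity when both the event type and the regex match. Under that assumption, the regex matches the log line, but a filter registered only for DATABASE_ERROR never touches a RUNTIME_ERROR event.

```python
import re
from dataclasses import dataclass


# Hypothetical stand-ins for SCT's event and filter types (not the real sdcm API).
@dataclass
class LogEvent:
    event_type: str  # e.g. "RUNTIME_ERROR" or "DATABASE_ERROR"
    line: str
    severity: str = "ERROR"


@dataclass
class SeverityFilter:
    event_type: str
    regex: str
    new_severity: str = "WARNING"

    def apply(self, event: LogEvent) -> None:
        # Assumption: both the event type and the regex must match.
        if event.event_type == self.event_type and re.match(self.regex, event.line):
            event.severity = self.new_severity


PATTERN = ".*This node was decommissioned and will not rejoin the ring"
LINE = (
    "scylla[1145]: [shard 0] init - Startup failed: std::runtime_error "
    "(This node was decommissioned and will not rejoin the ring unless "
    "override_decommission=true has been set,or all existing data is removed "
    "and the node is bootstrapped again)"
)


def final_severity(filters: list) -> str:
    """Run a RUNTIME_ERROR event through the filters and return its severity."""
    event = LogEvent(event_type="RUNTIME_ERROR", line=LINE)
    for severity_filter in filters:
        severity_filter.apply(event)
    return event.severity


# Current situation: the pattern is registered for DATABASE_ERROR only.
only_database = [SeverityFilter("DATABASE_ERROR", PATTERN)]
# Suggested situation: the same pattern registered for both event types.
both = [SeverityFilter(t, PATTERN) for t in ("DATABASE_ERROR", "RUNTIME_ERROR")]

print(final_severity(only_database))  # ERROR
print(final_severity(both))           # WARNING
```

Registering the same pattern for both event types keeps the filter working regardless of which regex classifies the log line first.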
Should be fixed with https://github.com/scylladb/scylla-cluster-tests/pull/6360