Add a job in SM to collect Scylla Doctor from whole cluster

Open tarzanek opened this issue 9 months ago • 9 comments

SM and its agents can be leveraged to collect, from a central place, the state of all nodes using Scylla Doctor (https://github.com/scylladb/scylla-doctor).

This will help support quickly verify customers' clusters and their health, and spot possible config drift.

Can a job be added to run SD on all nodes and collect its outputs?

tarzanek avatar Feb 18 '25 13:02 tarzanek

@karol-kokoszka can you triage? We can certainly add knowledge on how to run SD, or on the internal ways this is gathered for Scylla Cloud or others.

tarzanek avatar Feb 18 '25 13:02 tarzanek

@tarzanek How is Scylla Doctor called? Is it a CLI that needs to be executed on the hosts, or does it have an API? Or can it be called from any server (let's say the manager server VM)?

Scylla Manager does not SSH into hosts; it calls the Agent's API. That's why I'm asking how the job would be executed.

Do you want to merge it with the Scylla Manager task scheduler?

karol-kokoszka avatar Feb 18 '25 13:02 karol-kokoszka

It's a CLI command that needs to be executed on the target hosts (https://github.com/scylladb/scylla-doctor/tree/master/scylla-doctor#usage) as root.

Its results will be in a vitals file that will need to be downloaded to SM.

tarzanek avatar Feb 18 '25 13:02 tarzanek

> It's a CLI command that needs to be executed on the target hosts

This means we would need to call the agent to execute the CLI and collect the output.

> Its results will be in a vitals file that will need to be downloaded to SM

Assuming the doctor is executed through an API call to the agent, that's not a problem, as the file will just be in the payload.
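To make the "run the CLI on the host, return the result in the payload" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `run_and_collect`, the payload field names, and the `vitals.json` file name are hypothetical, and the demo runs a stand-in Python one-liner rather than Scylla Doctor itself, since the actual invocation and output location are not specified in this thread.

```python
import base64
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_collect(command, output_file, timeout=300):
    """Run a diagnostic command and package its result as a JSON-friendly dict.

    `command` is the argv list to execute. `output_file` is where the command
    is expected to write its result file (hypothetical; the real location of
    the vitals file is not given in this thread).
    """
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    payload = {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "file_b64": None,  # stays None if the command produced no file
    }
    path = Path(output_file)
    if path.is_file():
        # Base64 keeps arbitrary file bytes safe inside a JSON response body.
        payload["file_b64"] = base64.b64encode(path.read_bytes()).decode("ascii")
    return payload


def demo():
    """Exercise run_and_collect with a stand-in command that writes a small file."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "vitals.json"  # hypothetical file name
        script = (
            "import sys, pathlib; "
            "pathlib.Path(sys.argv[1]).write_text('{\"ok\": true}'); "
            "print('done')"
        )
        return run_and_collect([sys.executable, "-c", script, str(out)], out)
```

An agent endpoint built along these lines would serialize the returned dict as the response body. The sketch does not address the privilege question: the command runs with whatever permissions the calling process already has.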

@tarzanek could Scylla Doctor be imported into the agent codebase somehow? As a Go dependency, for example? UPDATE: It's Python, so it can't.

How do you see this job being scheduled? As part of the task scheduler in Manager (the same one we use for repair and backup)? Does it need to be scheduled, or is it more of an ad-hoc job?

karol-kokoszka avatar Feb 18 '25 13:02 karol-kokoszka

> It's a CLI command that needs to be executed on the target hosts (https://github.com/scylladb/scylla-doctor/tree/master/scylla-doctor#usage) as root

I'm concerned about the "as root" part. In general, sm-agent shouldn't have root access to the machine. It's kind of suspicious for sm-agent to be able to run arbitrary commands with sudo.

@karol-kokoszka do you know what permissions sm-agent has in the cloud?

Michal-Leszczynski avatar Feb 24 '25 09:02 Michal-Leszczynski

cc: @adambabik

karol-kokoszka avatar Feb 24 '25 09:02 karol-kokoszka

We should not proceed with this until we have a much better understanding of what problem we want to solve and what alternatives we have. We don't have any agreement, in terms of architecture, that Scylla Manager should be a hub that integrates various tools.

I see that we have Scylla Doctor as part of various DevOps workflows in ArgoWF. We also run it periodically. Why is that not enough?

adambabik avatar Feb 24 '25 12:02 adambabik

On-prem customers basically lack those workflows, so SM with its task engine looks like a good place to schedule tasks such as log collection.

tarzanek avatar Mar 27 '25 12:03 tarzanek

Ah, I see, so it's only about on-prem customers. Then this should be discussed during Manager planning. @karol-kokoszka @Michal-Leszczynski let's add it to the agenda for the Manager planning meeting.

adambabik avatar Mar 27 '25 14:03 adambabik