# Add support to mark storage-node/distribution-node as under maintenance by the node operators
## Context
When Storage & Distribution node operators take their nodes down for maintenance or upgrade purposes, they should be able to publish that action on chain so that Leads/Apps are aware of the node status.
Here are some loose requirements I have from a conversation with a lead:
- Both lead and worker, in each case, should be able to signal a given host as being unavailable for use. If a given worker has multiple distinct hosts for the same role, then presumably the flag should apply to a given host, not just the role, but the leads should perhaps be consulted on this.
- A new state is introduced for each host: a status `operational_status` with the following value range (`x`, `y` use epoch time, not blocks):
  - `normal`: all APIs work, as now.
  - `no_service(bool forced)`: no APIs should be expected to work. A more granular representation could be picked here, but it seems premature at this point.
  - `no_service_from(bool forced, x)`: normal mode before time `x`, `no_service` after.
  - `no_service_during(bool forced, x, y)`: normal mode before `x` and after `y`, `no_service` in between.
- The `forced` indicator means the state was entered by the lead, and it is meant to prevent the worker from unilaterally reversing it.
- Leads can set any host to any new status value at all times, using an extrinsic with a message over the working group lead remark for the corresponding group; the message should include a field for a free-text message explaining the background for the state change.
- Workers can set their own host status in the corresponding way, under the following more restrictive rules (see the sketch after this list): they can only set
  - any unforced non-service state, when currently either normal or in an unforced non-service state;
  - the normal state, if they are currently in an unforced non-service state.
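To make the allowed transitions concrete, here is a minimal TypeScript sketch of the status values and the worker-side rule above; all names are illustrative and not part of any existing Joystream package.

```typescript
// Illustrative only: models the proposed operational_status values and the
// restrictive transition rule that applies to workers (leads can set anything).

type OperationalStatus =
  | { kind: 'normal' }
  | { kind: 'no_service'; forced: boolean }
  | { kind: 'no_service_from'; forced: boolean; from: number } // epoch time
  | { kind: 'no_service_during'; forced: boolean; from: number; to: number } // epoch time

function workerCanTransition(current: OperationalStatus, next: OperationalStatus): boolean {
  const inUnforcedNoService = current.kind !== 'normal' && !current.forced

  // A worker may enter any *unforced* non-service state from `normal`
  // or from another unforced non-service state.
  if (next.kind !== 'normal') {
    return !next.forced && (current.kind === 'normal' || inUnforcedNoService)
  }

  // A worker may return to `normal` only from an unforced non-service state;
  // forced states can only be lifted by the lead.
  return inUnforcedNoService
}

// Example: a worker schedules maintenance, then cancels it; a forced state cannot be reversed.
const scheduled: OperationalStatus = { kind: 'no_service_from', forced: false, from: 1700000000 }
console.log(workerCanTransition({ kind: 'normal' }, scheduled)) // true
console.log(workerCanTransition(scheduled, { kind: 'normal' })) // true
console.log(workerCanTransition({ kind: 'no_service', forced: true }, { kind: 'normal' })) // false
```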
I think reasonable next steps are to:
- Prepare a detailed spec.
- Have both leads review it, but avoid feature bloat!
- Prepare an implementation plan with details about what pieces of software must be updated, what tests should be updated, and how to deploy safely.
## Update
It seems I missed that there are indeed distinct remarks for storage operators and distributor operators, so perhaps those should be used instead of the generic working group extrinsics; that would make the mappings a bit cleaner and simpler. However, I am not sure the leads could be handled this way, so some more careful examination is required.
Here is a feature specification highlighting all the required QN, metaprotocol & protobuf changes
## QN Changes
The following schema changes would be added to the query node's storage-related entities:
### New Variants
type NodeOperationalStatusNormal @variant {
_phantom: Int
}
type NodeOperationalStatusNoService @variant {
"Whether the state was set by the lead (true) or by the operator (false); when forced, the worker cannot unilaterally reverse it"
forced: Boolean!
}
type NodeOperationalStatusNoServiceFrom @variant {
"Whether the state was set by the lead (true) or by the operator (false); when forced, the worker cannot unilaterally reverse it"
forced: Boolean!
"The time from which the bucket will have no service"
from: DateTime!
}
type NodeOperationalStatusNoServiceDuring @variant {
"Whether the state was set by the lead (true) or by the operator (false); when forced, the worker cannot unilaterally reverse it"
forced: Boolean!
"The time from which the bucket will have no service"
from: DateTime!
"The time until which the bucket will have no service"
to: DateTime!
}
# Union of the variants above, referenced by the operator metadata entities below
union NodeOperationalStatus = NodeOperationalStatusNormal | NodeOperationalStatusNoService | NodeOperationalStatusNoServiceFrom | NodeOperationalStatusNoServiceDuring
type StorageBucketOperatorMetadata @entity {
"Optional node operational status"
nodeOperationalStatus: NodeOperationalStatus
# Other additional metadata ...
}
type DistributionBucketOperatorMetadata @entity {
"Optional node operational status"
nodeOperationalStatus: NodeOperationalStatus
# Other additional metadata ...
}
The following two event entities would be created, i.e. `StorageNodeOperationalStatusSetEvent` & `DistributionNodeOperationalStatusSetEvent`:
type StorageNodeOperationalStatusSetEvent implements Event @entity {
### GENERIC DATA ###
"(network}-{blockNumber}-{indexInBlock}"
id: ID!
"Hash of the extrinsic which caused the event to be emitted"
inExtrinsic: String
"Blocknumber of the block in which the event was emitted."
inBlock: Int!
"Network the block was produced in"
network: Network!
"Index of event in block from which it was emitted."
indexInBlock: Int!
### SPECIFIC DATA ###
"Storage bucket"
bucket: StorageBucket
"Storage bucket operator"
operator: StorageBucketOperator
"Related opening"
operationalStatus: BucketOperationalStatus!
}
type DistributionNodeOperationalStatusSetEvent implements Event @entity {
### GENERIC DATA ###
"(network}-{blockNumber}-{indexInBlock}"
id: ID!
"Hash of the extrinsic which caused the event to be emitted"
inExtrinsic: String
"Blocknumber of the block in which the event was emitted."
inBlock: Int!
"Network the block was produced in"
network: Network!
"Index of event in block from which it was emitted."
indexInBlock: Int!
### SPECIFIC DATA ###
"Distribution bucket"
bucket: DistributionBucket
"Distribution bucket operator"
operator: DistributionBucketOperator
"Related opening"
nodeOperationalStatus: NodeOperationalStatus!
}
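For illustration (not part of the spec), an app or lead could then read the indexed status with a query like the one below; the query-node URL is a placeholder, and the root `storageBuckets` field and `operatorMetadata` relation are assumed to follow the existing storage-bucket schema.

```typescript
// Sketch of a consumer-side read against the entities proposed above.
const QUERY_NODE_URL = 'https://query-node.example/graphql' // placeholder endpoint

const STORAGE_BUCKET_STATUS_QUERY = `
  query {
    storageBuckets {
      id
      operatorMetadata {
        nodeOperationalStatus {
          __typename
          ... on NodeOperationalStatusNoService { forced }
          ... on NodeOperationalStatusNoServiceFrom { forced from }
          ... on NodeOperationalStatusNoServiceDuring { forced from to }
        }
      }
    }
  }
`

async function fetchStorageBucketStatuses(): Promise<unknown> {
  const res = await fetch(QUERY_NODE_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: STORAGE_BUCKET_STATUS_QUERY }),
  })
  return (await res.json()).data
}
```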
## Orion Changes
Where should the mappings for this feature be added: QN, Orion, or both? Argus & Colossus are integrated with QN, so if Argus or Colossus depend on the indexed operational status (i.e. they need to know about operational peers for syncing purposes), then QN would need these mappings. If Orion needs to know about the operational status of the node (i.e. for resolving assets), then Orion would also need those mappings. In short, I think both QN & Orion need to include mappings for this feature, WDYT?
## Protobuf Changes
message NodeOperationalStatusMetadata {
enum OperationalStatus {
// Node is operating normally
NORMAL = 0;
// Node is not operational
NO_SERVICE = 1;
// Node won't be operational from the given date
NO_SERVICE_FROM = 2;
// Node won't be operational during the given date range
NO_SERVICE_DURING = 3;
}
// Node's Operational status to set
optional OperationalStatus status = 1;
// Date from which the node won't be operational (Should be set if status is NoServiceFrom or NoServiceDuring)
optional string no_service_from = 2;
// Date until which the node won't be operational (Should be set if status is NoServiceDuring)
optional string no_service_to = 3;
// Rationale for setting the current status
optional string rationale = 4;
}
// Added `operational_status` field to already existing `StorageBucketOperatorMetadata` message
message StorageBucketOperatorMetadata {
// ... (Other node operator metadata)
optional NodeOperationalStatusMetadata operational_status = 4; // Node's operational status to set
}
message SetNodeOperationalStatus {
optional NodeOperationalStatusMetadata operational_status = 1; // Node's operational status to set
optional string worker_id = 2; // Storage/Distribution Worker ID
optional string bucket_id = 3; // Storage/Distribution Bucket ID
}
message LeadRemarked {
// lead_remark extrinsic would emit event containing
// any one of the following serialized messages
oneof lead_remarked {
SetNodeOperationalStatus set_node_operational_status = 1;
// Other lead remark messages
// ...
}
}
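To make the payload construction concrete before describing the extrinsic flow below, here is an assumption-heavy sketch of how a lead-side client could serialize the remark: it assumes the messages above get added to the metadata-protobuf package and are compiled with its usual protobufjs static target (so each message exposes `create`/`encode(...).finish()` helpers), with field names camelCased as in the existing generated code.

```typescript
// Sketch only: `LeadRemarked` and `NodeOperationalStatusMetadata` are assumed
// exports for the new messages, following the package's protobufjs-generated API.
import { LeadRemarked, NodeOperationalStatusMetadata } from '@joystream/metadata-protobuf'

// Lead marks the host of worker 3 / bucket 5 as out of service, with a rationale.
const remarkBytes: Uint8Array = LeadRemarked.encode(
  LeadRemarked.create({
    setNodeOperationalStatus: {
      workerId: '3',
      bucketId: '5',
      operationalStatus: {
        status: NodeOperationalStatusMetadata.OperationalStatus.NO_SERVICE,
        rationale: 'Scheduled hardware maintenance',
      },
    },
  })
).finish()

// `remarkBytes` is what would be submitted as the message of the working group's
// lead_remark extrinsic for the corresponding (storage or distribution) group.
```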
- Leads would be able to set the host to any operational status using the `LeadRemarked::SetNodeOperationalStatus` protobuf message (which would include the `worker_id` & `bucket_id` for which the status needs to be set) over the working group's `lead_remark` extrinsic.
- Workers would use the simpler `NodeOperationalStatusMetadata` message, either over the exclusive `*_operator_remark` extrinsics, i.e. `storage_operator_remark` and `distribution_operator_remark`, or over the `set_*_operator_metadata` extrinsics, i.e. `set_storage_operator_metadata` and `set_distribution_operator_metadata`, to set the operational status of their own host (a worker-side sketch follows below).
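A corresponding worker-side sketch using the `set_storage_operator_metadata` route is below; the extrinsic parameter order and the metadata-protobuf exports are assumptions to verify against the actual runtime and package.

```typescript
// Sketch only: a storage operator setting its own status. The extrinsic signature
// (workerId, bucketId, metadata) and the StorageBucketOperatorMetadata export are assumptions.
import { ApiPromise, WsProvider } from '@polkadot/api'
import { StorageBucketOperatorMetadata, NodeOperationalStatusMetadata } from '@joystream/metadata-protobuf'

async function announceMaintenanceWindow(workerId: number, bucketId: number, fromIso: string, toIso: string) {
  // Joystream type registration omitted for brevity.
  const api = await ApiPromise.create({ provider: new WsProvider('ws://localhost:9944') })

  // Only the operational_status field is populated here; whether other metadata
  // fields are preserved depends on how the mapping merges the message.
  const metadataBytes = StorageBucketOperatorMetadata.encode({
    operationalStatus: {
      status: NodeOperationalStatusMetadata.OperationalStatus.NO_SERVICE_DURING,
      noServiceFrom: fromIso,
      noServiceTo: toIso,
      rationale: 'Planned node upgrade',
    },
  }).finish()

  const tx = api.tx.storage.setStorageOperatorMetadata(workerId, bucketId, metadataBytes)
  // ... sign and send with the operator's key, e.g. await tx.signAndSend(operatorKeyPair)
}
```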
## CLI commands
The following CLI commands would be added/updated in both Argus & Colossus to set the node's operational status:
- Add `leader:set-node-operational-status`
- Update `operator:set-metadata`
## Integration tests
Integration tests need to be added for the mappings, both in the monorepo as well as in Orion.
Awesome work @zeeshanakram3, perhaps @mnaamani can review this?
Proposal looks great.
I would recommend using the `set_*_operator_metadata` approach for the workers, given that the mappings test individual fields of the message, so the operator can use it to just set the new `NodeOperationalStatusMetadata` field. It also makes logical sense because we are saving this status in the operator metadata.
As for the lead setting operators' metadata, it only makes sense to do it via the `lead_remark` extrinsic. However, I see you are introducing a new message called `LeadRemarked`; I assume this will replace the existing `RemarkMetadataAction`. I agree with the renaming, it makes more sense. This means we need to bump the major version of the metadata-protobuf package.
message RemarkMetadataAction {
oneof action {
ModeratePost moderate_post = 1;
}
}
Of course you will also need to add a new `NodeOperationalStatusMetadata` field to the `DistributionBucketOperatorMetadata` message.
In the mappings you will need to handle the case where the lead might run `leader:set-node-operational-status` before the operator sets their metadata. The mapping that handles the lead remark will need to validate `worker_id` and `bucket_id`, as they are not validated in the runtime (a rough sketch follows below).
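For example, the lead-remark mapping could guard roughly like this; entity names and import paths mirror existing QN conventions but are assumptions, not final code.

```typescript
// Hedged sketch of the validation the lead-remark mapping would need to do,
// since worker_id / bucket_id are not checked by the runtime.
import { DatabaseManager } from '@joystream/hydra-common'
import { StorageBucket } from 'query-node/dist/model' // assumed entity import path

export async function applySetNodeOperationalStatus(
  store: DatabaseManager,
  workerId: string,
  bucketId: string
  // the decoded NodeOperationalStatusMetadata would be passed in as well
): Promise<void> {
  const bucket = await store.get(StorageBucket, { where: { id: bucketId } })
  if (!bucket) {
    // Treat it like other invalid metadata: log and ignore rather than failing the block.
    console.warn(`SetNodeOperationalStatus: unknown bucket ${bucketId}, ignoring remark`)
    return
  }
  // Also verify that workerId is the bucket's current operator, and create the
  // operator metadata entity if the lead remark arrives before the operator has set it.
}
```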
I think it goes without saying that all DateTime values would be in UTC?
Will the value of `NodeOperationalStatus` in the database be the `normal` variant when the database is reset, or if it is not explicitly set?