# Add support to mark storage-node/distribution-node as under maintenance by the node operators
## Context
When Storage & Distribution node operators take their nodes down for maintenance or upgrade purposes, they should be able to publish that action on chain so that Leads/Apps are aware of the node status.
Here are some loose requirements I have from a conversation with a lead:
- Both lead and worker, in each case, should be able to signal a given host as being unavailable for use. If a given worker has multiple distinct hosts for the same role, then presumably the flag should apply to a given host, not just the role, but the leads should perhaps be consulted on this.
- A new state is introduced for each host: a status `operational_status` with the following value range (`x`, `y` use epoch time, not blocks):
  - `normal`: all APIs work, as now.
  - `no_service(bool forced)`: no APIs should be expected to work. A more granular representation could be picked here, but it seems premature at this point.
  - `no_service_from(bool forced, x)`: normal mode before time `x`, `no_service` after.
  - `no_service_during(bool forced, x, y)`: normal mode before `x` and after `y`, `no_service` in between.
- The `forced` indicator means the state was entered by the lead, and it is meant to prevent the worker from unilaterally reversing it.
- Leads can set any host to any new status value at all times, using an extrinsic with a message over the working group lead remark for the corresponding group; the message should include a field for a free-text message explaining the background for the state change.
- Workers can set their own host status in the corresponding way, under the following more restrictive rules (see the sketch after this list): they can only set
  - any unforced non-service state, when currently either normal or in an unforced non-service state;
  - the normal state, if they are currently in an unforced non-service state.
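To make the allowed transitions concrete, here is a minimal TypeScript sketch of the status values and the worker-side rule above; all names are illustrative and not part of any existing Joystream package.

```typescript
// Illustrative only: models the proposed operational_status values and the
// restrictive transition rule that applies to workers (leads can set anything).

type OperationalStatus =
  | { kind: 'normal' }
  | { kind: 'no_service'; forced: boolean }
  | { kind: 'no_service_from'; forced: boolean; from: number } // epoch time
  | { kind: 'no_service_during'; forced: boolean; from: number; to: number } // epoch time

function workerCanTransition(current: OperationalStatus, next: OperationalStatus): boolean {
  const inUnforcedNoService = current.kind !== 'normal' && !current.forced

  // A worker may enter any *unforced* non-service state from `normal`
  // or from another unforced non-service state.
  if (next.kind !== 'normal') {
    return !next.forced && (current.kind === 'normal' || inUnforcedNoService)
  }

  // A worker may return to `normal` only from an unforced non-service state;
  // forced states can only be lifted by the lead.
  return inUnforcedNoService
}

// Example: a worker schedules maintenance, then cancels it; a forced state cannot be reversed.
const scheduled: OperationalStatus = { kind: 'no_service_from', forced: false, from: 1700000000 }
console.log(workerCanTransition({ kind: 'normal' }, scheduled)) // true
console.log(workerCanTransition(scheduled, { kind: 'normal' })) // true
console.log(workerCanTransition({ kind: 'no_service', forced: true }, { kind: 'normal' })) // false
```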
I think reasonable next steps are to:
- Prepare a detailed spec.
- Have both leads review it, but avoid feature bloat!
- Prepare an implementation plan with details about what pieces of software must be updated, what tests should be updated, and how to deploy safely.
## Update
It seems I missed that there are indeed distinct remarks for storage operators and distributor operators, so perhaps those should be used instead of the generic working group extrinsics; that would make the mappings a bit cleaner and simpler. However, I am not sure the leads could be handled this way, so some more careful examination is required.
Here is a feature specification highlighting all the required QN, metaprotocol & protobuf changes
## QN Changes
The following schema changes would be added to the query node's storage-related entities:
### New Variants
type NodeOperationalStatusNormal @variant {
_phantom: Int
}
type NodeOperationalStatusNoService @variant {
"Whether the state was set by the lead (true) or by the operator (false); when forced, the worker cannot unilaterally reverse it"
forced: Boolean!
}
type NodeOperationalStatusNoServiceFrom @variant {
"Whether the state was set by the lead (true) or by the operator (false); when forced, the worker cannot unilaterally reverse it"
forced: Boolean!
"The time from which the bucket will have no service"
from: DateTime!
}
type NodeOperationalStatusNoServiceDuring @variant {
"Whether the state was set by the lead (true) or by the operator (false); when forced, the worker cannot unilaterally reverse it"
forced: Boolean!
"The time from which the bucket will have no service"
from: DateTime!
"The time until which the bucket will have no service"
to: DateTime!
}
# Union of the variants above, referenced by the operator metadata entities below
union NodeOperationalStatus = NodeOperationalStatusNormal | NodeOperationalStatusNoService | NodeOperationalStatusNoServiceFrom | NodeOperationalStatusNoServiceDuring
type StorageBucketOperatorMetadata @entity {
"Optional node operational status"
nodeOperationalStatus: NodeOperationalStatus
# Other additional metadata ...
}
type DistributionBucketOperatorMetadata @entity {
"Optional node operational status"
nodeOperationalStatus: NodeOperationalStatus
# Other additional metadata ...
}
The following two event entities would be created, i.e. `StorageNodeOperationalStatusSetEvent` & `DistributionNodeOperationalStatusSetEvent`:
type StorageNodeOperationalStatusSetEvent implements Event @entity {
### GENERIC DATA ###
"(network}-{blockNumber}-{indexInBlock}"
id: ID!
"Hash of the extrinsic which caused the event to be emitted"
inExtrinsic: String
"Blocknumber of the block in which the event was emitted."
inBlock: Int!
"Network the block was produced in"
network: Network!
"Index of event in block from which it was emitted."
indexInBlock: Int!
### SPECIFIC DATA ###
"Storage bucket"
bucket: StorageBucket
"Storage bucket operator"
operator: StorageBucketOperator
"Related opening"
operationalStatus: BucketOperationalStatus!
}
type DistributionNodeOperationalStatusSetEvent implements Event @entity {
### GENERIC DATA ###
"(network}-{blockNumber}-{indexInBlock}"
id: ID!
"Hash of the extrinsic which caused the event to be emitted"
inExtrinsic: String
"Blocknumber of the block in which the event was emitted."
inBlock: Int!
"Network the block was produced in"
network: Network!
"Index of event in block from which it was emitted."
indexInBlock: Int!
### SPECIFIC DATA ###
"Distribution bucket"
bucket: DistributionBucket
"Distribution bucket operator"
operator: DistributionBucketOperator
"Related opening"
nodeOperationalStatus: NodeOperationalStatus!
}
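For illustration (not part of the spec), an app or lead could then read the indexed status with a query like the one below; the query-node URL is a placeholder, and the root `storageBuckets` field and `operatorMetadata` relation are assumed to follow the existing storage-bucket schema.

```typescript
// Sketch of a consumer-side read against the entities proposed above.
const QUERY_NODE_URL = 'https://query-node.example/graphql' // placeholder endpoint

const STORAGE_BUCKET_STATUS_QUERY = `
  query {
    storageBuckets {
      id
      operatorMetadata {
        nodeOperationalStatus {
          __typename
          ... on NodeOperationalStatusNoService { forced }
          ... on NodeOperationalStatusNoServiceFrom { forced from }
          ... on NodeOperationalStatusNoServiceDuring { forced from to }
        }
      }
    }
  }
`

async function fetchStorageBucketStatuses(): Promise<unknown> {
  const res = await fetch(QUERY_NODE_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: STORAGE_BUCKET_STATUS_QUERY }),
  })
  return (await res.json()).data
}
```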
## Orion Changes
Where should the mappings for this feature be added: QN, Orion, or both? Argus & Colossus are integrated with QN, so if Argus or Colossus depend on the indexed operational status (i.e. they need to know about operational peers for syncing purposes), then QN would need these mappings. If Orion needs to know about the operational status of the node (i.e. for resolving assets), then Orion would also need those mappings. In short, I think both QN & Orion need to include mappings for this feature, WDYT?
## Protobuf Changes
message NodeOperationalStatusMetadata {
enum OperationalStatus {
// Node is operating normally
NORMAL = 0;
// Node is not operational
NO_SERVICE = 1;
// Node won't be operational from the given date
NO_SERVICE_FROM = 2;
// Node won't be operational during the given date range
NO_SERVICE_DURING = 3;
}
// Node's Operational status to set
optional OperationalStatus status = 1;
// Date from which the node won't be operational (Should be set if status is NoServiceFrom or NoServiceDuring)
optional string no_service_from = 2;
// Date until which the node won't be operational (Should be set if status is NoServiceDuring)
optional string no_service_to = 3;
// Rationale for setting the current status
optional string rationale = 4;
}
// Added `operational_status` field to already existing `StorageBucketOperatorMetadata` message
message StorageBucketOperatorMetadata {
// ... (Other node operator metadata)
optional NodeOperationalStatusMetadata operational_status = 4; // Node's operational status to set
}
message SetNodeOperationalStatus {
optional NodeOperationalStatusMetadata operational_status = 1; // Node's operational status to set
optional string worker_id = 2; // Storage/Distribution Worker ID
optional string bucket_id = 3; // Storage/Distribution Bucket ID
}
message LeadRemarked {
// lead_remark extrinsic would emit event containing
// any one of the following serialized messages
oneof lead_remarked {
SetNodeOperationalStatus set_node_operational_status = 1;
// Other lead remark messages
// ...
}
}
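To make the payload construction concrete before describing the extrinsic flow below, here is an assumption-heavy sketch of how a lead-side client could serialize the remark: it assumes the messages above get added to the metadata-protobuf package and are compiled with its usual protobufjs static target (so each message exposes `create`/`encode(...).finish()` helpers), with field names camelCased as in the existing generated code.

```typescript
// Sketch only: `LeadRemarked` and `NodeOperationalStatusMetadata` are assumed
// exports for the new messages, following the package's protobufjs-generated API.
import { LeadRemarked, NodeOperationalStatusMetadata } from '@joystream/metadata-protobuf'

// Lead marks the host of worker 3 / bucket 5 as out of service, with a rationale.
const remarkBytes: Uint8Array = LeadRemarked.encode(
  LeadRemarked.create({
    setNodeOperationalStatus: {
      workerId: '3',
      bucketId: '5',
      operationalStatus: {
        status: NodeOperationalStatusMetadata.OperationalStatus.NO_SERVICE,
        rationale: 'Scheduled hardware maintenance',
      },
    },
  })
).finish()

// `remarkBytes` is what would be submitted as the message of the working group's
// lead_remark extrinsic for the corresponding (storage or distribution) group.
```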
- Leads would be able to set the host to any operational status using the `LeadRemarked::SetNodeOperationalStatus` protobuf message (which would include the `worker_id` & `bucket_id` for which the status needs to be set) over the working group's `lead_remark` extrinsic.
- Workers would use the simpler `NodeOperationalStatusMetadata` message, either over the exclusive `*_operator_remark` extrinsics, i.e. `storage_operator_remark` and `distribution_operator_remark`, or over the `set_*_operator_metadata` extrinsics, i.e. `set_storage_operator_metadata` and `set_distribution_operator_metadata`, to set the operational status of their own host (a worker-side sketch follows below).
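A corresponding worker-side sketch using the `set_storage_operator_metadata` route is below; the extrinsic parameter order and the metadata-protobuf exports are assumptions to verify against the actual runtime and package.

```typescript
// Sketch only: a storage operator setting its own status. The extrinsic signature
// (workerId, bucketId, metadata) and the StorageBucketOperatorMetadata export are assumptions.
import { ApiPromise, WsProvider } from '@polkadot/api'
import { StorageBucketOperatorMetadata, NodeOperationalStatusMetadata } from '@joystream/metadata-protobuf'

async function announceMaintenanceWindow(workerId: number, bucketId: number, fromIso: string, toIso: string) {
  // Joystream type registration omitted for brevity.
  const api = await ApiPromise.create({ provider: new WsProvider('ws://localhost:9944') })

  // Only the operational_status field is populated here; whether other metadata
  // fields are preserved depends on how the mapping merges the message.
  const metadataBytes = StorageBucketOperatorMetadata.encode({
    operationalStatus: {
      status: NodeOperationalStatusMetadata.OperationalStatus.NO_SERVICE_DURING,
      noServiceFrom: fromIso,
      noServiceTo: toIso,
      rationale: 'Planned node upgrade',
    },
  }).finish()

  const tx = api.tx.storage.setStorageOperatorMetadata(workerId, bucketId, metadataBytes)
  // ... sign and send with the operator's key, e.g. await tx.signAndSend(operatorKeyPair)
}
```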
## CLI commands
The following CLI commands would be added/updated in both Argus & Colossus to set the node's operational status:
- Add `leader:set-node-operational-status`
- Update `operator:set-metadata`
## Integration tests
Integration tests need to be added for the mappings, both in the monorepo as well as in Orion.
Awesome work @zeeshanakram3, perhaps @mnaamani can review this?
Proposal looks great.
I would recommend using the `set_*_operator_metadata` approach for the workers, given that the mappings test individual fields of the message, so the operator can use it to just set the new `NodeOperationalStatusMetadata` field. It also makes logical sense because we are saving this status in the operator metadata.
As for the lead setting operators' metadata, it only makes sense to do it via the `lead_remark` extrinsic. However, I see you are introducing a new message called `LeadRemarked`; I assume this will replace the existing `RemarkMetadataAction`. I agree with the renaming, it makes more sense. This means we need to bump the major version of the metadata-protobuf package.
message RemarkMetadataAction {
oneof action {
ModeratePost moderate_post = 1;
}
}
Of course you will also need to add a new `NodeOperationalStatusMetadata` field to the `DistributionBucketOperatorMetadata` message.
In the mappings you will need to handle the case where the lead might run `leader:set-node-operational-status` before the operator sets their metadata. The mapping that handles the lead remark will need to validate `worker_id` and `bucket_id`, as they are not validated in the runtime (a rough sketch follows below).
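For example, the lead-remark mapping could guard roughly like this; entity names and import paths mirror existing QN conventions but are assumptions, not final code.

```typescript
// Hedged sketch of the validation the lead-remark mapping would need to do,
// since worker_id / bucket_id are not checked by the runtime.
import { DatabaseManager } from '@joystream/hydra-common'
import { StorageBucket } from 'query-node/dist/model' // assumed entity import path

export async function applySetNodeOperationalStatus(
  store: DatabaseManager,
  workerId: string,
  bucketId: string
  // the decoded NodeOperationalStatusMetadata would be passed in as well
): Promise<void> {
  const bucket = await store.get(StorageBucket, { where: { id: bucketId } })
  if (!bucket) {
    // Treat it like other invalid metadata: log and ignore rather than failing the block.
    console.warn(`SetNodeOperationalStatus: unknown bucket ${bucketId}, ignoring remark`)
    return
  }
  // Also verify that workerId is the bucket's current operator, and create the
  // operator metadata entity if the lead remark arrives before the operator has set it.
}
```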
I think it goes without saying that all DateTime values would be in UTC?
Will the value of `NodeOperationalStatus` in the database be the `normal` variant when the database is reset, or if it is not explicitly set?