narwhal Refactor BlockRemover Component

There seems to be a lot of similarities with how BlockRemover & BlockWaiter operate and are used by GetCollections & RemoveCollections. Because of this there seems to be some items that can be refactored and shared.

Combine BlockErrorType into one shared type

// BlockWaiter
pub enum BlockErrorType {
    BlockNotFound,
    BatchTimeout,
    BatchError,
}

// BlockRemover
pub enum BlockErrorType {
    Timeout,
    Failed,
}

Proposed Shared BlockErrorType

 pub enum BlockErrorType {
    BlockNotFound,
    BatchTimeout, // maps to block_remover::BlockErrorType::Timeout 
    BatchError, // maps to block_remover::BlockErrorType::Failed 
}

Change return type of BlockRemover to be the same as BlockWaiter

We should return an error per collection that is attempted to be removed, instead of one error for the batch of collections.

// BlockRemover
pub struct BlockRemoverError {
    pub ids: Vec<CertificateDigest>,
    pub error: BlockErrorType,
}

// BlockWaiter
pub struct BlockError {
    pub id: CertificateDigest,
    pub error: BlockErrorType,
}

Change tx_remove_block/rx_remove_block in the diagram below to a oneshot channel instead of a mpsc channel.

We only send one list of collections in to the gRPC endpoint and then in turn to the BlockRemover. Oneshot channel will suffice here. Similar to how BlockWaiter does it

Narwhal Channel Flow-RemoveCollections drawio

May 10 '22 18:05 arun-koshy

/cc @akichidis

May 13 '22 13:05 huitseeker

I'll start from the end to the top on the above points:

Change tx_remove_block/rx_remove_block in the diagram below to a oneshot channel instead of a mpsc channel.

Yes, we should absolutely do that

Change return type of BlockRemover to be the same as BlockWaiter We should return an error per collection that is attempted to be removed, instead of one error for the batch of collections.

We can definitely do that, but we might want to see how far we want to go with this. Basically, if we want to do it just on the API level (meaning instead of sending one message back , just send N ones) then it should be easy. Otherwise, we might need to perform a deeper refactoring if we want to ensure that block removal is treated in a non batch fashion, but rather independently. block_remover is working primarily in a batched fashion in various places (e.x when we issue delete requests to delete the payload from the workers we do it for multiple certificates and are multiplexed) and an error there will propagate to the top without allowing continue with deletion on single entities. That being said, some light re-design will be needed to achieve this properly

Combine BlockErrorType into one shared type

 pub enum BlockErrorType {
    BlockNotFound,
    BatchTimeout, // maps to block_remover::BlockErrorType::Timeout 
    BatchError, // maps to block_remover::BlockErrorType::Failed 
}

I am a bit skeptical with this in a sense that as those components might evolve in the future, other errors/necessities might be surfaced making it then harder to evolve our gRPC api since we have coupled the responses of those endpoints. That being said, if we go forward with aligning the errors, I would first reduce them to a smaller more abstract subset and then do the unification. For example, the BatchTimeout for the get_collections could be a generic now Timeout. So would look for a proto struct like:

pub enum BlockErrorType {
   NotFound,
   Timeout,
   Error
}

@arun-koshy @huitseeker thoughts?

May 16 '22 12:05 akichidis