Storage bags replication risk
Problem:
- Replication of a channel/bag to buckets/nodes only happens once, when the channel/bag is initially created.
- As buckets/nodes fail/retire and new buckets/nodes are introduced, the channel/bag stays attached only to the remaining active buckets/nodes of the initial set.
- This creates a risk of data loss if all of the initial replication buckets/nodes fail/retire.
Solution:
Every time 'yarn storage-node leader:update-bag' is issued:
- Option 1 (preferred): Automatically rebalance the replication, and error if not enough buckets/nodes are available.
- Option 2: Force the lead to rebalance by erroring, when removing buckets/nodes from a channel/bag, if the configured minimum replication is not met.
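As an illustration, here is a minimal TypeScript sketch of what the Option 2 check could look like. The `Bag` shape, the `minReplication` parameter, and the force/override flag are hypothetical and used only for illustration; this is not the actual storage-node CLI code.

```typescript
// Hypothetical sketch of the Option 2 check, not the actual CLI implementation.
interface Bag {
  id: string
  storedBy: Set<number> // bucket IDs currently assigned to this bag (assumed shape)
}

function removeBucketsFromBag(
  bag: Bag,
  bucketsToRemove: number[],
  minReplication: number, // assumed to come from lead/CLI configuration
  force = false // hypothetical override flag, as suggested later in this thread
): void {
  const remaining = [...bag.storedBy].filter((id) => !bucketsToRemove.includes(id))

  // Error instead of silently leaving the bag under-replicated.
  if (remaining.length < minReplication && !force) {
    throw new Error(
      `Removing bucket(s) [${bucketsToRemove.join(', ')}] would leave bag ${bag.id} ` +
        `with ${remaining.length} replica(s), below the configured minimum of ${minReplication}. ` +
        `Assign a replacement bucket first, or pass the override flag.`
    )
  }

  bag.storedBy = new Set(remaining)
}
```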
Solution:
Can you add some information on how urgent this problem is? Are we seeing mistakes happen now, or do we expect this to occur in the near future?
Automatically rebalance the replication, and error if not enough buckets/nodes are available.
Can you be more precise about what this would mean exactly? An example would be great.
Force the lead to rebalance by erroring, when removing buckets/nodes from a channel/bag, if the configured minimum replication is not met.
In general, I think it is better to warn the lead very loudly rather than automatically doing things that were not desired. Or, as an alternative, prevent the action entirely, but that could be super annoying if the lead really wants to do something, say in a testing environment or somewhere circumstances are different. Perhaps there could be some sort of override flag to force it; that is a decent compromise.
I would say the current system lacks resilience and depends on the competence and commitment of the Storage lead, and their understanding of how things work. We need to think about depth of resiliency: a system that keeps operating even if multiple failures happen. The system should assume that nodes will fail.
In short: 1- With the current implementation, users will eventually encounter failed uploads quite often. 2- The current system requires a higher time commitment from the lead. 3- I do not think we should go to mainnet before a meaningful improvement.
Force the lead to rebalance by erroring, when removing buckets/nodes from a channel/bag, if the configured minimum replication is not met.
Example: bucket/node 17 fails, so the lead has to remove all the bags assigned to it so that users will not encounter failed uploads to it. The command should enforce one of the following options: 1- The command automatically replaces the bucket with another one that is currently accepting bags if the configured replication limit is not met. 2- The command requires the lead to specify the replacement bucket if the configured replication limit is not met.
It would also be nice to have a flag that disables replication for a specific bag/channel.
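As an illustration of option 1 above (automatic replacement), here is a minimal TypeScript sketch. The `Bucket` shape and its `acceptingNewBags`/`bagLimit` fields are assumptions for illustration, not actual runtime or CLI types.

```typescript
// Hypothetical sketch of picking a replacement bucket for a bag when one of
// its current buckets (e.g. bucket 17) is being removed.
interface Bucket {
  id: number
  acceptingNewBags: boolean // bucket is currently accepting new bag assignments
  assignedBags: number
  bagLimit: number
}

function pickReplacementBucket(candidates: Bucket[], excludeIds: Set<number>): Bucket {
  const eligible = candidates.filter(
    (b) => b.acceptingNewBags && !excludeIds.has(b.id) && b.assignedBags < b.bagLimit
  )
  if (eligible.length === 0) {
    // Mirrors "error if not enough buckets/nodes are available".
    throw new Error('No bucket currently accepting bags is available as a replacement')
  }
  // Prefer the least-loaded eligible bucket to spread bags evenly.
  return eligible.reduce((best, b) => (b.assignedBags < best.assignedBags ? b : best))
}
```

The selection simply prefers the least-loaded eligible bucket and errors when none is available; the alternative (option 2 above) would instead require the lead to name the replacement bucket explicitly.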
Example: bucket/node 17 fails
Sorry, what does that mean? Fails in what way? If the node dies or something, the lead can simply switch out the operator with a new one, and the bucket can be disabled in the interim (I believe). There should be no need to update the bags in any way. Can you perhaps add more context here?
Failure is expected with servers:
- Network/DC risk
- Server issues
- Operational mistakes
In theory it is simple, but in reality switching the operator is a lot of commands. Again, this assumes a competent lead who will switch the operator rather than just remove the failing one (a behavior that can eventually lead to homeless bags as nodes fail).
In theory it is simple, but in reality switching the operator is a lot of commands
Can you be more specific? It's basically one extrinsic for the lead; what is the problem?
Again, this assumes a competent lead
I think we cannot assume the system can work with incompetent leads; if the person does not know how the system works and what they can do, then obviously they can't do a good job. I do agree we should try to make their job easier with good documentation and tools, but these improvements have to be evaluated one by one.
To be honest, rereading this issue, I still don't fully understand what problems we need to fix.
- This is not true, replication occurs when a new operator is assigned a bag =>
Replication of a channel/bag to buckets/nodes only happens once, when the channel/bag is initially created.
- Not sure what this really means =>
As buckets/nodes fail/retire and new buckets/nodes are introduced, the channel/bag stays attached only to the remaining active buckets/nodes of the initial set.
- This is not a good idea, because the tool will not know information about who is prepared to do what at what time =>
Option 1 (preferred)
- This is not a good idea for the reason I already described =>
Option 2: Force the lead to rebalance by erroring, when removing buckets/nodes from a channel/bag, if the configured minimum replication is not met.
I think we probably need a storage team meeting to settle some of these questions.
I think there is a misunderstanding here. I am not talking about file replication, but rather about assigning a new replication operator.
Regarding "This is not a good idea":
This is how the current industry works!
Regarding "This is not a good idea": This is how the current industry works!
You have to deal with this fundamental point: the tool will not know information about who is prepared to do what at what time.
I don't know what specifically you are referring to when you say "current industry", but this may be from contexts where one actor actually owns all relevant infrastructure and can know/guarantee their preparedness and availability.
Either way, most of the value here is simply making the lead aware, not doing the work, because that part is easy.
Closing this discussion for now, replaced with this enhancement: https://github.com/Joystream/joystream/issues/4444