Network/Protocol Interface Refactor (remove async)
Opening the PR to start taking notes...
Part of https://github.com/massalabs/massa/discussions/3149
`NetworkCommandSender` is currently used by:
- massa-api
- massa-bootstrap
- massa-protocol-worker
- massa-protocol-exports (testing only)
`NetworkEventReceiver` is currently used by:
- massa-protocol-worker
- massa-protocol-exports (shut-down of the manager only)
Re `NetworkEventReceiver`, its use points towards massa-protocol being the sole sink for events coming from a "previous pipeline step", the network. All the protocol-specific events, like `ReceivedBlockInfo`, could be simplified into a single `MessageReceived`, perhaps still containing the raw bytes, to be deserialized in a new step, which would then pass them on to massa-protocol-worker.
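To illustrate, a minimal sketch of such a collapsed event type; `NodeId` is a hypothetical placeholder for the real peer identifier, not the final API:

```rust
/// Hypothetical placeholder for the real peer identifier.
struct NodeId(u64);

/// Sketch: the protocol-specific events collapse into one generic
/// event carrying raw bytes; deserialization happens in a later
/// step, which then forwards the result to massa-protocol-worker.
enum NetworkEvent {
    MessageReceived { from: NodeId, data: Vec<u8> },
    ConnectionClosed(NodeId),
}
```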
Re `NetworkCommandSender`, I think we should minimize it to a lower-level "send message" type of interface, and move all the other stuff, like the massa-api calls, to be dealt with through massa-protocol-exports.
So, after having looked at the whole structure again, I think we should go for a more minimal, yet effective, refactoring, that would essentially consist of:
- Remove async from massa-network-exports.
- Make massa-network-worker a thread.
- A: Make each `NodeWorker` a thread, where (de-)serialization of messages would take place as well.
- B: Potentially have each `NodeWorker` share some kind of `MessageSerializerThreadPool`, that could batch and parallelize serialization more efficiently.
- Make `Write/ReadBinders` an abstract interface to the raw networking (so `WriteBinder.send(&[u8])` instead of `send(&mut self, msg: &Message)`).
- Have a single `HandshakeWorker` manage all handshakes from a thread, using the abstract `Write/ReadBinders`.
That way we don't have to change too much across the whole project (we keep massa-network-exports as it is), and we can still swap network stacks using the abstract `Write/ReadBinders`.
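A rough sketch of what the abstract binders could look like; the trait names and signatures are assumptions, not the final API:

```rust
use std::io;

/// Sketch: a write binder sends pre-serialized bytes, so `Message`
/// serialization happens upstream (in the NodeWorker or a pool).
trait WriteBinder: Send {
    fn send(&mut self, data: &[u8]) -> io::Result<()>;
}

/// Sketch: a read binder yields the next complete message as raw
/// bytes, or `None` once the connection is closed.
trait ReadBinder: Send {
    fn next(&mut self) -> io::Result<Option<Vec<u8>>>;
}
```

Swapping network stacks would then just mean providing another implementation of these two traits.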
Please let me know what you think. @damip @massalabs/core-team
I think A is easier; also, does B really provide better performance than A?
I think A would be better in any case, because we also want to balance resources used by different connected nodes.
Here are some things that need to be factored in (plus the ones already listed in the discussion https://github.com/massalabs/massa/discussions/3149 ):
- make sure that pipes are as short as possible (e.g. not going through 5 channels for each message)
- prioritize block management over operation management (see discussion schematic for priority channels; a minimal sketch follows after this list)
- get rid of the "select loops": split every task into a separate thread that won't block any loop: block propagation, node management etc... so that they can all run in parallel and not slow down each other, see discussion
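Re the priority channels point, a minimal sketch of what a priority receive loop could look like, assuming std channels and hypothetical event types (the real design would follow the discussion schematic):

```rust
use std::sync::mpsc::{Receiver, TryRecvError};
use std::time::Duration;

// Hypothetical event types, for illustration only.
enum BlockEvent { Info(Vec<u8>) }
enum OperationEvent { Batch(Vec<u8>) }

fn handle_block(_ev: BlockEvent) { /* elided */ }
fn handle_operation(_ev: OperationEvent) { /* elided */ }

/// Drain the high-priority block channel before touching operations,
/// so an operation flood cannot starve block management.
fn run_loop(blocks: Receiver<BlockEvent>, ops: Receiver<OperationEvent>) {
    loop {
        match blocks.try_recv() {
            Ok(ev) => {
                handle_block(ev);
                continue; // keep draining blocks first
            }
            Err(TryRecvError::Disconnected) => return,
            Err(TryRecvError::Empty) => {}
        }
        // Take at most one operation event, with a short timeout so
        // the block channel is re-checked regularly.
        if let Ok(ev) = ops.recv_timeout(Duration::from_millis(10)) {
            handle_operation(ev);
        }
    }
}
```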
@damip @AurelienFT Yes I think A is the better initial approach.
So this PR is only about refactoring the current massa-network; block/operation stuff would be part of a refactor of massa-protocol.
get rid of the "select loops": split every task into a separate thread that won't block any loop: block propagation, node management etc... so that they can all run in parallel and not slow down each other, see discussion
The potential problems with the "separate thread for everything" approach are:
- You don't have enough CPUs to actually run things in parallel
- You probably have to share state so acquiring locks will have a very similar effect to simply running things sequentially on one thread.
- You are still limited by the network bandwidth, as well as the CPU needed for serialization.
Of course there can be opportunities to split things into threads, minimize shared state, shorten critical sections, have mostly readers, and so on...
Again, more of a massa-protocol thing I think, to be discussed separately.
@AurelienFT @damip @massalabs/core-team Ok so I propose to start the implementation phase of the changes described at https://github.com/massalabs/massa/pull/3272#issuecomment-1331875759 (using option A)
Ok folks, I am back this week; sorry I can't make it to the call today. I will be focusing on the below in this PR:
- Re-add threaded timer resets in protocol worker.
- Threaded NodeWorkers.
- Threaded HandshakeWorker (singular); a rough sketch follows below.
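One possible shape for the singular handshake worker; the channel types, `NodeId`, and `perform_handshake` are hypothetical stand-ins, and the real version would run over the abstract binders:

```rust
use std::net::TcpStream;
use std::sync::mpsc::{Receiver, Sender};
use std::thread;

/// Hypothetical placeholder for the real peer identifier.
struct NodeId(u64);

enum HandshakeOutcome {
    Established(NodeId, TcpStream),
    Failed,
}

/// Hypothetical handshake routine; the actual logic is elided here.
fn perform_handshake(stream: TcpStream) -> HandshakeOutcome {
    HandshakeOutcome::Established(NodeId(0), stream)
}

/// One thread owns all handshakes: new connections arrive over a
/// channel, and completed handshakes are forwarded to the network
/// worker, which can then spawn a NodeWorker thread for the peer.
fn spawn_handshake_worker(
    new_conns: Receiver<TcpStream>,
    done: Sender<HandshakeOutcome>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while let Ok(stream) = new_conns.recv() {
            if done.send(perform_handshake(stream)).is_err() {
                break; // the network worker has shut down
            }
        }
    })
}
```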
@massalabs/core-team
Is everything going well?
Is everything going well?
Yes, let's discuss progress on Monday...
TODO:
- Node connection close/banning; shutdown of node workers on system shutdown.
- `try_send_peer_list_in_handshake`
- protocol timer logic
- test suite
TODO:
- Make tests compile
- general cleanup (config values for channels, etc...)
@massalabs/core-team
Ready for review, although following the rebase I'm not sure about the test suite anymore, see https://github.com/massalabs/massa/issues/3465
Could someone please help me double-check the changes to the timer logic in protocol_worker?
Will also be taking another look at docs...
@gterzian We are having a hard time reviewing the PR, as there are some design choices and decisions that we are not aware of. Maybe we can find some of the information in the discussion, issue/PR, but it's hard to follow the flow of a PR that has been open for 2 months.
Do you mind making a presentation after our weekly meeting on Monday, so that we are all more aware of the high-level context and why you made some design choices (for example, keeping tokio for low-level networking)?
Do you mind making a presentation after our weekly meeting on Monday?
@AurelienFT good idea.
After the rebase, two tests are still failing (intermittently?). I'm looking into those:
```
---- tests::in_block_operations_scenarios::test_protocol_does_propagate_operations_received_in_blocks stdout ----
thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', massa-protocol-worker/src/tests/in_block_operations_scenarios.rs:92:26
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'tests::in_block_operations_scenarios::test_protocol_does_propagate_operations_received_in_blocks' panicked at 'called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(34), ...)', massa-protocol-worker/src/tests/in_block_operations_scenarios.rs:97:18

---- tests::in_block_operations_scenarios::test_protocol_sends_blocks_with_operations_to_consensus stdout ----
thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', massa-protocol-worker/src/tests/in_block_operations_scenarios.rs:203:30
thread 'tests::in_block_operations_scenarios::test_protocol_sends_blocks_with_operations_to_consensus' panicked at 'called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(41), ...)', massa-protocol-worker/src/tests/in_block_operations_scenarios.rs:208:22
```
After the rebase, two tests are still failing (intermittently?).
These are indeed intermittent failures, and I believe they will go away once we do https://github.com/massalabs/massa/issues/3468, so I will leave it as it is for now.
Thanks for your review @AurelienFT
Wasn't the goal to have a thread per connection that manages the connection from handshake to disconnect?
While it may have been part of the discussion, I think the current changes fit better into what is already there, and they are already quite large as it is.
I don't think it's a good idea to have one thread manage the entire lifetime of a connection, because in fact there are multiple concerns, such as the initial handshake, the active node, and so on, and these are best managed separately. But I could discuss a more concrete proposal as a follow-up.
With per-connection threads, a connection can't be blocked by a task in the pipeline that has too much load.
I think the current structure also enjoys that benefit, for example each handshake is spawned as a separate task.
I don't think it's a good idea to have one thread manage the entire lifetime of a connection, because in fact there are multiple concerns, such as the initial handshake, the active node, and so on, and these are best managed separately.
Looking at @damip's proposal, it's not a single thread for the whole life of a connection, but a thread per connection.
I think the current structure also enjoys that benefit, for example each handshake is spawned as a separate task.
Good point, I misunderstood a part of it.
I don't think I have enough knowledge to know whether one thread for each "live" connection is better than your "pipeline" suggestion. I just find it more optimized (there is too much message passing otherwise) and more readable, but it's not a strong opinion at all.
I don't think I have enough knowledge to know whether one thread for each "live" connection is better than your "pipeline" suggestion. I just find it more optimized (there is too much message passing otherwise) and more readable, but it's not a strong opinion at all.
Let's discuss this in a follow-up stage; this PR is the minimal first step, which retains a lot of the existing structures (node worker, handshake worker), and adds some minimal new ones (handshake manager).
Also note that there is now one thread per node worker, which is the "active connection".
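For context, a minimal sketch of that per-node-worker thread shape, with hypothetical command/event types (the real worker also owns its binders and does the (de-)serialization on this thread):

```rust
use std::sync::mpsc::{Receiver, Sender};
use std::thread;

// Hypothetical command/event types, for illustration only.
enum NodeCommand {
    SendMessage(Vec<u8>),
    Close,
}
enum NodeEvent {
    Closed,
}

/// One OS thread per active connection: a slow peer only blocks its
/// own worker, never the other connections.
fn spawn_node_worker(
    commands: Receiver<NodeCommand>,
    events: Sender<NodeEvent>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while let Ok(cmd) = commands.recv() {
            match cmd {
                NodeCommand::SendMessage(bytes) => {
                    // serialize and write through the node's
                    // WriteBinder (elided in this sketch)
                    let _ = bytes;
                }
                NodeCommand::Close => break,
            }
        }
        let _ = events.send(NodeEvent::Closed);
    })
}
```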
I'm working on this https://github.com/massalabs/massa/pull/3272#discussion_r1095884149 and after that I think we can try to merge.
Left a comment on the initial discussion https://github.com/massalabs/massa/discussions/3149#discussioncomment-4862894
after that I think we can try to merge.
I think you should look at Thomas's comment first. Also, did you test the PR in a simulated network to see that it doesn't actually break anything?
Could these intermittent failures be related to https://github.com/massalabs/massa/issues/3509?
Could these intermittent failures be related to https://github.com/massalabs/massa/issues/3509?
These seem to be different tests, so I don't know. Do you have an idea?
test the PR in a simulated network to see that it doesn't actually break anything
How does this work again? Can you remind me, or point me to the docs? @AurelienFT
test the PR in a simulated network to see that it doesn't actually break anything
How does this work again? Can you remind me, or point me to the docs? @AurelienFT
You can launch a labnet with the Python script in the private-deployment repository.
@AurelienFT is the testnet deployment description valid for the labnet as well? Once it's been deployed, can you interact with it through a client, as described at https://docs.massa.net/en/latest/testnet/running.html ?
@gterzian Hello,
No, you just need to modify the commit in labnet/deploy.py and launch deploy.py. @Ben-PH used it for his bootstrap PR. Btw, I sent you an email, if you can check it out :)
@gterzian
The addresses in labnet deploy are out of date (since the user/sc address enum distinction was merged).
I'll make a PR updating them now.
EDIT: You can use the branch in this PR: https://github.com/massalabs/private-deployment/pull/2/
EDIT: You can use the branch in this PR: https://github.com/massalabs/private-deployment/pull/2
Thanks. Are you done using it for bootstrap? Can I deploy this on labnet instead?
As explained by mail, we will take back the work on this network refactor, as we have diverged too much on this PR. I am placing it as a draft to keep it as a resource for the refactor we will perform in the coming months. Thanks for the work put into drafting this!