massa icon indicating copy to clipboard operation
massa copied to clipboard

Network/Protocol Interface Refactor(remove async)

Open gterzian opened this issue 2 years ago • 14 comments

Opening the PR to start taking notes...

Part of https://github.com/massalabs/massa/discussions/3149

gterzian avatar Nov 29 '22 08:11 gterzian

NetworkCommandSender is currently used by:

  • massa-api
  • massa-bootstrap
  • massa-protocol-worker
  • massa-protocol-exports(testing only)

NetworkEventReceiver is currently used by:

  • massa-protocol-worker
  • massa-protocol-exports(shut-down of the manager only)

Re NetworkEventReceiver, its use points towards massa-protocol being the sole sink for events that would be coming from a "previous pipeline step", the network.

All the protocol specific events, like ReceivedBlockInfo, could be simplified into a single MessageReceived, perhaps containing still the raw bytes, to be deserialized in a new step, which would then pass them on to massa-protocol-worker.

Re NetworkCommandSender, I think we should minimize it to a lower-level "send message" type of interface, and move all the other stuff, like the massa-api calls, to be dealt with through massa-protocol-exports

gterzian avatar Nov 29 '22 09:11 gterzian

So, after having looked at the whole structure again, I think we should go for a more minimal, yet effective, refactoring, that would essentially consist of:

  1. Remove async from massa-network-exports.
  2. Make massa-network-worker a thread
  3. A Make each NodeWorker a thread, where (de-)serialization of messages would take place as well.
  4. B Potentially have each NodeWorker share some kind of MessageSerializerThreadPool, that could batch and parallelize serialization more effeciently.
  5. Make Write/ReadBinders an abstract interface to the raw networking(so WriteBinder.send(&[u8]) instead of send(&mut self, msg: &Message)).
  6. Have a single HandshakeWorker manage all handshakes from a thread, using the abstract Write/ReadBinders.

That way we don't have to change too much across the whole project(we keep massa-network-exports as it is), and we can still swap network stacks using the abstract Write/ReadBinders.

Please let me know what you think. @damip @massalabs/core-team

gterzian avatar Nov 30 '22 09:11 gterzian

I think A is more easy and does B provide really better performance then A ?

AurelienFT avatar Nov 30 '22 09:11 AurelienFT

I think A would be better in any case, because we also want to balance resources used by different connected nodes.

Here are some things that need to be factored in (plus the ones already listed in the discussion https://github.com/massalabs/massa/discussions/3149 ):

  • make sure that pipes are as short as possible (eg. not going through 5 channels for each message)
  • prioritize block management over operation management (see discussion schematic for priority channels)
  • get rid of the "select loops": split every task into a separate thread that won't block any loop: block propagation, node management etc... so that they can all run in parallel and not slow down each other, see discussion

damip avatar Nov 30 '22 10:11 damip

@damip @AurelienFT Yes I think A is the better initial approach.

So this PR is only about the refactoring of the current massa-network, so block/operation stuff would be part of a refactor of massa-protocol.

get rid of the "select loops": split every task into a separate thread that won't block any loop: block propagation, node management etc... so that they can all run in parallel and not slow down each other, see discussion

The potential problems with the "separate thread for everyhting approach" are:

  1. You don't have enough CPUs to actually run things in parallel
  2. You probably have to share state so acquiring locks will have a very similar effect to simply running things sequentially on one thread.
  3. You are still limited by the network bandwitdh, as well as the CPU needed for serialization.

Off-course there can be opportunities to split things in thread, and minize shared state, shorten critical sections, have mostly readers, and so on...

Again more of a massa-protocol thing I think, to be discussed separately.

gterzian avatar Dec 01 '22 07:12 gterzian

@AurelienFT @damip @massalabs/core-team Ok so I propose to start the implementation phase of the changes described at https://github.com/massalabs/massa/pull/3272#issuecomment-1331875759 (using option A)

gterzian avatar Dec 05 '22 08:12 gterzian

Ok folks I am back this week, sorry I can't make it to the call today, will be focusing this week on the below in this PR:

  1. Re-add threaded timer resets in protocol worker.
  2. Threaded Nodeworkers.
  3. Threaded Handshakeworker(singular).

@massalabs/core-team

gterzian avatar Dec 19 '22 07:12 gterzian

Is everything going well ?

AurelienFT avatar Dec 22 '22 09:12 AurelienFT

Is everything going well ?

Yes, let's discuss progress on Monday...

gterzian avatar Dec 23 '22 07:12 gterzian

TODO:

  • Node connection close/banning/Shutdown of node workers on shutdown of system.
  • try_send_peer_list_in_handshake
  • protocol timer logic
  • test suite

gterzian avatar Dec 26 '22 07:12 gterzian

TODO:

  1. Make tests compile
  2. general cleanup(config values for channels, etc...)

gterzian avatar Jan 02 '23 08:01 gterzian

@massalabs/core-team

Ready for review, although following the rebase I'm not sure about the test suite anymore, see https://github.com/massalabs/massa/issues/3465

Could someone please help me double check the changes to the timer logic in protocol_worker?

Will also be taking another look at docs...

gterzian avatar Jan 25 '23 09:01 gterzian

@gterzian We are having hard time reviewing the PR as there is some design choices, decisions, that we are not aware of. Maybe we can find some of the information in the discussion, issue/PR but it's hard to follow the flow of a PR opened since 2 months.

Do you mind making a presentation after our weekly meeting on Monday ? So that we are all more aware of the high level context and why you made some design choices (for example keeping tokio for low level networking)

AurelienFT avatar Jan 26 '23 09:01 AurelienFT

Do you mind making a presentation after our weekly meeting on Monday ?

@AurelienFT good idea.

gterzian avatar Jan 26 '23 09:01 gterzian

After the rebase, two tests are still failing(intermittently?). I'm looking into those:

---- tests::in_block_operations_scenarios::test_protocol_does_propagate_operations_received_in_blocks stdout ----
thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', massa-protocol-worker/src/tests/in_block_operations_scenarios.rs:92:26
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'tests::in_block_operations_scenarios::test_protocol_does_propagate_operations_received_in_blocks' panicked at 'called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(34), ...)', massa-protocol-worker/src/tests/in_block_operations_scenarios.rs:97:18

---- tests::in_block_operations_scenarios::test_protocol_sends_blocks_with_operations_to_consensus stdout ----
thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', massa-protocol-worker/src/tests/in_block_operations_scenarios.rs:203:30
thread 'tests::in_block_operations_scenarios::test_protocol_sends_blocks_with_operations_to_consensus' panicked at 'called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(41), ...)', massa-protocol-worker/src/tests/in_block_operations_scenarios.rs:208:22

gterzian avatar Feb 02 '23 09:02 gterzian

After the rebase, two tests are still failing(intermittently?).

These are indeed intermittent failure, and I believe they will go away once we do https://github.com/massalabs/massa/issues/3468, so I will leave it as it is for now.

gterzian avatar Feb 02 '23 09:02 gterzian

Thanks for your review @AurelienFT

The goal wasn't to have a thread per connection that manage the connection from handshake to the the disconnect ?

While it may have been part of the discussion, I think the current changes fit better into what is already there, and the current changes are already quite large as it is.

I don't think it's a good idea to have one thread manage the entire lifeline of a connection, because in fact there are multiple concerns such as: the initial handshake, the active node, and so on, and these are best managed separately. But I could discuss a more concrete proposal as a follow-up.

In the per connections threads a connection can't be blocked by a task in the pipeline that has too much load.

I think the current structure also enjoys that benefit, for example each handshake is spawned as a separate task.

gterzian avatar Feb 03 '23 09:02 gterzian

I don't think it's a good idea to have one thread manage the entire lifeline of a connection, because in fact there are multiple concerns such as: the initial handshake, the active node, and so on, and these are best managed separately.

When looking at @damip proposal it's not a single thread of the whole life of a connection but it's a thread by connection.

I think the current structure also enjoys that benefit, for example each handshake is spawned as a separate task.

Good point I misunderstood a part of it.

I don't think I have enough knowledge to know if it's better to have one thread for each "live" connection than your "pipeline" suggestion I just find it more optimize as there is too much message passing and more readable but it's not a strong opinion at all.

AurelienFT avatar Feb 03 '23 14:02 AurelienFT

I don't think I have enough knowledge to know if it's better to have one thread for each "live" connection than your "pipeline" suggestion I just find it more optimize as there is too much message passing and more readable but it's not a strong opinion at all.

Let's discuss this in a follow-up stage, this PR is the minimal first step, which retains a lot of the existing structures(node worker, handshake worker), and adds some minimal new ones(handshake manager).

Also note that there is now one thread per node worker, which is the "active connection".

I'm working on this https://github.com/massalabs/massa/pull/3272#discussion_r1095884149 and after I think we can try to merge.

gterzian avatar Feb 06 '23 09:02 gterzian

Left a comment on the initial discussion https://github.com/massalabs/massa/discussions/3149#discussioncomment-4862894

Eitu33 avatar Feb 06 '23 09:02 Eitu33

after I think we can try to merge.

I think you can look at Thomas's comment before and also did you tested the PR in a network simulated to see if it actually don't break anything ?

AurelienFT avatar Feb 06 '23 10:02 AurelienFT

Could these intermittant failures be related to https://github.com/massalabs/massa/issues/3509?

Ben-PH avatar Feb 10 '23 16:02 Ben-PH

Could these intermittant failures be related to https://github.com/massalabs/massa/issues/3509?

These seem to be different tests, so I don't know. Do you have an idea?

gterzian avatar Feb 13 '23 07:02 gterzian

tested the PR in a network simulated to see if it actually don't break anything

How does this work again? Can you remind me, or point to to the docs? @AurelienFT

gterzian avatar Feb 13 '23 08:02 gterzian

tested the PR in a network simulated to see if it actually don't break anything

How does this work again? Can you remind me, or point to to the docs? @AurelienFT

You can launch a labnet with the python script on private-deployment repository.

AurelienFT avatar Feb 13 '23 11:02 AurelienFT

@AurelienFT is the testnet deployment description valid for the labnet as well? Once it's been deployed you can interact with it throught a client, as described at https://docs.massa.net/en/latest/testnet/running.html ?

gterzian avatar Feb 15 '23 11:02 gterzian

@gterzian Hello,

No you just need to modify the commit in labnet/deploy.py and launch deploy.py @Ben-PH use it for his bootstrap PR. Btw, I sent you an email if you can check it out :)

AurelienFT avatar Feb 15 '23 12:02 AurelienFT

@gterzian

The addresses in labnet deploy are out of date (since the user/sc address enum distinction was merged).

I'll make a PR updating them now.

EDIT: You can use the branch in this PR: https://github.com/massalabs/private-deployment/pull/2/

Ben-PH avatar Feb 15 '23 17:02 Ben-PH

EDIT: You can use the branch in this PR: https://github.com/massalabs/private-deployment/pull/2

Thanks. Are you done using it for bootstrap? Can I deploy this on labnet instead?

gterzian avatar Feb 16 '23 08:02 gterzian

As explained by mail, we will take the work back on this network refactor as we diverged too much on this PR. I place it as draft to keep it as resources for the refactor we will perform in the coming months. Thanks for the work put in place to draft this !

AurelienFT avatar Feb 16 '23 10:02 AurelienFT