rust-libp2p icon indicating copy to clipboard operation
rust-libp2p copied to clipboard

protocols/mdns: Allow users to choose between async-io and tokio runtime

Open gallegogt opened this issue 3 years ago • 18 comments
trafficstars

Problem

When use the Tokio library with mdns module, it is polling every time the UDP socket; causing hight CPU usage.

Solution

When using Tokio library, use the UppSocket provided by the Tokio library instead of the Async-io wrapper.

Breaking Change

After this fix exist two features async-io, tokio for this module, for the general module must be use the features mdns-async-io or mdns-tokio for async-io or tokio libraries

Links to any relevant issues

  • https://github.com/libp2p/rust-libp2p/issues/2675
  • https://github.com/libp2p/rust-libp2p/issues/2591

Change checklist

  • [x] I have performed a self-review of my own code
  • [x] I have made corresponding changes to the documentation
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [x] A changelog entry has been made in the appropriate crates

gallegogt avatar Jul 07 '22 16:07 gallegogt

Hi @thomaseizinge, thanks for the feedback, I will check it

gallegogt avatar Jul 08 '22 14:07 gallegogt

Related issue on if-watch used in libp2p-mdns: https://github.com/mxinden/if-watch/issues/21

//CC @dignifiedquire

mxinden avatar Jul 11 '22 02:07 mxinden

Thanks @gallegogt for the thorough patch and thanks @thomaseizinger for the detailed reviews!

mxinden avatar Jul 11 '22 04:07 mxinden

@gallegogt let us know once this is ready for another review.

mxinden avatar Jul 14 '22 08:07 mxinden

Hi @mxinden it's ready for review

gallegogt avatar Jul 14 '22 12:07 gallegogt

Hi @mxinden @thomaseizinger it's ready for review

gallegogt avatar Jul 18 '22 04:07 gallegogt

The test_expired_tokio test is failing. Can you look into that @gallegogt?

thomaseizinger avatar Jul 20 '22 14:07 thomaseizinger

The test_expired_tokio test is failing. Can you look into that @gallegogt?

I'll check it, but in my test environments work, I will try in other PC.

gallegogt avatar Jul 20 '22 17:07 gallegogt

The test_expired_tokio test is failing. Can you look into that @gallegogt?

I'll check it, but in my test environments work, I will try in other PC.

It also fails on my local computer. I am running Manjaro Linux.

thomaseizinger avatar Jul 20 '22 19:07 thomaseizinger

The test_expired_tokio test is failing. Can you look into that @gallegogt?

I'll check it, but in my test environments work, I will try in other PC.

It also fails on my local computer. I am running Manjaro Linux.

@thomaseizinger , @mxinden My environment:

ARM

Description:    Ubuntu 18.04.6 LTS
Release:        18.04
Codename:       bionic

rustc 1.62.0 (a8314ef7d 2022-06-27)

image_2022-07-21_105521367

gallegogt avatar Jul 21 '22 03:07 gallegogt

Failing on my machine as well:

➜  cargo test --all-features
    Finished test [unoptimized + debuginfo] target(s) in 0.17s
     Running unittests src/lib.rs (/home/mxinden/code/github.com/libp2p/rust-libp2p/target/debug/deps/libp2p_mdns-aeea3dee3d0f3ca4)

running 5 tests
test behaviour::iface::dns::tests::build_service_discovery_response_correct ... ok
test behaviour::iface::dns::tests::build_query_correct ... ok
test behaviour::iface::dns::tests::test_random_string ... ok
test behaviour::iface::query::tests::test_create_mdns_peer ... ok
test behaviour::iface::dns::tests::build_query_response_correct ... ok

test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running tests/use-async-std.rs (/home/mxinden/code/github.com/libp2p/rust-libp2p/target/debug/deps/use_async_std-77e86062dfdf653f)

running 3 tests
test test_discovery_async_std_ipv4 ... ok
test test_discovery_async_std_ipv6 ... ok
test test_expired_async_std ... ok

test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.03s

     Running tests/use-tokio.rs (/home/mxinden/code/github.com/libp2p/rust-libp2p/target/debug/deps/use_tokio-278cda36b01406d1)

running 3 tests
test test_expired_tokio ... FAILED
test test_discovery_tokio_ipv4 ... ok
test test_discovery_tokio_ipv6 ... ok

failures:

---- test_expired_tokio stdout ----
thread 'test_expired_tokio' panicked at 'called `Result::unwrap()` on an `Err` value: Elapsed(())', protocols/mdns/tests/use-tokio.rs:55:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    test_expired_tokio

test result: FAILED. 2 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 9.15s

error: test failed, to rerun pass '--test use-tokio'

mxinden avatar Jul 22 '22 09:07 mxinden

Failing on my machine as well:


➜  cargo test --all-features

    Finished test [unoptimized + debuginfo] target(s) in 0.17s

     Running unittests src/lib.rs (/home/mxinden/code/github.com/libp2p/rust-libp2p/target/debug/deps/libp2p_mdns-aeea3dee3d0f3ca4)



running 5 tests

test behaviour::iface::dns::tests::build_service_discovery_response_correct ... ok

test behaviour::iface::dns::tests::build_query_correct ... ok

test behaviour::iface::dns::tests::test_random_string ... ok

test behaviour::iface::query::tests::test_create_mdns_peer ... ok

test behaviour::iface::dns::tests::build_query_response_correct ... ok



test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s



     Running tests/use-async-std.rs (/home/mxinden/code/github.com/libp2p/rust-libp2p/target/debug/deps/use_async_std-77e86062dfdf653f)



running 3 tests

test test_discovery_async_std_ipv4 ... ok

test test_discovery_async_std_ipv6 ... ok

test test_expired_async_std ... ok



test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.03s



     Running tests/use-tokio.rs (/home/mxinden/code/github.com/libp2p/rust-libp2p/target/debug/deps/use_tokio-278cda36b01406d1)



running 3 tests

test test_expired_tokio ... FAILED

test test_discovery_tokio_ipv4 ... ok

test test_discovery_tokio_ipv6 ... ok



failures:



---- test_expired_tokio stdout ----

thread 'test_expired_tokio' panicked at 'called `Result::unwrap()` on an `Err` value: Elapsed(())', protocols/mdns/tests/use-tokio.rs:55:10

note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace





failures:

    test_expired_tokio



test result: FAILED. 2 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 9.15s



error: test failed, to rerun pass '--test use-tokio'

Very rare situation, I guess tokio (explicitly non_blocking) takes longer to find the peers than AsynIO. If so, I increase the expiration time and query request a little

gallegogt avatar Jul 22 '22 12:07 gallegogt

Very rare situation, I guess tokio (explicitly non_blocking) takes longer to find the peers than AsynIO. If so, I increase the expiration time and query request a little

That would be surprising to me, given that the timeout is on the order of seconds, still something worth trying @gallegogt.

mxinden avatar Jul 27 '22 10:07 mxinden

Hi @mxinden and @thomaseizinger can you check if the error persist please ?

gallegogt avatar Jul 27 '22 23:07 gallegogt

Hi @mxinden and @thomaseizinger can you check if the error persist please ?

Yep, CI is still failing.

thomaseizinger avatar Jul 28 '22 07:07 thomaseizinger

I can't reproduce it locally anymore :/

thomaseizinger avatar Jul 28 '22 16:07 thomaseizinger

The weird thing is that if I run cargo test --workspace --all-features -j1 it doesn't give me the error. :/

gallegogt avatar Jul 28 '22 17:07 gallegogt

The weird thing is that if I run cargo test --workspace --all-features -j1 it doesn't give me the error. :/

I think GitHub actions VMs have at least two cores.

Edit: Yep, two cores: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources

thomaseizinger avatar Jul 28 '22 19:07 thomaseizinger

Hi @thomaseizinger and @mxinden after some research I found that the tokio implementation take some time (about 10s ) to find the peers.

I refactored the expiration test function, where the timeout check is from when the peers are found

Atte.

gallegogt avatar Aug 22 '22 21:08 gallegogt

Hi @thomaseizinger and @mxinden after some research I found that the tokio implementation take some time (about 10s ) to find the peers.

I refactored the expiration test function, where the timeout check is from when the peers are found

Atte.

Nice! Thanks for digging into this. I've kicked off another test run, lets see what CI thinks :)

thomaseizinger avatar Aug 23 '22 17:08 thomaseizinger

It fails again, :(, I'll try again with a 2 cpu in docker.

We keep in touch, thanks

gallegogt avatar Aug 23 '22 20:08 gallegogt

hi @thomaseizinger, I found out that tokio::interval works different than AsyncIO::Timer::interval, the first one fires the tick now and then during the period and the second one fires after the duration, passed as a parameter.

I checked in my fork actions that the lastest changes works

gallegogt avatar Aug 24 '22 22:08 gallegogt

hi @thomaseizinger, I found out that tokio::interval works different than AsyncIO::Timer::interval, the first one fires the tick now and then during the period and the second one fires after the duration, passed as a parameter.

I checked in my fork actions that the lastest changes works

That is quite interesting, thanks for working this out! I kicked CI again :)

I'll give it a review too but currently travelling so please bear with me :D

thomaseizinger avatar Aug 25 '22 06:08 thomaseizinger

@thomaseizinger done with last comments.

gallegogt avatar Sep 01 '22 20:09 gallegogt

@thomaseizinger done with last comments.

Thanks!

Just one note for future PRs, we tend to not force-push on this repository because PRs will be squash-merged in the end so regular merge commits of e.g. master are completely fine :)

thomaseizinger avatar Sep 01 '22 22:09 thomaseizinger

I've kicked CI again and will merge once it passes!

thomaseizinger avatar Sep 01 '22 22:09 thomaseizinger

When use the Tokio library with mdns module, it is polling every time the UDP socket; causing hight CPU usage.

I've just discovered this PR. Did we ever find out what causes this? If not, it seems to me that this PR was done to bypass the issue, not fix it.

tomaka avatar Oct 20 '22 06:10 tomaka

See #2591 for my investigation into this: the short story is that epoll immediately returns.

rkuhn avatar Oct 20 '22 06:10 rkuhn