js-waku icon indicating copy to clipboard operation
js-waku copied to clipboard

addObserver eventually stops observing

Open dmh1974 opened this issue 2 years ago • 24 comments

so far i have only seen this happen with js-waku used in a browser (have not tested node)

..and i am currently using the testnet (since prod has been fritzy)

i can reproduce the results here.. usually stops around the 10min mark...

https://gateway.pinata.cloud/ipfs/QmUS84DM9wLhmXnVf1EB41uoGc4HobTPc9W2CMAUWQRMFs?topic=BTCUSDT

dmh1974 avatar May 19 '22 06:05 dmh1974

fyi, backgrounding tabs exacerbates the issue as forgrounding does not appear to gracefully recover the connection

dmh1974 avatar May 19 '22 06:05 dmh1974

fyi, this may be related to PING/PONG frames on the ws connection.

PING alone is not enough the other end must reply with a PONG frame as well.

also note that PING/PONG frames are not required to keep a ws alive unless one end is depending on it.. the other must recognize a ping frame and respond with a pong frame.

https://stackoverflow.com/questions/46111656/must-websockets-have-heartbeats

dmh1974 avatar May 20 '22 19:05 dmh1974

fyi, addObserver is stable when used with node

dmh1974 avatar May 21 '22 22:05 dmh1974

Thx, will investigate.

D4nte avatar May 23 '22 04:05 D4nte

I suspect that this is the same issue we're having in https://railway.xyz - at some point, clients stop receiving fees from relayers and have to refresh the page to get it going again. @D4nte if you have a workaround we can implement on the frontend in the meantime - eg a timer to check if observers are observing and force reinitialize them if not - I'd love to hear it.

dao avatar May 23 '22 14:05 dao

@dao i have been pondering a workaround myself... one way would be to use the queryHistory function on a timer that keeps track of startdate as a pointer. the nodes also apparently support rpc calls and there are some docs on that, but i havent done a deep dive on it yet.

dmh1974 avatar May 23 '22 20:05 dmh1974

fyi, i discovered this also applies to relay.send when used in a browser. after 10m it too stops working

the issue does not appear to manifest when used in node

dmh1974 avatar May 29 '22 20:05 dmh1974

@dmh1974 were you using port 443 of the fleet (websockify) or 8000 (nim-websocket)?

I know @dao was using native websocket. It seems that we have a bug in nim-websocket that closes connections after 10min. Hopefully this is the issue.

@jm-clius: can you share the issue # please?

D4nte avatar May 30 '22 11:05 D4nte

Note: I was surprised by this bug because we used to have the issue with websockify and/or nginx and I resolved it using keep alive.

D4nte avatar May 30 '22 11:05 D4nte

Yes, this fix in nim-libp2p addresses the 10 min timeout issue for native websocket connections: https://github.com/status-im/nim-libp2p/pull/721 We still need to update the submodule to have an effect in nwaku.

jm-clius avatar May 30 '22 11:05 jm-clius

i am only specifying a connection for bootstrap when invoking create.. i got it from the fleets json (testnet). im not specifying a port anywhere else after that (im assuming the js-waku library handles such)

dmh1974 avatar May 30 '22 11:05 dmh1974

i am only specifying a connection for bootstrap when invoking create.. i got it from the fleets json (testnet). im not specifying a port anywhere else after that (im assuming the js-waku library handles such)

We are doing the very same thing

      logDev(`Creating js-waku client: ${fleet} fleet`);
      const peers = getPredefinedBootstrapNodes(fleet, 2);
      const waku = await Waku.create({
        bootstrap: { peers }
      });

dao avatar May 30 '22 16:05 dao

i am only specifying a connection for bootstrap when invoking create.. i got it from the fleets json (testnet). im not specifying a port anywhere else after that (im assuming the js-waku library handles such)

Ok fleets.status.im currently contains the 8000 port so it means it's nim-websocket. Note that you can work around the issue by replacing 8000 by 443 or simply using the predefined nodes available in js-waku:

https://github.com/status-im/js-waku/blob/6df972694cd9ae948fdaaf4bef16c892b943281d/src/lib/discovery/predefined.ts#L60

Also note that we discourage using fleets.status.im. This is a just a dev tool that can go down at any time.

D4nte avatar May 31 '22 00:05 D4nte

i am only specifying a connection for bootstrap when invoking create.. i got it from the fleets json (testnet). im not specifying a port anywhere else after that (im assuming the js-waku library handles such)

We are doing the very same thing

      logDev(`Creating js-waku client: ${fleet} fleet`);
      const peers = getPredefinedBootstrapNodes(fleet, 2);
      const waku = await Waku.create({
        bootstrap: { peers }
      });

Ok that's not great because it means you are using websockify (port 443) and getting the timeout. I'll need to investigate.

D4nte avatar May 31 '22 00:05 D4nte

ok gents, happy to report send and receive have been stable in two separate tabs for over 30mins now (also surviving tab backgrounding)

not sure what changed on your end, but keep up the good work! (maybe mem leak reboots?)

current env...

Waku.create({bootstrap: {peers: ["/dns4/node-01.gc-us-central1-a.wakuv2.test.statusim.net/tcp/8000/wss/p2p/16Uiu2HAmJb2e28qLXxT5kZxVUUoJt72EMzNGXB47Rxx5hw3q4YjS",]},

dmh1974 avatar Jun 01 '22 15:06 dmh1974

@dmh1974 excellent news! It was a fix in an underlying library that caused timeouts in websocket connections. Glad to hear it's working as expected.

jm-clius avatar Jun 01 '22 16:06 jm-clius

one important note... the connection does not survive 'sleeping'... if i reopen my laptop after some time, the connection does not re-establish gracefully. the page needs refreshing.

dmh1974 avatar Jun 02 '22 00:06 dmh1974

@jm-clius - Fantastic! Is the fix available in js-waku 0.24?

@dmh1974 - In you testing, is there a variable on Waku that we can watch to know "status" - whether connected or disconnected"? We're considering a status indicator that could show on the frontend.

therealjmj avatar Jun 07 '22 13:06 therealjmj

@jm-clius - Fantastic! Is the fix available in js-waku 0.24?>

The timeout was on nwaku's side (e.g. on the fleets), has been fixed on the latest master and is already running on the wakuv2.test fleet. It will also be included in the upcoming nwaku release (v0.10), scheduled for middle of next week.

jm-clius avatar Jun 07 '22 13:06 jm-clius

@jrmeurer i havent.. hopefully that will be handled gracefully by js-waku in the future.

dmh1974 avatar Jun 08 '22 00:06 dmh1974

@jm-clius i know its all related, but just for completion sake I'll mention on this ticket that even currently with observers stable, there is still a degree of message loss

dmh1974 avatar Jun 08 '22 00:06 dmh1974

This might be a fix to this issue: https://github.com/status-im/status-web/pull/288/files

D4nte avatar Jul 25 '22 11:07 D4nte

This might be a fix to this issue: https://github.com/status-im/status-web/pull/288/files

Good potential. Will this change be added to js-waku?

therealjmj avatar Jul 25 '22 13:07 therealjmj

This might be a fix to this issue: https://github.com/status-im/status-web/pull/288/files

Good potential. Will this change be added to js-waku?

I have encouraged it to be pushed directly upstream in the right libp2p dependency which seems to be the right thing to do 🙂

D4nte avatar Jul 25 '22 21:07 D4nte

Maybe keep this issue open until the problem is resolved - we have been researching why we haz this err and it's useful to see this issue

lucille-bellepleure avatar Oct 27 '22 23:10 lucille-bellepleure

Maybe keep this issue open until the problem is resolved - we have been researching why we haz this err and it's useful to see this issue

Hi @michelleplur, yes the process is to keep issues open until the problem is resolved.

Note that the final solution would be to backport the quoted patch to upstream libp2p dependency.

This is next on the list after the refactoring/doc update is done: https://github.com/waku-org/js-waku/issues/802 I haven't done it before the ESM migration because:

  • I was hoping that bumping to latest libp2p libs could have solved the issue (it did not)
  • The proper fix is to patch upstream libp2p lib, no point doing so until we are using the latest libp2p dep, which was blocked on ESM migration as libp2p is ESM only now.

Let me know if you have any question.

fryorcraken avatar Nov 01 '22 00:11 fryorcraken


11:08:38.658 libp2p:websockets dialing /dns4/node-01.ac-cn-hongkong-c.wakuv2.prod.statusim.net/tcp/443/wss/p2p/16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD +0ms

...

11:18:27.930 libp2p:dialer token 7 released +9ms common.js:108
11:18:27.931 libp2p:dialer dial failed to 16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD +0ms AggregateError: No Promise in Promise.any was resolved
    run dial-request.js:56
    _createPendingDial index.js:208
    dial index.js:98
    openConnection index.js:344
    _autoDial auto-dialler.js:92
    retimer bundle.js:129007
    setTimeout handler*Retimer retimer.js:21
    retimer retimer.js:65
    _autoDial auto-dialler.js:102
    retimer bundle.js:129007
    setTimeout handler*Retimer retimer.js:21
    retimer retimer.js:65
    _autoDial auto-dialler.js:102
    retimer bundle.js:129007
    setTimeout handler*Retimer retimer.js:21
    retimer retimer.js:65
    _autoDial auto-dialler.js:102
    retimer bundle.js:129007
    setTimeout handler*Retimer retimer.js:21
    retimer retimer.js:65
    _autoDial auto-dialler.js:102
    retimer bundle.js:129007
    setTimeout handler*Retimer retimer.js:21
    retimer retimer.js:65
    _autoDial auto-dialler.js:102
    retimer bundle.js:129007
    setTimeout handler*Retimer retimer.js:21
    retimer retimer.js:65
    _autoDial auto-dialler.js:102
    retimer bundle.js:129007
    setTimeout handler*Retimer retimer.js:21
    retimer retimer.js:65
    _autoDial auto-dialler.js:102
common.js:108
11:18:27.931 libp2p:dialer:error Promise.any*run@http://localhost:3004/eth-pm/static/js/bundle.js:164581:28
_createPendingDial@http://localhost:3004/eth-pm/static/js/bundle.js:164874:28
dial@http://localhost:3004/eth-pm/static/js/bundle.js:164764:70
async*openConnection@http://localhost:3004/eth-pm/static/js/bundle.js:165310:55
_autoDial@http://localhost:3004/eth-pm/static/js/bundle.js:164492:51
Async*./node_modules/.pnpm/[email protected]/node_modules/retimer/retimer.js/Retimer/this._timerWrapper@http://localhost:3004/eth-pm/static/js/bundle.js:129007:18
setTimeout handler*Retimer@http://localhost:3004/eth-pm/static/js/bundle.js:129010:19
retimer@http://localhost:3004/eth-pm/static/js/bundle.js:129054:10
_autoDial@http://localhost:3004/eth-pm/static/js/bundle.js:164502:28
Async*./node_modules/.pnpm/[email protected]/node_modules/retimer/retimer.js/Retimer/this._timerWrapper@http://localhost:3004/eth-pm/static/js/bundle.js:129007:18
setTimeout handler*Retimer@http://localhost:3004/eth-pm/static/js/bundle.js:129010:19
retimer@http://localhost:3004/eth-pm/static/js/bundle.js:129054:10
_autoDial@http://localhost:3004/eth-pm/static/js/bundle.js:164502:28
Async*./node_modules/.pnpm/[email protected]/node_modules/retimer/retimer.js/Retimer/this._timerWrapper@http://localhost:3004/eth-pm/static/js/bundle.js:129007:18
setTimeout handler*Retimer@http://localhost:3004/eth-pm/static/js/bundle.js:129010:19
retimer@http://localhost:3004/eth-pm/static/js/bundle.js:129054:10
_autoDial@http://localhost:3004/eth-pm/static/js/bundle.js:164502:28
Async*./node_modules/.pnpm/[email protected]/node_modules/retimer/retimer.js/Retimer/this._timerWrapper@http://localhost:3004/eth-pm/static/js/bundle.js:129007:18
setTimeout handler*Retimer@http://localhost:3004/eth-pm/static/js/bundle.js:129010:19
retimer@http://localhost:3004/eth-pm/static/js/bundle.js:129054:10
_autoDial@http://localhost:3004/eth-pm/static/js/bundle.js:164502:28
Async*./node_modules/.pnpm/[email protected]/node_modules/retimer/retimer.js/Retimer/this._timerWrapper@http://localhost:3004/eth-pm/static/js/bundle.js:129007:18
setTimeout handler*Retimer@http://localhost:3004/eth-pm/static/js/bundle.js:129010:19
retimer@http://localhost:3004/eth-pm/static/js/bundle.js:129054:10
_autoDial@http://localhost:3004/eth-pm/static/js/bundle.js:164502:28
Async*./node_modules/.pnpm/[email protected]/node_modules/retimer/retimer.js/Retimer/this._timerWrapper@http://localhost:3004/eth-pm/static/js/bundle.js:129007:18
setTimeout handler*Retimer@http://localhost:3004/eth-pm/static/js/bundle.js:129010:19
retimer@http://localhost:3004/eth-pm/static/js/bundle.js:129054:10
_autoDial@http://localhost:3004/eth-pm/static/js/bundle.js:164502:28
Async*./node_modules/.pnpm/[email protected]/node_modules/retimer/retimer.js/Retimer/this._timerWrapper@http://localhost:3004/eth-pm/static/js/bundle.js:129007:18
setTimeout handler*Retimer@http://localhost:3004/eth-pm/static/js/bundle.js:129010:19
retimer@http://localhost:3004/eth-pm/static/js/bundle.js:129054:10
_autoDial@http://localhost:3004/eth-pm/static/js/bundle.js:164502:28
 +1ms common.js:108
11:18:27.931 libp2p:connection-manager:auto-dialler:error could not connect to peerStore stored peer +10s AggregateError: No Promise in Promise.any was resolved
    run dial-request.js:56
    _createPendingDial index.js:208
    dial index.js:98
    openConnection index.js:344
    _autoDial auto-dialler.js:92
    retimer bundle.js:129007
    setTimeout handler*Retimer retimer.js:21
    retimer retimer.js:65
    _autoDial auto-dialler.js:102
    retimer bundle.js:129007
    setTimeout handler*Retimer retimer.js:21
    retimer retimer.js:65
    _autoDial auto-dialler.js:102
    retimer bundle.js:129007
    setTimeout handler*Retimer retimer.js:21
    retimer retimer.js:65
    _autoDial auto-dialler.js:102
    retimer bundle.js:129007
    setTimeout handler*Retimer retimer.js:21
    retimer retimer.js:65
    _autoDial auto-dialler.js:102
    retimer bundle.js:129007
    setTimeout handler*Retimer retimer.js:21
    retimer retimer.js:65
    _autoDial auto-dialler.js:102
    retimer bundle.js:129007
    setTimeout handler*Retimer retimer.js:21
    retimer retimer.js:65
    _autoDial auto-dialler.js:102
    retimer bundle.js:129007
    setTimeout handler*Retimer retimer.js:21
    retimer retimer.js:65
    _autoDial auto-dialler.js:102
common.js:108
11:18:37.932 libp2p:peer-store:trace getPeers await read lock +10s common.js:108
11:18:37.932 libp2p:peer-store:trace getPeers got read lock +0ms common.js:108
11:18:37.934 libp2p:peer-store:trace getPeers release read lock +2ms common.js:108
11:18:37.934 libp2p:connection-manager:auto-dialler connecting to a peerStore stored peer 16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD +10s common.js:108
11:18:37.934 libp2p:connection-manager dial to 16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD +10s common.js:108
11:18:37.934 libp2p:dialer check multiaddrs 16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD +10s common.js:108
11:18:37.935 libp2p:dialer creating dial target for 16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD +0ms common.js:108
11:18:37.935 libp2p:peer-store:address-book:trace get wait for read lock +10s common.js:108
11:18:37.935 libp2p:peer-store:address-book:trace get got read lock +1ms common.js:108
11:18:37.936 libp2p:peer-store:address-book:trace get release read lock +0ms common.js:108
11:18:37.938 libp2p:dialer 1 tokens request, returning 1, 99 remaining +3ms common.js:108
11:18:37.939 libp2p:websockets dialing /ip4/0.0.0.0/tcp/8000/wss/p2p/16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD +10s common.js:108
11:18:37.939 libp2p:websockets dialing 0.0.0.0:8000 +0ms common.js:108
11:18:37.944 Firefox can’t establish a connection to the server at wss://0.0.0.0:8000/p2p/16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD. client.js:9
11:18:37.945 libp2p:websockets:error connection error: +10s 

Autodial actually happens when the connection is lost after 10min but the wrong multiaddr is used:

11:20:34.350 libp2p:connection-manager dial to 16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD +36s [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/eth-pm/node_modules/.pnpm/[email protected]/node_modules/debug/src/common.js)
11:20:34.351 libp2p:dialer check multiaddrs 16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD +10s [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/eth-pm/node_modules/.pnpm/[email protected]/node_modules/debug/src/common.js)
11:20:34.351 libp2p:dialer creating dial target for 16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD +0ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/eth-pm/node_modules/.pnpm/[email protected]/node_modules/debug/src/common.js)
11:20:34.351 libp2p:peer-store:address-book:trace get wait for read lock +36s [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/eth-pm/node_modules/.pnpm/[email protected]/node_modules/debug/src/common.js)
11:20:34.352 libp2p:peer-store:address-book:trace get got read lock +1ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/eth-pm/node_modules/.pnpm/[email protected]/node_modules/debug/src/common.js)
11:20:34.352 libp2p:peer-store:address-book:trace get release read lock +0ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/eth-pm/node_modules/.pnpm/[email protected]/node_modules/debug/src/common.js)
11:20:34.356 libp2p:dialer 1 tokens request, returning 1, 99 remaining +5ms [common.js:108](http://localhost:3004/home/fryorcraken/src/waku-org/js-waku-examples/eth-pm/node_modules/.pnpm/[email protected]/node_modules/debug/src/common.js)
11:20:34.356 libp2p:websockets dialing /ip4/0.0.0.0/tcp/8000/wss/p2p/16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD +36

Need to investigate why the multiaddr is not being stored in the peer store.

fryorcraken avatar Nov 21 '22 00:11 fryorcraken

investigating and reported to libp2p too https://github.com/libp2p/js-libp2p/issues/1484

fryorcraken avatar Nov 21 '22 03:11 fryorcraken

Issue found: https://github.com/libp2p/js-libp2p/issues/1484#issuecomment-1321487991 upstream fix is being tested.

fryorcraken avatar Nov 21 '22 05:11 fryorcraken

Upstream fix proposed: https://github.com/libp2p/js-libp2p/pull/1485

Note that latest js-libp2p automatically attempt to reconnect to the remote node when the connection drops. Which means that the workaround https://github.com/status-im/status-web/pull/288/files is not needed any more.

This is done in the form of the autodialer: https://github.com/libp2p/js-libp2p/blob/a3847f2d1725b1c92d5e0ef7bcdf840ea8428a75/src/connection-manager/auto-dialler.ts which only dials when the number of connections is under a threshold.

Unfortunately, an other issue was corrupting the address book after the identify protocol is executed. Which is the upstream fix.

One work around of the latest issue would be to do addressBook.add after identify is done so that the overridden value can be re-instated.

fryorcraken avatar Nov 21 '22 06:11 fryorcraken