js-libp2p icon indicating copy to clipboard operation
js-libp2p copied to clipboard

`.stop()` sometimes hangs indefinitely

Open marcus-pousette opened this issue 1 year ago • 8 comments

  • Version: 1.8.1

  • Subsystem: @libp2p/websocket , connection manager, components

Severity:

High

Description:

Steps to reproduce the error:

I have a test scenario with a few nodes that I connect to each other and send some data

Sometimes when terminating a node it endlessly waits for the Websocket transport to shutdown. This is because there are connections that does close (?) and in the it-ws library there are no calls to server.closeAllConnections() before close.

I have managed to make the shutdown work better if modify https://github.com/libp2p/js-libp2p/blob/73f2b6b6d767492f8f1d740fac382a2d22c3fca1/packages/libp2p/src/components.ts#L72 to terminate the components in the reverse order in sequentially... so I suspect there is some kind of race condition going on or leak of some sort

Also no issues if I use Tcp transport instead.

This is not a good description on how to reproduce the issue, because it is a bit random when it occurs. But wanted to create this issue if others have the same problem and will update the description if I have a isolated a good test

marcus-pousette avatar Jul 18 '24 13:07 marcus-pousette

When the stop timeouts, I have one connection left with 0 streams, remoteAddr: '/ip6/::ffff:7f00:1/tcp/56327/p2p/12D3KooWApN6mEB5666wTpqpzkgcKXC7kgMxmm7CAWRiD3WAg4Yv' direction: 'inbound' encryption: '/noise' multiplexer: '/yamux/1.0.0' transient: false

marcus-pousette avatar Jul 18 '24 13:07 marcus-pousette

Closing the connectionManager component before all other components seems also to fix the issue


async _invokeStartableMethod(methodName) {
        const timeout = new Promise((resolve, reject) => {
            setTimeout(() => {
                reject('timeout')
            }, 10000);
        });
        const connectionManager = this.components["connectionManager"]
        if (isStartable(connectionManager)) {
            await connectionManager[methodName]?.();
        }
        let promise = Promise.all(Object.values(this.components)
            .filter(obj => isStartable(obj))
            .map(async (startable) => {
                await startable[methodName]?.();
            }));
        const race = Promise.race([promise, timeout]);
        try {
            await race;
        }
        catch (e) {
            throw e;
        }
        finally {
            clearTimeout(timeout);
        }
    }

marcus-pousette avatar Jul 18 '24 13:07 marcus-pousette

in the it-ws library there are no calls to server.closeAllConnections() before close.

I may be missing something but where is the closeAllConnections method defined? The it-ws module stores a reference to a http/https net.Server which has no such method.

no issues if I use Tcp transport

Looking at the code, the TCP listener aborts each connection on close whereas the WebSocket transport tries to close them gracefully.

Do you still see the issue if the transport aborts the connections instead of closing them on close?

achingbrain avatar Jul 18 '24 14:07 achingbrain

http.Server and https.Server has that method since NodeJs 18.2.0

/// it-ws/server.ts

 async close (): Promise<void> {
    await new Promise<void>((resolve, reject) => {
      this.server.closeAllConnections()
      this.server.close((err) => {
        if (err != null) {
          reject(err); return
        }

        resolve()
      })
    })
  }
  

marcus-pousette avatar Jul 18 '24 14:07 marcus-pousette

Actually the closeAllConnections trick does work since the connection is a tcp one (?)

The close connection manager first trick seems to always work.

Also adding like a 5s delay before stopping also seems to work.

It feels like there is some kind of connection creation in process that is not caught correctly on closing

marcus-pousette avatar Jul 18 '24 14:07 marcus-pousette

http.Server and https.Server has that method since NodeJs 18.2.0

Aha, that's interesting, nice find.

The docs do say though:

This does not destroy sockets upgraded to a different protocol, such as WebSocket or HTTP/2

...so I wonder if there's something else going on here?

achingbrain avatar Jul 18 '24 15:07 achingbrain

Ye the whole thing about aborting Websocket connections before closing does not seem to have an impact. Only the two things I mentioned in my last comment did mitigate the problem for me

marcus-pousette avatar Jul 18 '24 18:07 marcus-pousette

Looking at the code, the TCP listener aborts each connection on close whereas the WebSocket transport tries to close them gracefully.

That would explain something!

marcus-pousette avatar Jul 18 '24 18:07 marcus-pousette

I'm running a node with this configuration, and I keep having the hanging problem:

const node = await createLibp2p({
    transports: [webRTCDirect(), webRTC(), circuitRelayTransport()],
    services: {
      ping: ping(),
      identify: identify(),
    },
    streamMuxers: [yamux()],
  })

So, in my script I connect to a Go node running on the same machine using WebRTC transport, and I call ping a few times. After that, I do something like this:

async main() {
    try {
        // Create node, connect and ping.
        // ...
    } finally {
        console.log('STOPPING', node.status)
        await node.stop()
        console.log('STOPPED', node.status)
    }
    console.log('RETURNING', node.status)
    return 
}

main()

and I see all 3 logs, I see that the node moves from started to stopped state, but the script sometimes is left hanging. I tried to inspect the active requests and handles of the runtime, but they are empty. It's really weird.

And stopping the connection manager explicitly doesn't help either — the script still sometimes hangs.

burdiyan avatar Apr 09 '25 13:04 burdiyan

Can you try adding why-is-node-running to your script? It's invaluable at tracking down why node doesn't exit.

import why from 'why-is-node-running'

// ... more imports

async main() {
  try {
    // Create node, connect and ping.
    // ...
  } finally {
    // print all open handles after 5s - `.unref` the timeout so it doesn't keep the process running itself
    setTimeout(() => {
      why()
    }, 5_000).unref()

    console.log('STOPPING', node.status)
    await node.stop()
    console.log('STOPPED', node.status)
  }

  console.log('RETURNING', node.status)
  return 
}

achingbrain avatar Apr 09 '25 16:04 achingbrain

Oh, thanks! Didn't know such thing exists (I'm not a JS developer in general, just had to dig into js-libp2p this time).

burdiyan avatar Apr 10 '25 10:04 burdiyan

This is what I get. Not sure where to start digging, but hope it could be useful to resolve the issue:

There are 7 handle(s) keeping the process running.

# KEYPAIRGENREQUEST
@libp2p/crypto/dist/src/keys/ed25519/index.js:31 - const key = keypair('ed25519', {
@libp2p/crypto/dist/src/keys/ed25519/utils.js:22 - const { privateKey, publicKey } = crypto.generateKey();
@libp2p/crypto/dist/src/keys/index.js:19         - return generateEd25519KeyPair();
libp2p/dist/src/index.js:47                      - options.privateKey ??= await generateKeyPair('Ed25519');
src/run.mjs:36                                                                                                         - const node = await createLibp2p({
src/run.mjs:97                                                                                                         - await main()

# KEYPAIRGENREQUEST
@peculiar/webcrypto/build/webcrypto.js:1546                                 - const keys = crypto.generateKeyPairSync("ec", {
@peculiar/webcrypto/build/webcrypto.js:1729                                 - const keys = await EcCrypto.generateKey({
webcrypto-core/build/webcrypto-core.js:131                                  - return this.onGenerateKey.apply(this, args);
webcrypto-core/build/webcrypto-core.js:1435                                 - const result = await provider.generateKey({ ...preparedAlgorithm, name: provider.name }, extractable, keyUsages, ...params);
@libp2p/webrtc/dist/src/private-to-public/utils/get-rtcpeerconnection.js:62 - const keyPair = await crypto.subtle.generateKey({
@libp2p/webrtc/dist/src/private-to-public/transport.js:74                   - const peerConnection = await createDialerRTCPeerConnection('client', ufrag, typeof this.init.rtcConfiguration === 'function' ? await this.init.rtcConfiguration() : this.init.rtcConfiguration ?? {});

# ThreadSafeCallback callback
@ipshipyard/node-datachannel/dist/esm/polyfill/RTCDataChannel.mjs:45     - __privateGet(this, _dataChannel).onClosed(() => {
@ipshipyard/node-datachannel/dist/esm/polyfill/RTCPeerConnection.mjs:282 - const dataChannel = new RTCDataChannel(channel, opts);
@libp2p/webrtc/dist/src/private-to-public/utils/connect.js:15            - const handshakeDataChannel = peerConnection.createDataChannel('', { negotiated: true, id: 0 });
@libp2p/webrtc/dist/src/private-to-public/transport.js:76                - return await raceSignal(connect(peerConnection, ufrag, {
@libp2p/webrtc/dist/src/private-to-public/transport.js:36                - const rawConn = await this._connect(ma, options);
libp2p/dist/src/connection-manager/dial-queue.js:210                     - const conn = await this.components.transportManager.dial(address.multiaddr, {
libp2p/dist/src/connection-manager/dial-queue.js:147                     - return await this.dialPeer(options, signal);
race-signal/dist/src/index.js:31                                         - return await Promise.race([

# KEYPAIRGENREQUEST
@chainsafe/libp2p-noise/dist/src/crypto/index.js:116          - const { publicKey, privateKey } = crypto.generateKeyPairSync('x25519', {
@chainsafe/libp2p-noise/dist/src/noise.js:40                  - this.staticKey = _crypto.generateX25519KeyPair();
@chainsafe/libp2p-noise/dist/src/index.js:41                  - return (components) => new Noise(components, init);
@libp2p/webrtc/dist/src/private-to-public/utils/connect.js:70 - const connectionEncrypter = noise({ prologueBytes: noisePrologue })(options);
race-signal/dist/src/index.js:31                              - return await Promise.race([
@libp2p/webrtc/dist/src/private-to-public/transport.js:76     - return await raceSignal(connect(peerConnection, ufrag, {

# SIGNREQUEST
@libp2p/crypto/dist/src/keys/ed25519/index.js:87               - return crypto.sign(null, msg instanceof Uint8Array ? msg : msg.subarray(), obj);
@libp2p/crypto/dist/src/keys/ed25519/ed25519.js:51             - return crypto.hashAndSign(this.raw, message);
@chainsafe/libp2p-noise/dist/src/utils.js:8                    - const identitySig = await privateKey.sign(getSignaturePayload(staticPublicKey));
@chainsafe/libp2p-noise/dist/src/performHandshake.js:40        - const payload = await createHandshakePayload(privateKey, s.publicKey, extensions);
@chainsafe/libp2p-noise/dist/src/noise.js:154                  - result = await performHandshakeResponder({
@chainsafe/libp2p-noise/dist/src/noise.js:106                  - const handshake = await this.performHandshakeResponder(wrappedConnection, this.components.privateKey, options?.remotePeer?.publicKey, options);
@libp2p/webrtc/dist/src/private-to-public/utils/connect.js:123 - await connectionEncrypter.secureInbound(wrappedDuplex, {

# KEYPAIRGENREQUEST
@chainsafe/libp2p-noise/dist/src/crypto/index.js:116    - const { publicKey, privateKey } = crypto.generateKeyPairSync('x25519', {
@chainsafe/libp2p-noise/dist/src/protocol.js:98         - const e = this.crypto.generateKeypair();
@chainsafe/libp2p-noise/dist/src/protocol.js:200        - const e = this.writeE();
@chainsafe/libp2p-noise/dist/src/performHandshake.js:54 - await connection.write(xx.writeMessageB(payload), options);
@chainsafe/libp2p-noise/dist/src/noise.js:154           - result = await performHandshakeResponder({
@chainsafe/libp2p-noise/dist/src/noise.js:106           - const handshake = await this.performHandshakeResponder(wrappedConnection, this.components.privateKey, options?.remotePeer?.publicKey, options);

# SIGNREQUEST
@libp2p/crypto/dist/src/keys/ed25519/index.js:110              - return crypto.verify(null, msg instanceof Uint8Array ? msg : msg.subarray(), obj, sig);
@libp2p/crypto/dist/src/keys/ed25519/ed25519.js:31             - return crypto.hashAndVerify(this.raw, sig, data);
@chainsafe/libp2p-noise/dist/src/utils.js:26                   - if (!(await publicKey.verify(signaturePayload, payload.identitySig))) {
@chainsafe/libp2p-noise/dist/src/performHandshake.js:60        - const receivedPayload = await decodeHandshakePayload(plaintext, xx.rs, remoteIdentityKey);
@chainsafe/libp2p-noise/dist/src/noise.js:154                  - result = await performHandshakeResponder({
@chainsafe/libp2p-noise/dist/src/noise.js:106                  - const handshake = await this.performHandshakeResponder(wrappedConnection, this.components.privateKey, options?.remotePeer?.publicKey, options);
@libp2p/webrtc/dist/src/private-to-public/utils/connect.js:123 - await connectionEncrypter.secureInbound(wrappedDuplex, {

burdiyan avatar Apr 10 '25 12:04 burdiyan

I tried to closing various components manually, making sure to close each connection explicitly, making sure to closing all the streams on the connection before closing them, etc. Nothing seems to help — sometimes the program hangs anyway.

burdiyan avatar Apr 10 '25 12:04 burdiyan

The # KEYPAIRGENREQUEST and # SIGNREQUEST stuff is safe to ignore - it's probably the # ThreadSafeCallback callback - it looks like a dial has hung and/or is still in progress and it's not being torn down properly.

achingbrain avatar Apr 10 '25 14:04 achingbrain

Hm... interesting. I only dial one peer in this script, and then I make sure I close the connection explicitly before stopping the node (I also tried closing the connection after stopping the node, which is weird, but who knows). I'll check if there's something stuck in the dial queue.

burdiyan avatar Apr 10 '25 14:04 burdiyan

The dial queue is empty when the program is getting stuck.

I also tried passing custom abort signal to all the components and method calls that accept it, and then making sure I abort after stopping the libp2p node. Didn't help either.

burdiyan avatar Apr 10 '25 15:04 burdiyan

Another curious thing which may or may not be relevant is that when I run my script with node it hangs much more often than if I run it with bun.

burdiyan avatar Apr 10 '25 15:04 burdiyan

Another observation: using TCP transport I don't have this problem — seems to only be happening with the WebRTC Direct transport. I can't use TCP though because I'm planning to run this in the browser (even though I'm testing locally with NodeJS for faster feedback loop).

burdiyan avatar Apr 10 '25 15:04 burdiyan

seems to only be happening with the WebRTC Direct transport

Yeah, that's to be expected - the ThreadSafeCallback reference that's keeping the process running is part of WebRTCDirect (e.g. private to public):

# ThreadSafeCallback callback
@ipshipyard/node-datachannel/dist/esm/polyfill/RTCDataChannel.mjs:45     - __privateGet(this, _dataChannel).onClosed(() => {
@ipshipyard/node-datachannel/dist/esm/polyfill/RTCPeerConnection.mjs:282 - const dataChannel = new RTCDataChannel(channel, opts);
@libp2p/webrtc/dist/src/private-to-public/utils/connect.js:15            - const handshakeDataChannel = peerConnection.createDataChannel('', { negotiated: true, id: 0 });
@libp2p/webrtc/dist/src/private-to-public/transport.js:76                - return await raceSignal(connect(peerConnection, ufrag, {
@libp2p/webrtc/dist/src/private-to-public/transport.js:36                - const rawConn = await this._connect(ma, options);
libp2p/dist/src/connection-manager/dial-queue.js:210                     - const conn = await this.components.transportManager.dial(address.multiaddr, {
libp2p/dist/src/connection-manager/dial-queue.js:147                     - return await this.dialPeer(options, signal);
race-signal/dist/src/index.js:31                                         - return await Promise.race([

Is the node you are dialing public? If so can you share the multiaddr of the peer you are trying to dial?

achingbrain avatar Apr 10 '25 15:04 achingbrain

Is the node you are dialing public? If so can you share the multiaddr of the peer you are trying to dial?

No, I'm dialing a local node running on the same machine. I'll try to dial some public node to see if it keeps happening.

burdiyan avatar Apr 10 '25 15:04 burdiyan

I can't replicate this locally - what OS/node version are you using?

achingbrain avatar Apr 10 '25 16:04 achingbrain

I tried dialing 12D3KooWKnDdG3iXw9eTFijk3EWSunZcFi54Zka4wmtqtt6rPxc8 on /ip4/40.160.9.115/udp/4005/webrtc-direct/certhash/uEiDvpLLt7lCMQ_BD2XcXoubpigiSCwMZ_y-d5IGl2NfVoA and it keeps happening, but very randomly.

To reproduce try running a Kubo node. I'm running Kubo v0.34.1.

And here's the cleaned up script that I hope is useful. I'm running it with Node v20.12.2 (pretty old, yeah, but same problem happens with Bun too, so would guess it's not Node-specific).

import {createLibp2p} from 'libp2p'
import {webRTC, webRTCDirect} from '@libp2p/webrtc'
import {circuitRelayTransport} from '@libp2p/circuit-relay-v2'
import {ping} from '@libp2p/ping'
import {multiaddr} from '@multiformats/multiaddr'
import {identify} from '@libp2p/identify'
import {peerIdFromString} from '@libp2p/peer-id'
import {yamux} from '@chainsafe/libp2p-yamux'
import why from 'why-is-node-running'

export async function main() {
  // Replace with peer info of your locally running Kubo node.
  const peerInfo = {
    id: '12D3KooWH7RY9kkwqUAn4vzrtes5D9oFf48J3nQDRNHm81uqEdMJ',
    addrs: [
      '/ip4/127.0.0.1/udp/4001/webrtc-direct/certhash/uEiATWFB-N5p3HLeAetDJ6G4gL3eSCQ8_bn1oeWMjird2ng/p2p/12D3KooWH7RY9kkwqUAn4vzrtes5D9oFf48J3nQDRNHm81uqEdMJ',
    ],
  }

  const node = await createLibp2p({
    start: false,
    transports: [webRTCDirect(), webRTC(), circuitRelayTransport()],
    services: {
      ping: ping(),
      identify: identify(),
    },
    streamMuxers: [yamux()],
  })

  const pid = peerIdFromString(peerInfo.id)

  await node.peerStore.merge(pid, {
    multiaddrs: peerInfo.addrs.map((v) => multiaddr(v)),
  })

  await node.start()

  try {
    console.log(await node.services.ping.ping(pid))
    console.log(await node.services.ping.ping(pid))

    const info = await node.peerStore.get(pid)
    console.log('PROTOCOLS', info.protocols)
  } finally {
    // print all open handles after 5s - `.unref` the timeout so it doesn't keep the process running itself
    setTimeout(() => {
      why()
    }, 5_000).unref()

    console.log('STOPPING', node.status)
    await node.stop()
    console.log('STOPPED', node.status)
  }

  console.log('RETURNING', node.status)
  return
}

process.on('exit', (code) => {
  console.log('Process exiting with code', code)
})

async function realMain() {
  await main()
  console.log('MAIN DONE')
}

realMain()

burdiyan avatar Apr 10 '25 16:04 burdiyan

Thanks for that - I've replicated the issue locally. Can you try patching your local copy of @libp2p/webrtc in node_modules with the changes from https://github.com/libp2p/js-libp2p/pull/3076/files?w=1 ?

It basically just wraps the use of the handshake datachannel in a try/finally and explicitly closes the channel once the handshake is complete or has failed.

I'm not seeing the hang any more with the fix (though sometimes there's an EOF where the remote closes the channel during the noise handshake, but it doesn't keep the process running).

achingbrain avatar Apr 10 '25 17:04 achingbrain

@achingbrain Yeah, this seems to help. Thanks! Hope the PR won't take long to get accepted :)

burdiyan avatar Apr 10 '25 20:04 burdiyan

@achingbrain Before the PR is accepted, do you think there's any way to close that leaking channel from the outside? Maybe some obscure API that would let you close the handle that prevents the program to exit?

burdiyan avatar Apr 14 '25 12:04 burdiyan

There's no secret API that allows access to the channel. The strange thing is that the RTCPeerConnection is closed when the node is stopped - at this point any open datachannels should get closed too but for a reason that needs figuring out this doesn't appear to be happening.

If you are blocked & need a release ASAP you could try something like patch-package to apply the changes after an npm install, though you'll likely lose compatibility with any rust-libp2p nodes until https://github.com/libp2p/rust-libp2p/issues/5986 is resolved.

achingbrain avatar Apr 15 '25 08:04 achingbrain

I'm still seeing this even after #3076 - I think the bug might be in node-datachannel, it looks like sometimes RTCDataChannels just don't close properly - https://github.com/murat-dogan/node-datachannel/issues/349

achingbrain avatar Apr 16 '25 07:04 achingbrain