
`MetadataService` leaves sockets open

Open · otaviomacedo opened this issue 3 weeks ago • 7 comments


Describe the bug

The MetadataService.fetchMetadataToken() method leaves open sockets behind. The following TypeScript code shows the problem:

import { MetadataService } from "@aws-sdk/ec2-metadata-service";

async function main() {
  try {
    const metadataService = new MetadataService({
      httpOptions: {
        timeout: 1000,
      },
    });
    await metadataService.fetchMetadataToken();
  } catch (error) {
    // Not really interested in errors here
  }

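  // _getActiveHandles() is an undocumented Node.js internal that lists the
  // handles (e.g. sockets, timers) currently keeping the event loop alive.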
  console.log('handles:', (process as any)._getActiveHandles());
}

main().catch(console.error);

The result of running this program is something like:

/Users/otaviom/.local/share/nvm/v22.16.0/bin/node --import file:/Applications/IntelliJ%20IDEA.app/Contents/plugins/nodeJS/js/ts-file-loader/node_modules/tsx/dist/loader.cjs /Users/otaviom/projects/cdk-app/bin/lab.ts
handles: [
  ...
  <ref *3> Socket {
    connecting: false,
    _hadError: false,
    _parent: null,
    _host: null,
    _closeAfterHandlingError: false,
    _events: ...,
    _readableState: ...,
    _writableState: ...,
    allowHalfOpen: false,
    _maxListeners: undefined,
    _eventsCount: 9,
    _sockname: null,
    _pendingData: 'PUT /latest/api/token HTTP/1.1\r\n' +
      'x-aws-ec2-metadata-token-ttl-seconds: 21600\r\n' +
      'Host: 169.254.169.254\r\n' +
      'Connection: keep-alive\r\n' +
      'Content-Length: 0\r\n' +
      '\r\n',
    _pendingEncoding: 'latin1',
    server: null,
    _server: null,
    parser: ...,
    _httpMessage: ClientRequest {
      _events: [Object: null prototype],
      _eventsCount: 3,
      _maxListeners: undefined,
      outputData: [],
      outputSize: 0,
      writable: true,
      destroyed: true,
      _last: false,
      chunkedEncoding: false,
      shouldKeepAlive: true,
      maxRequestsOnConnectionReached: false,
      _defaultKeepAlive: true,
      useChunkedEncodingByDefault: true,
      sendDate: false,
      _removedConnection: false,
      _removedContLen: false,
      _removedTE: false,
      strictContentLength: false,
      _contentLength: 0,
      _hasBody: true,
      _trailer: '',
      finished: true,
      _headerSent: true,
      _closed: false,
      _header: 'PUT /latest/api/token HTTP/1.1\r\n' +
        'x-aws-ec2-metadata-token-ttl-seconds: 21600\r\n' +
        'Host: 169.254.169.254\r\n' +
        'Connection: keep-alive\r\n' +
        'Content-Length: 0\r\n' +
        '\r\n',
      _keepAliveTimeout: 0,
      _onPendingData: [Function: nop],
      agent: [Agent],
      socketPath: undefined,
      method: 'PUT',
      maxHeaderSize: undefined,
      insecureHTTPParser: undefined,
      joinDuplicateHeaders: undefined,
      path: '/latest/api/token',
      _ended: false,
      res: null,
      aborted: false,
      timeoutCb: null,
      upgradeOrConnect: false,
      parser: [HTTPParser],
      maxHeadersCount: null,
      reusedSocket: false,
      host: '169.254.169.254',
      protocol: 'http:',
      ...
    },
    ...
  }
]

In some environments (e.g., GitHub workflows), this causes the Node process to hang.

Regression Issue

  • [x] Select this option if this issue appears to be a regression.

SDK version number

@aws-sdk/[email protected]

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

Node version: v24.11.1

Reproduction Steps

Run the code above.

Observed Behavior

There are open sockets at the end of the program execution.

Expected Behavior

All sockets should have been closed.

Possible Solution

No response

Additional Information/Context

No response

otaviomacedo · Dec 03 '25 13:12

e.g., GitHub workflows

Probably only GitHub workflows running on CodeBuild runners (i.e., this will happen on anything that runs on EC2)

rix0rrr · Dec 03 '25 13:12

e.g., GitHub workflows

Probably only GitHub workflows running on CodeBuild runners (i.e., this will happen on anything that runs on EC2)

Do you mean ECS?

otaviomacedo · Dec 03 '25 15:12

Red herring! We in fact only see this on naked GitHub Actions machines.

rix0rrr · Dec 03 '25 15:12

Here's a minimal example that shows the hangup in GitHub:

https://github.com/otaviomacedo/node-hangup-sdk/actions/runs/19927783246/job/57131773612?pr=2

At the end of the output:

PASS test/node-hangup-sdk.test.ts (10.111 s)
  ✓ Should not hang up (34 ms)
Test Suites: 1 passed, 1 total
Tests:       1 passed, 1 total
Snapshots:   0 total
Time:        10.324 s
Ran all test suites.
Jest did not exit one second after the test run has completed.
This usually means that there are asynchronous operations that weren't stopped in your tests. Consider running Jest with `--detectOpenHandles` to troubleshoot this issue.

otaviomacedo · Dec 04 '25 11:12

I've made a discovery:

  • On GitHub Actions, they're running a routable version of the IMDS endpoint (169.254.169.254) that always returns a 400.

So here's the difference:

| Machine        | NodeHttpHandler              | MetadataService        | Hangs? |
| -------------- | ---------------------------- | ---------------------- | ------ |
| Our Macs       | 🛑 fails (IMDS not routable) | 🛑 fails               | 👍 No  |
| EC2 instance   | ✅ succeeds with 200         | ✅ succeeds            | 👍 No  |
| GitHub Actions | ✅ succeeds with 400         | 🛑 fails with an error | 😡 Yes |

The issue seems to lie at the boundary between the NodeHttpHandler successfully fetching an HTTP failure page and the MetadataService turning that into a failure. The difference is that in the 200 case the response body gets consumed, but in the 400 case it doesn't.

And indeed, adding this line:

[screenshot: a line that consumes the response body]

fixes the issue.

Not consuming the response body keeps the socket alive, which in turn keeps the Node process alive.
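
For reference, here is an illustrative sketch (not the SDK's actual code) of what "consuming the response body" means in plain node:http terms; the function name and error handling below are assumptions:

import * as http from "node:http";

// Illustrative only: mirrors the shape of the token request above.
function requestToken(host: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const req = http.request(
      { host, method: "PUT", path: "/latest/api/token" },
      (res) => {
        if (res.statusCode !== 200) {
          // Drain the body even though the request is treated as a failure;
          // otherwise the keep-alive socket stays checked out and keeps Node alive.
          res.resume();
          reject(new Error(`unexpected status ${res.statusCode}`));
          return;
        }
        const chunks: Buffer[] = [];
        res.on("data", (chunk) => chunks.push(chunk));
        res.on("end", () => resolve(Buffer.concat(chunks).toString("utf8")));
      }
    );
    req.on("error", reject);
    req.end();
  });
}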

With this information, this is now easy to reproduce on a developer machine:

$ env AWS_EC2_METADATA_SERVICE_ENDPOINT=https://google.com/urldoesntexist npx jest

This makes the hang reproduce with Otavio's repository on my machine.


What is odd here is that the MetadataService hasn't seen changes for over a year, so it must be some recent changes to the NodeHttpHandler that are now triggering this behavior.

rix0rrr · Dec 04 '25 15:12

Relatedly, I've been looking at the NodeHttpHandler and I noticed the handling of error cases could be a bit cleaner.

Specifically, after introducing console.log()s in there to see what's happening, I noticed that the code tries to reject the promise more than once: once from the NodeHttpHandler itself, and then again when an ECONNRESET error comes out of the underlying socket.

Concretely, what is currently being done is this:

req.destroy();
reject(someError);        // Rejection 1 with someError

req.on("error", (e) => reject(e));  // Fires again with ECONNRESET

Whereas I believe what you are supposed to do is:

req.destroy(someError);

req.on("error", (e) => reject(e));    // Fires once with someError

I don't think this is necessarily the cause of any leaks or hangs; I've been fixing it up, and looking at the diff afterwards everything looks like it's being closed correctly regardless, and a second resolve/reject is ignored anyway. But it feels like resource management mistakes could easily slip in here. (*)

(*) Only looked at NodeHttpHandler, not NodeHttp2Handler
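
As a self-contained illustration of that destroy(error) pattern (the helper below is hypothetical, not the NodeHttpHandler's actual code):

import * as http from "node:http";

// Hypothetical helper: every failure funnels through a single "error" handler
// instead of calling reject() directly and destroying the request separately.
function requestWithTimeout(options: http.RequestOptions, timeoutMs: number): Promise<http.IncomingMessage> {
  return new Promise((resolve, reject) => {
    const req = http.request(options, resolve);
    // Single rejection path: whatever error destroys the request arrives here, once.
    req.on("error", reject);
    req.setTimeout(timeoutMs, () => {
      // destroy(err) tears the request down and emits "error" with err,
      // so there is no separate reject() call for a later ECONNRESET to race against.
      req.destroy(new Error(`request timed out after ${timeoutMs} ms`));
    });
    req.end();
  });
}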

rix0rrr · Dec 04 '25 15:12

Fix was released in v3.946.0.

siddsriv · Dec 09 '25 19:12