`MetadataService` leaves sockets open
Checkboxes for prior research
- [x] I've gone through Developer Guide and API reference
- [x] I've checked AWS Forums and StackOverflow.
- [x] I've searched for previous similar issues and didn't find any solution.
Describe the bug
`MetadataService.fetchMetadataToken()` leaves open sockets behind. The following TypeScript code shows the problem:
```ts
import { MetadataService } from "@aws-sdk/ec2-metadata-service";

async function main() {
  try {
    const metadataService = new MetadataService({
      httpOptions: {
        timeout: 1000,
      },
    });
    await metadataService.fetchMetadataToken();
  } catch (error) {
    // Not really interested in errors here
  }
  console.log('handles:', (process as any)._getActiveHandles());
}

main().catch(console.error);
```
The result of running this program is something like:
```
/Users/otaviom/.local/share/nvm/v22.16.0/bin/node --import file:/Applications/IntelliJ%20IDEA.app/Contents/plugins/nodeJS/js/ts-file-loader/node_modules/tsx/dist/loader.cjs /Users/otaviom/projects/cdk-app/bin/lab.ts
handles: [
...
<ref *3> Socket {
connecting: false,
_hadError: false,
_parent: null,
_host: null,
_closeAfterHandlingError: false,
_events: ...,
_readableState: ...,
_writableState: ...,
allowHalfOpen: false,
_maxListeners: undefined,
_eventsCount: 9,
_sockname: null,
_pendingData: 'PUT /latest/api/token HTTP/1.1\r\n' +
'x-aws-ec2-metadata-token-ttl-seconds: 21600\r\n' +
'Host: 169.254.169.254\r\n' +
'Connection: keep-alive\r\n' +
'Content-Length: 0\r\n' +
'\r\n',
_pendingEncoding: 'latin1',
server: null,
_server: null,
parser: ...,
_httpMessage: ClientRequest {
_events: [Object: null prototype],
_eventsCount: 3,
_maxListeners: undefined,
outputData: [],
outputSize: 0,
writable: true,
destroyed: true,
_last: false,
chunkedEncoding: false,
shouldKeepAlive: true,
maxRequestsOnConnectionReached: false,
_defaultKeepAlive: true,
useChunkedEncodingByDefault: true,
sendDate: false,
_removedConnection: false,
_removedContLen: false,
_removedTE: false,
strictContentLength: false,
_contentLength: 0,
_hasBody: true,
_trailer: '',
finished: true,
_headerSent: true,
_closed: false,
_header: 'PUT /latest/api/token HTTP/1.1\r\n' +
'x-aws-ec2-metadata-token-ttl-seconds: 21600\r\n' +
'Host: 169.254.169.254\r\n' +
'Connection: keep-alive\r\n' +
'Content-Length: 0\r\n' +
'\r\n',
_keepAliveTimeout: 0,
_onPendingData: [Function: nop],
agent: [Agent],
socketPath: undefined,
method: 'PUT',
maxHeaderSize: undefined,
insecureHTTPParser: undefined,
joinDuplicateHeaders: undefined,
path: '/latest/api/token',
_ended: false,
res: null,
aborted: false,
timeoutCb: null,
upgradeOrConnect: false,
parser: [HTTPParser],
maxHeadersCount: null,
reusedSocket: false,
host: '169.254.169.254',
protocol: 'http:',
...
},
...
}
]
```
In some environments (e.g., GitHub workflows), this causes the Node process to hang.
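As a side note, the leaked socket can also be observed without the internal `_getActiveHandles()` call, via the documented `process.getActiveResourcesInfo()` API (Node.js 17.3+); a minimal sketch:

```ts
import { MetadataService } from "@aws-sdk/ec2-metadata-service";

async function main() {
  try {
    await new MetadataService({ httpOptions: { timeout: 1000 } }).fetchMetadataToken();
  } catch {
    // errors are expected off EC2; we only care about what stays open afterwards
  }
  // Documented counterpart to _getActiveHandles(): on an affected run this list
  // still contains an entry for the open TCP socket.
  console.log("active resources:", process.getActiveResourcesInfo());
}

main().catch(console.error);
```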
Regression Issue
- [x] Select this option if this issue appears to be a regression.
SDK version number
@aws-sdk/[email protected]
Which JavaScript Runtime is this issue in?
Node.js
Details of the browser/Node.js/ReactNative version
Node version: v24.11.1
Reproduction Steps
Run the code above.
Observed Behavior
There are open sockets at the end of the program execution.
Expected Behavior
All sockets should have been closed.
Possible Solution
No response
Additional Information/Context
No response
> e.g., GitHub workflows

Probably only GitHub workflows running on CodeBuild runners (i.e., this will happen on anything that runs on EC2)

> Probably only GitHub workflows running on CodeBuild runners (i.e., this will happen on anything that runs on EC2)

Do you mean ECS?

Red herring! We in fact only see this on naked GitHub Actions machines.
Here's a minimal example that shows the hangup in GitHub:
https://github.com/otaviomacedo/node-hangup-sdk/actions/runs/19927783246/job/57131773612?pr=2
At the end of the output:
```
PASS test/node-hangup-sdk.test.ts (10.111 s)
✓ Should not hang up (34 ms)

Test Suites: 1 passed, 1 total
Tests: 1 passed, 1 total
Snapshots: 0 total
Time: 10.324 s
Ran all test suites.

Jest did not exit one second after the test run has completed.

This usually means that there are asynchronous operations that weren't stopped in your tests. Consider running Jest with `--detectOpenHandles` to troubleshoot this issue.
```
I've made a discovery:
- On GitHub Actions, they're running a routable version of the IMDS endpoint (169.254.169.254) that always returns a 400 (see the probe sketch below).
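For what it's worth, this is easy to probe directly from a runner with the same request the SDK sends; a minimal sketch, assuming Node 18+ with the built-in fetch:

```ts
// Probe the IMDSv2 token endpoint the same way the SDK does. Per the discovery
// above, on a GitHub-hosted runner the request connects but comes back with a
// 400 instead of a token.
const response = await fetch("http://169.254.169.254/latest/api/token", {
  method: "PUT",
  headers: { "x-aws-ec2-metadata-token-ttl-seconds": "21600" },
});
console.log(response.status, await response.text());
```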
So here's the difference:
| Machine | NodeHttpHandler | MetadataService | Hangs? |
|---|---|---|---|
| Our Macs | 🛑 fails (IMDS not routable) | 🛑 Fails | 👍 No |
| EC2 instance | ✅ succeeds with 200 | ✅ succeeds | 👍 No |
| GitHub Actions | ✅ succeeds with 400 | 🛑 Fails with an error | 😡 Yes |
The issue seems to lie at the boundary of the NodeHttpHandler successfully fetching an HTTP failure page, and the MetadataService turning that into a failure. The difference is that in a 200 case, the response body gets consumed but in a 400 case it doesn't.
And indeed, adding this line:
Fixes the issue.
Not consuming the response body keeps the socket alive, which in turn keeps the Node process alive.
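For illustration, this is roughly what consuming the body on an error response looks like with plain Node http; a sketch of the general pattern, not the SDK's actual change:

```ts
import { request } from "node:http";

// When a response comes back with an error status, its body still has to be
// consumed (or the stream destroyed); otherwise the response never ends and the
// keep-alive socket can keep the process alive.
const req = request(
  { host: "169.254.169.254", path: "/latest/api/token", method: "PUT" },
  (res) => {
    if ((res.statusCode ?? 0) >= 400) {
      res.resume(); // drain and discard the body so the socket gets released
      return;
    }
    res.on("data", () => { /* consume the token */ });
  }
);
req.end();
```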
With this information, this is now easy to reproduce on a developer machine:

```sh
$ env AWS_EC2_METADATA_SERVICE_ENDPOINT=https://google.com/urldoesntexist npx jest
```

This makes it reproduce with Otavio's repository on my machine.
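For reference, a minimal Jest test along these lines shows the symptom; this is a sketch of what such a test might look like, not necessarily the exact contents of that repository:

```ts
import { MetadataService } from "@aws-sdk/ec2-metadata-service";

// Run with AWS_EC2_METADATA_SERVICE_ENDPOINT pointing at an endpoint that
// answers with a non-2xx status. The test itself passes, but Jest then warns
// that it "did not exit one second after the test run has completed" because
// of the leaked socket.
test("Should not hang up", async () => {
  const metadataService = new MetadataService({ httpOptions: { timeout: 1000 } });
  await expect(metadataService.fetchMetadataToken()).rejects.toBeDefined();
});
```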
What is odd here is that the MetadataService hasn't seen changes for over a year, so it must be some recent changes to the NodeHttpHandler that are now triggering this behavior.
Relatedly, I've been looking at the NodeHttpHandler and I noticed the handling of error cases could be a bit cleaner.
Specifically, when I introduce console.log()s in there to see what's happening, I notice that the promise gets rejected more than once: once by the NodeHttpHandler itself, and then again when an ECONNRESET error comes out of the underlying socket.
Concretely, what is currently being done is this:
```ts
req.destroy();
reject(someError);                   // Rejection 1 with someError
req.on("error", (e) => reject(e));   // Fires again with ECONNRESET
```
Whereas I believe what you are supposed to do is:
```ts
req.destroy(someError);
req.on("error", (e) => reject(e));   // Fires once with someError
```
I don't think this is necessarily the cause of any leaks or hangs: I've been fixing it up and looking at the diff afterwards, and everything looks like it's being closed correctly regardless, and a second resolve/reject is ignored anyway. But it feels like resource management mistakes could easily slip in here. (*)
(*) Only looked at NodeHttpHandler, not NodeHttp2Handler
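To make the suggested pattern concrete, here is a small self-contained sketch (illustrative names, not the SDK's actual code) of wrapping a request so that destroying it with the error yields exactly one rejection:

```ts
import { request, type RequestOptions, type IncomingMessage } from "node:http";

// Pass the error to req.destroy() and let a single "error" listener reject,
// instead of rejecting eagerly and then seeing a second ECONNRESET rejection
// from the socket teardown.
function requestWithTimeout(options: RequestOptions, timeoutMs: number): Promise<IncomingMessage> {
  return new Promise((resolve, reject) => {
    const req = request(options, resolve);
    req.on("error", reject); // fires with whatever error the request was destroyed with
    req.setTimeout(timeoutMs, () => {
      // destroy(err) tears the request down and emits "error" with err
      req.destroy(new Error(`Request timed out after ${timeoutMs} ms`));
    });
    req.end();
  });
}
```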
The fix was released in v3.946.0.