Inconsistent "Server certificate CA fingerprint does not match the value configured in caFingerprint" (potential race condition?)
🐛 Bug report
We have an application under development that uses self-hosted Elasticsearch with self-signed certificates; the clients connect over TLS and verify the server via CA fingerprints. However, we are running into what appears to be a bug in the library, or potentially in Elasticsearch itself. After several hours of testing, the issue does not reproduce consistently.
To reproduce
I have uploaded a repo that is a stripped-down proof of concept of the issue, based on our application code:
https://github.com/the-gabe/elastic-failure/tree/main
Usage instructions:
curl -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.15.0-linux-x86_64.tar.gz
bsdtar xvf elasticsearch-8.15.0-linux-x86_64.tar.gz
cd elasticsearch-8.15.0
./bin/elasticsearch
Note down the CA fingerprint and password printed in the terminal.
git clone https://github.com/the-gabe/elastic-failure
cd elastic-failure
Edit packages/indexer/vars.bash so that ELASTIC_QUEUE_PASSWORD, ELASTIC_VECTOR_PASSWORD, ELASTIC_QUEUE_FINGERPRINT, and ELASTIC_VECTOR_FINGERPRINT reflect the password and CA fingerprint you noted down.
cd packages/indexer
npm ci --no-scripts
npm run build
bash vars.bash
Observe the output in the terminal: both clients are able to obtain the Elasticsearch version just fine, but after that you get a caFingerprint failure. Sample output is included in the root of the repo, here: https://github.com/the-gabe/elastic-failure/blob/main/logoutput.txt That file was created on the Arch Linux environment described below, with Elasticsearch 8.15.0. On Azure App Service, we were using Elasticsearch 8.14.3-1 on RHEL 9.
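For context, here is a minimal sketch of roughly how the two clients in the repro are configured (the node URL, variable names, and env-var handling are assumptions; see the repo for the actual code):

import { Client } from '@elastic/elasticsearch'

// Two separate clients, each verifying the server via the CA fingerprint.
const queueClient = new Client({
  node: 'https://localhost:9200',
  auth: { username: 'elastic', password: process.env.ELASTIC_QUEUE_PASSWORD },
  caFingerprint: process.env.ELASTIC_QUEUE_FINGERPRINT,
})

const vectorClient = new Client({
  node: 'https://localhost:9200',
  auth: { username: 'elastic', password: process.env.ELASTIC_VECTOR_PASSWORD },
  caFingerprint: process.env.ELASTIC_VECTOR_FINGERPRINT,
})

// Both info() calls succeed at first; later requests intermittently fail
// with the caFingerprint mismatch error quoted in the title.
console.log((await queueClient.info()).version.number)
console.log((await vectorClient.info()).version.number)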
I have found this issue to be reproducible around 30-40% of the time, though that figure is a rough estimate rather than the result of systematic testing. Starting with a fresh elasticsearch-8.15.0 folder seems to help, but that may be coincidence. My (speculative) suspicion is a race condition.
Expected behavior
This simply should not happen.
Node.js version
Node.js v22.6.0 on Arch Linux, v20.11.1 on Azure App Service, v20.16.0 on Debian 12
@elastic/elasticsearch version
8.15.0
Operating system
Arch Linux on WSL2, Debian 11 on Azure App Service, Debian 12
Any other relevant environment information
No response
Additionally, we patched the library to see what was going on in the fingerprint comparison. Logs are attached here:
https://github.com/the-gabe/elastic-failure/blob/main/appservice-hackedlib.txt (Note: for clarity w.r.t line numbers, this log was run with our actual application, not the code in the git repo)
We modified node_modules/@elastic/transport/lib/connection/UndiciConnection.js
Here is a snippet of how it looked.
if (this[symbols_1.kCaFingerprint] !== null) {
    const caFingerprint = this[symbols_1.kCaFingerprint];
    const connector = (0, undici_1.buildConnector)(((_a = this.tls) !== null && _a !== void 0 ? _a : {}));
    undiciOptions.connect = function (opts, cb) {
        connector(opts, (err, socket) => {
            if (err != null) {
                return cb(err, null);
            }
            if (caFingerprint !== null && isTlsSocket(opts, socket)) {
                const issuerCertificate = (0, BaseConnection_1.getIssuerCertificate)(socket);
                /* istanbul ignore next */
                if (issuerCertificate == null) {
                    socket.destroy();
                    return cb(new Error('Invalid or malformed certificate'), null);
                }
                // Check if fingerprint matches
                /* istanbul ignore else */
                console.log("this is what we provided to the lib " + caFingerprint);
                console.log("This is what was pulled from socket " + issuerCertificate.fingerprint256);
                if (caFingerprint !== issuerCertificate.fingerprint256) {
                    socket.destroy();
                    return cb(new Error("Server certificate CA fingerprint does not match the value configured in caFingerprint"), null);
                }
            }
            return cb(null, socket);
        });
    };
}
And for the sake of 100% clarity: we triple-checked that "4F:57:DA:6A:80:46:C5:9F:BD:9E:49:78:BA:26:A2:FC:39:1D:32:B7:63:6C:7D:96:82:6A:1E:C5:BE:24:26:48" is the correct CA fingerprint. We verified it several times with openssl x509 -fingerprint -sha256 -in /etc/elasticsearch/certs/http_ca.crt | grep Fingerprint, and we have other applications using it without issue.
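As an extra sanity check (not part of the original report), the same fingerprint can be computed in Node itself; this is a sketch assuming the certificate path used above:

import { readFileSync } from 'node:fs'
import { X509Certificate } from 'node:crypto'

// Load the Elasticsearch HTTP CA certificate and print its SHA-256 fingerprint
// in the same colon-separated format the client compares against caFingerprint.
const pem = readFileSync('/etc/elasticsearch/certs/http_ca.crt')
const caCert = new X509Certificate(pem)
console.log(caCert.fingerprint256)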
Just to rule it out: it wouldn't have anything to do with this change, would it?
Hi @JoshMock, I don't think so; the fingerprints actually taken from the socket are coming back undefined.
@JoshMock We confirmed that this is not related; we have also tested with @elastic/transport 8.7.0 instead of 8.7.1.
Got it, I didn't look at the logs closely enough to see that it was undefined. Definitely not related.
Hi,
I'm also having a similar issue. I'm getting the error "Unhandled Rejection at: Promise [object Promise] reason ConnectionError: Invalid or malformed certificate" with a valid caFingerprint: the same fingerprint works with the Python client, but with the JS client it results in this error. I'm using 8.15.0 and Node 22.
Hi @JoshMock, have you managed to look into this? It is now impacting our production environments for this application, and that's not a situation we are comfortable with. This is quite literally mission-critical functionality of the library (being able to connect to Elasticsearch securely, in an encrypted and authenticated fashion). Is any progress being made on this bug, perhaps in a private capacity?
No action has been taken yet, @the-gabe. I'm Elastic's only active maintainer of this project, and I've been either on PTO or occupied with higher priorities for the last few weeks. I will take a look as soon as I have time.
If you need a fix more urgently, pull requests are always welcome. I am typically able to review and merge a PR within a couple of working days if it has tests and all CI checks are passing.
This issue is stale because it has been open 90 days with no activity. Remove the stale label, or leave a comment, or this will be closed in 14 days.
Definitely able to reproduce this with your example code. It appears to happen when tlsSocket.getPeerCertificate() returns an empty object, which means that the peer is not providing a certificate. Even when the request is retried repeatedly, a peer certificate is never provided in most cases.
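To illustrate that observation, here is a minimal sketch (host and port are assumptions) of how an empty result from getPeerCertificate() can be detected on a raw TLS socket:

import tls from 'node:tls'

const socket = tls.connect({ host: 'localhost', port: 9200, rejectUnauthorized: false }, () => {
  // Passing `true` includes the issuer chain, which is what the transport
  // walks to find the CA certificate's fingerprint.
  const peerCert = socket.getPeerCertificate(true)
  if (Object.keys(peerCert).length === 0) {
    console.log('the peer did not provide a certificate on this socket')
  } else {
    console.log('issuer fingerprint256:', peerCert.issuerCertificate?.fingerprint256)
  }
  socket.end()
})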
At first it seemed like it might have something to do with instantiating multiple clients with the same TLS configuration. I seemed to have a much better success rate when vectorClient was a child of queueClient (e.g. const vectorClient = queueClient.child(vectorConfig)).
Then I started running tests where I instantiated or reused the same client instances in a loop to run several requests in sequence. It seems that, no matter how many clients are instantiated, only the first one or two requests will succeed and the rest will fail.
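A rough sketch of that loop test (client options and request count are assumptions, not the exact test code):

import { Client } from '@elastic/elasticsearch'

const client = new Client({
  node: 'https://localhost:9200',
  auth: { username: 'elastic', password: process.env.ELASTIC_PASSWORD },
  caFingerprint: process.env.ELASTIC_CA_FINGERPRINT,
})

// Reuse the same client for several sequential requests; with the bug present,
// only the first request or two succeed before the fingerprint check starts failing.
for (let i = 0; i < 10; i++) {
  try {
    const info = await client.info()
    console.log(`request ${i}: ok (version ${info.version.number})`)
  } catch (err) {
    console.log(`request ${i}: failed (${err.message})`)
  }
}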
The good news: I can reproduce this problem in a unit test in @elastic/transport.
The bad news: I still have no idea why it's happening.
Hopefully better news to come once I investigate more.
Turns out this is by design in Node.js! Got a fix in https://github.com/elastic/elastic-transport-js/pull/197, which describes the whole situation. Once that merges, I'll publish a patch version of the transport, which your next npm install should pick up automatically.
@elastic/transport v8.9.2 is now out!
@JoshMock Thank you so much! This is extremely helpful :)