Teleport unexpectedly closes some mongo connections
Expected behavior:
A mongo client shouldn't experience a disconnect
Current behavior:
When requesting some objects from MongoDB through Teleport, the mongo client will disconnect with an error similar to the following:
[direct: mongos] client_production> db.getCollection('MyCollection').find({})
Uncaught:
MongoNetworkError: connection 4 to 10.0.0.22:443 closed
at Connection.onClose (/opt/homebrew/Cellar/mongosh/1.5.1/libexec/lib/node_modules/@mongosh/cli-repl/node_modules/mongodb/lib/cmap/connection.js:138:19)
at TLSSocket.<anonymous> (/opt/homebrew/Cellar/mongosh/1.5.1/libexec/lib/node_modules/@mongosh/cli-repl/node_modules/mongodb/lib/cmap/connection.js:65:46)
at TLSSocket.emit (node:events:527:28)
at TLSSocket.emit (node:domain:475:12)
at node:net:709:12
at TCP.done (node:_tls_wrap:582:7)
at TCP.callbackTrampoline (node:internal/async_hooks:130:17)
client_production> exit
In the Teleport agent that is connected to the mongo database, the following stack trace appears:
ERRO [DB:SERVIC] Failed to handle connection. addr:10.0.0.22:443 error:[
ERROR REPORT:
Original Error: topology.ConnectionError connection(mongos-3.prod.mongo:27017[-199]) incomplete read of full message: local error: tls: bad record MAC
Stack Trace:
/go/src/github.com/gravitational/teleport/lib/srv/db/mongodb/protocol/message.go:84 github.com/gravitational/teleport/lib/srv/db/mongodb/protocol.ReadServerMessage
/go/src/github.com/gravitational/teleport/lib/srv/db/mongodb/engine.go:130 github.com/gravitational/teleport/lib/srv/db/mongodb.(*Engine).handleClientMessage
/go/src/github.com/gravitational/teleport/lib/srv/db/mongodb/engine.go:96 github.com/gravitational/teleport/lib/srv/db/mongodb.(*Engine).HandleConnection
/go/src/github.com/gravitational/teleport/lib/srv/db/server.go:758 github.com/gravitational/teleport/lib/srv/db.(*Server).handleConnection
/go/src/github.com/gravitational/teleport/lib/srv/db/server.go:668 github.com/gravitational/teleport/lib/srv/db.(*Server).HandleConnection
/go/src/github.com/gravitational/teleport/lib/reversetunnel/transport.go:284 github.com/gravitational/teleport/lib/reversetunnel.(*transport).start
/opt/go/src/runtime/asm_amd64.s:1581 runtime.goexit
User Message: connection(mongos-3.prod.mongo:27017[-199]) incomplete read of full message: local error: tls: bad record MAC] db/server.go:670
This only happens when certain objects are retrieved; larger objects appear more likely to trigger the issue.
For example, given these hypothetical objects:
[direct: mongos] client_tech> bsonsize(db.MyCollection.findOne({ _id: ObjectId("4d6f6e205365702031322031")}))
38648
[direct: mongos] client_tech> bsonsize(db.MyCollection.findOne({ _id: ObjectId("373a34353a32352045445420")}))
932384
and through Teleport:
[direct: mongos] client_tech> bsonsize(db.MyCollection.findOne({ _id: ObjectId("4d6f6e205365702031322031")}))
38648
[direct: mongos] client_tech> bsonsize(db.MyCollection.findOne({ _id: ObjectId("373a34353a32352045445420")}))
MongoNetworkError: connection 7 to 10.0.0.22:443 closed
Bug details:
- Teleport version - 9.3.8
- Mongo version - apt list --installed shows mongodb-org-mongos/xenial,now 3.6.14 amd64 [installed,upgradable to: 3.6.23]
- OS - MongoDB is running on Ubuntu 16.04. They are aware the versions are older and are working with their application developers to get their code to work with newer mongo versions.
- Recreation steps - not 100% validated, but it is likely that this specific mongo version on Ubuntu 16.04 with the same libssl1.0.0/xenial-updates,xenial-security,now 1.0.2g-1ubuntu4.20 amd64 [installed] will experience a similar symptom when retrieving an object approaching 1 MiB (see the sketch below).
gz#6677
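For reproduction attempts, a minimal mongosh sketch along these lines generates a document approaching the problematic size (the database, collection, and field names here are hypothetical):

use client_production
// insert a single document whose BSON size is roughly 1 MiB
db.MyCollection.insertOne({ _id: ObjectId(), payload: 'x'.repeat(1024 * 1024) })
// reading it back through Teleport is what triggers the disconnect
bsonsize(db.MyCollection.findOne({ payload: { $exists: true } }))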
@programmerq I've tried to perform a findOne that returns a document of about 5 MB and a find in a collection with 100 MB of data. Both returned successfully. Additionally, I've introduced network issues such as latency, corruption, and loss, but had no success reproducing the problem.
I have some questions that might help me set up a closer environment:
- You briefly mentioned it, but are you using separate instances for the proxy, auth server, and database agent?
- Are you using TLS routing?
- About the client, are there any non-default options set? I've tried different mongosh versions, but it seems to make no difference;
- Is there any specific MongoDB configuration set? (I've tested with the default configuration);
- Are those queries you're running taking a long time to return, or do they return almost instantly?
Besides that, I've seen this error before. Usually, it happens due to packet corruption or drops and could be related to network issues (even though I couldn't reproduce it with injected failures). Do you know if MongoDB or any Teleport component is deployed behind a load balancer or proxy? (I haven't tried those scenarios yet, but it might be worth trying.)
Any additional information you can provide would be helpful. Thanks!
I'll share the partial answers that I have. I'm simultaneously reaching out to get the rest of the answers.
- You briefly mentioned it, but are you using separate instances for the proxy, auth server, and database agent?
I know the database agent is running as a separate process from the proxy service. I don't know for certain whether the auth service is running alongside the proxy service.
- Are you using TLS routing?
Will check and get back. I know the mongosh client is communicating via port 443, so I believe it is multiplexing.
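For reference, TLS routing is what multiplexes database traffic over the proxy's web port (443 here). A minimal teleport.yaml sketch of that mode, not necessarily this cluster's actual config:

version: v2
auth_service:
  proxy_listener_mode: multiplexing
proxy_service:
  web_listen_addr: 0.0.0.0:443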
- About the client, are there any non-default options set? I've tried different mongosh versions, but it seems to make no difference;
Will check on the options set and get back. It's mongosh 1.5.1 installed via brew on a Mac.
- Is there any specific MongoDB configuration set? (I've tested with the default configuration);
Will check and get back.
- Are those queries you're running taking a long time to return, or do they return almost instantly?
The log entries for the query and the subsequent failure appear almost immediately; both events are logged within the same second (the log has no sub-second resolution).
mongod.cfg:
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log
net:
  port: 27017
  bindIp: 127.0.0.1
processManagement:
  timeZoneInfo: /usr/share/zoneinfo
mongos.cfg:
systemLog:
  verbosity: 0
  quiet: false
  logAppend: true
  logRotate: reopen
  destination: file
  path: /var/log/mongodb/mongos.log
processManagement:
  fork: false
  pidFilePath: /var/lib/mongodb/mongos.log
net:
  port: 27017
  bindIpAll: true
  maxIncomingConnections: 51200
  wireObjectCheck: true
  transportLayer: asio
  serviceExecutor: adaptive
  ssl:
    mode: allowSSL
    PEMKeyFile: /etc/certs/mongo.crt
    CAFile: /etc/certs/mongo.cas
    allowConnectionsWithoutCertificates: true
replication:
  localPingThresholdMs: 15
sharding:
  configDB: config/config-0.staging.mongo:27019,config-1.staging.mongo:27019,config-2.staging.mongo:27019
security:
  keyFile: /etc/keyFile
  clusterAuthMode: keyFile
db admin command:
[direct: mongos] client_tech> db.adminCommand({getParameter:'*'})
{
KeysRotationIntervalSec: 7776000,
KillSessionsMaxConcurrency: 100,
KillSessionsPerHostTimeoutMS: 60000,
ShardingTaskExecutorPoolHostTimeoutMS: 300000,
ShardingTaskExecutorPoolMaxConnecting: 2,
ShardingTaskExecutorPoolMaxSize: -1,
ShardingTaskExecutorPoolMinSize: 1,
ShardingTaskExecutorPoolRefreshRequirementMS: 60000,
ShardingTaskExecutorPoolRefreshTimeoutMS: 20000,
adaptiveServiceExecutorIdlePctThreshold: 60,
adaptiveServiceExecutorMaxQueueLatencyMicros: 500,
adaptiveServiceExecutorRecursionLimit: 8,
adaptiveServiceExecutorReservedThreads: 2,
adaptiveServiceExecutorRunTimeJitterMillis: 500,
adaptiveServiceExecutorRunTimeMillis: 5000,
adaptiveServiceExecutorStuckThreadTimeoutMillis: 250,
allowRolesFromX509Certificates: true,
authSchemaVersion: 5,
authenticationMechanisms: [ 'MONGODB-CR', 'MONGODB-X509', 'SCRAM-SHA-1' ],
clientCursorMonitorFrequencySecs: 4,
clusterAuthMode: 'keyFile',
connPoolMaxConnsPerHost: 200,
connPoolMaxInUseConnsPerHost: 2147483647,
connPoolMaxShardedConnsPerHost: 200,
connPoolMaxShardedInUseConnsPerHost: 2147483647,
cursorTimeoutMillis: Long("600000"),
debugCollectionUUIDs: false,
diagnosticDataCollectionDirectoryPath: '/var/log/mongodb/mongos.diagnostic.data',
diagnosticDataCollectionDirectorySizeMB: 200,
diagnosticDataCollectionEnabled: true,
diagnosticDataCollectionFileSizeMB: 10,
diagnosticDataCollectionPeriodMillis: 1000,
diagnosticDataCollectionSamplesPerChunk: 300,
diagnosticDataCollectionSamplesPerInterimUpdate: 10,
disableLogicalSessionCacheRefresh: false,
disableNonSSLConnectionLogging: false,
disabledSecureAllocatorDomains: [],
enableLocalhostAuthBypass: true,
enableTestCommands: false,
globalConnPoolIdleTimeoutMinutes: 2147483647,
heapProfilingEnabled: false,
heapProfilingSampleIntervalBytes: Long("262144"),
honorSystemUmask: false,
internalDocumentSourceCursorBatchSizeBytes: 4194304,
internalDocumentSourceLookupCacheSizeBytes: 104857600,
internalGeoNearQuery2DMaxCoveringCells: 16,
internalGeoPredicateQuery2DMaxCoveringCells: 16,
internalInsertMaxBatchSize: 64,
internalLookupStageIntermediateDocumentMaxSizeBytes: Long("104857600"),
internalProhibitShardOperationRetry: false,
internalQueryAlwaysMergeOnPrimaryShard: false,
internalQueryCacheEvictionRatio: 10,
internalQueryCacheFeedbacksStored: 20,
internalQueryCacheSize: 5000,
internalQueryEnumerationMaxIntersectPerAnd: 3,
internalQueryEnumerationMaxOrSolutions: 10,
internalQueryExecMaxBlockingSortBytes: 33554432,
internalQueryExecYieldIterations: 128,
internalQueryExecYieldPeriodMS: 10,
internalQueryFacetBufferSizeBytes: 104857600,
internalQueryForceIntersectionPlans: false,
internalQueryIgnoreUnknownJSONSchemaKeywords: false,
internalQueryMaxScansToExplode: 200,
internalQueryPlanEvaluationCollFraction: 0.3,
internalQueryPlanEvaluationMaxResults: 101,
internalQueryPlanEvaluationWorks: 10000,
internalQueryPlanOrChildrenIndependently: true,
internalQueryPlannerEnableHashIntersection: false,
internalQueryPlannerEnableIndexIntersection: true,
internalQueryPlannerGenerateCoveredWholeIndexScans: false,
internalQueryPlannerMaxIndexedSolutions: 64,
internalQueryProhibitBlockingMergeOnMongoS: false,
internalQueryProhibitMergingOnMongoS: false,
internalQueryS2GeoCoarsestLevel: 0,
internalQueryS2GeoFinestLevel: 23,
internalQueryS2GeoMaxCells: 20,
journalCommitInterval: 0,
localLogicalSessionTimeoutMinutes: 30,
logComponentVerbosity: {
verbosity: 0,
accessControl: { verbosity: -1 },
command: { verbosity: -1 },
control: { verbosity: -1 },
executor: { verbosity: -1 },
geo: { verbosity: -1 },
index: { verbosity: -1 },
network: {
verbosity: -1,
asio: { verbosity: -1 },
bridge: { verbosity: -1 }
},
query: { verbosity: -1 },
replication: {
verbosity: -1,
heartbeats: { verbosity: -1 },
rollback: { verbosity: -1 }
},
sharding: { verbosity: -1, shardingCatalogRefresh: { verbosity: -1 } },
storage: { verbosity: -1, journal: { verbosity: -1 } },
write: { verbosity: -1 },
ftdc: { verbosity: -1 },
tracking: { verbosity: -1 }
},
logLevel: 0,
logUserIds: false,
logicalSessionRefreshMillis: 300000,
maxAcceptableLogicalClockDriftSecs: Long("31536000"),
maxBSONDepth: 200,
maxLogSizeKB: 10,
maxSessions: 1000000,
maxSyncSourceLagSecs: 30,
notablescan: false,
opensslCipherConfig: '',
opensslDiffieHellmanParameters: '',
quiet: false,
replElectionTimeoutOffsetLimitFraction: 0.15,
reportOpWriteConcernCountersInServerStatus: false,
reservedServiceExecutorRecursionLimit: 8,
saslHostName: 'ip-10-240-8-4',
saslServiceName: 'mongodb',
saslauthdPath: '',
scramIterationCount: 10000,
scriptingEngineInterruptIntervalMS: 1000,
shardedConnPoolIdleTimeoutMinutes: 2147483647,
skipShardingConfigurationChecks: false,
sslMode: 'allowSSL',
sslWithholdClientCertificate: false,
startupAuthSchemaValidation: true,
suppressNoTLSPeerCertificateWarning: false,
syncdelay: 60,
synchronousServiceExecutorRecursionLimit: 8,
taskExecutorPoolSize: 0,
tcmallocAggressiveMemoryDecommit: 0,
tcmallocEnableMarkThreadTemporarilyIdle: false,
tcmallocMaxTotalThreadCacheBytes: Long("1073741824"),
traceExceptions: false,
userCacheInvalidationIntervalSecs: 30,
ok: 1,
operationTime: Timestamp({ t: 1663771335, i: 11 }),
'$clusterTime': {
clusterTime: Timestamp({ t: 1663771335, i: 11 }),
signature: {
hash: Binary(Buffer.from("64050457aa956508b9903c2df620f869517350fb", "hex"), 0),
keyId: Long("7101343242678960129")
}
}
}
Could it be related to https://github.com/gravitational/teleport/pull/16017 ?
It seems not to be related: this issue was observed in Teleport 9.3.8, while #16017 was backported only to Teleport v10.
I can confirm the issue has been reproduced with Teleport 10.3.1.
@gpr I would like to ask you to share more details about your setup. We have tried to reproduce it based on the ticket description, but we didn't manage to reproduce this error.
Are you using any kind of load balancer between the tsh/mongo client and the Teleport proxy?
@smallinsky yes, an AWS ELB.
Hey, I managed to reproduce it consistently with the following:
- MongoDB running on an x86_64 image (I've tried arm64 but without success, even when the Ubuntu/MongoDB versions are the same);
- Config server (mongod), two shard servers (mongod), and one router (mongos) — a sketch of this topology follows the list;
- With allowConnectionsWithoutCertificates: true: I'm not sure if I misconfigured something when using a certificate; I'll try to configure it properly later to see if this prevents the issue (it could be related).
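A sketch of one way to stand up that topology locally (hostnames, ports, and dbpaths are hypothetical; on MongoDB 3.6 the config server and each shard must be a replica set, so each needs an rs.initiate() and the shards an sh.addShard() from mongos afterwards):

mongod --configsvr --replSet cfg --port 27019 --dbpath /data/cfg
mongod --shardsvr --replSet sh0 --port 27018 --dbpath /data/sh0
mongod --shardsvr --replSet sh1 --port 27020 --dbpath /data/sh1
mongos --configdb cfg/localhost:27019 --port 27017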
Right now, I'm debugging it to see what is causing the issue. Here is what I have seen already:
- This happens only with large collections (as described in the issue), even if they're located in a single shard;
- Shard logs don't show any error. The router shows only a SocketException: Broken pipe after the Teleport error, so it seems to be an error only in the Teleport -> router communication;
I will look more in-depth and debug the MongoDB driver (as the error surfaces there). I'm not sure if the issue is in the driver itself (I've looked through its issue list and didn't find anything related to this) or in our message-reading flow (more likely).
After some testing, I found that this issue happens even when connecting directly to the mongos instance (without Teleport). I managed to get it to work by changing net.serviceExecutor from adaptive to synchronous.
I couldn't find any issue in the MongoDB tracker that is directly related to this. After some investigation, it seems that with net.serviceExecutor: adaptive the mongos instance uses multiple threads to manage connections, which doesn't seem to play well with this version of OpenSSL and could be causing the bad_record_mac error on the Teleport side.
I suggest changing the configuration to see if it solves the issue for your deployment. It is worth mentioning that the adaptive option is deprecated in MongoDB v4, and net.serviceExecutor is removed in v5 and beyond, so if you're planning to update your instances, this will be the new behavior anyway.
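Concretely, the workaround is a one-line change in the mongos.cfg shown earlier:

net:
  serviceExecutor: synchronous   # was: adaptive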
Let me know if you have any questions. Thanks!
We connected yesterday as I tried this option too. I confirm that it works. Many thanks for the deep analysis, @gabrielcorado; very useful.
Sounds like this is not a Teleport issue and a workaround is available. Closing this; feel free to reopen if it's still an issue.