mongo-oplog icon indicating copy to clipboard operation
mongo-oplog copied to clipboard

MongoError: operation exceeded time limit

Open doktoroblivion opened this issue 7 years ago • 7 comments

Anyone seeing the above error on the oplog cursor? My configuration is I have a remote machine which is my oplog source, and my oplog application is running locally pointing to a local mongoDB instance. After some time and many insert, update and delete events, I will eventually get the following. I am using mongoDB 3.4.x locally and my remote machine is 3.2.8.

The following appears to occur sometime when questions of connection and query length reach the internal MongoDB limit cursor.maxTimeMS. Can anyone tell me or suggest a sufficient reason that the code would get this? stream.js appears to be correctly coded, is there an issue elsewhere?

    tailable: true,
    awaitData: true,
    oplogReplay: true,
    noCursorTimeout: true,

Error is:

  startOplog 0] ------- Error ------- +10s
  startOplog-details 0] Error object: { MongoError: operation exceeded time limit
    at Function.MongoError.create (/home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/error.js:31:11)
    at /home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:497:72
    at authenticateStragglers (/home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:443:16)
    at Connection.messageHandler (/home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:477:5)
    at TLSSocket.<anonymous> (/home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/connection.js:331:22)
    at emitOne (events.js:115:13)
    at TLSSocket.emit (events.js:210:7)
    at addChunk (_stream_readable.js:252:12)
    at readableAddChunk (_stream_readable.js:239:11)
    at TLSSocket.Readable.push (_stream_readable.js:197:10)
  name: 'MongoError',
  message: 'operation exceeded time limit',
  ok: 0,
  errmsg: 'operation exceeded time limit',
  code: 50 } +10s
  startOplog 0] ------- Error ------- +1ms
  mongo-oplog stream ended +1ms
  startOplog 0] ------- End ------- +0ms
  startOplog-details 0] End object: undefined +0ms
  startOplog 0] ------- End ------- +0ms
  mongo-oplog oplog error {"name":"MongoError","message":"operation exceeded time limit","ok":0,"errmsg":"operation exceeded time limit","code":50} +1ms
  startOplog 2] ------- Error ------- +1ms

doktoroblivion avatar Jan 03 '18 17:01 doktoroblivion

I have made the following changes to stream.js which have made the symptoms better, but not perfect.

  async function time() {
    if (ts) return (typeof ts !== 'number') ? ts : Timestamp(0, ts)

    const doc = await coll
      .find({ }, { ts: 1 })
      .sort({ $natural: -1 })
      .limit(1)
      .nextObject()

    //return doc ? doc.ts : Timestamp(0, (Date.now() / 1000 | 0))
    return doc ? doc.ts : Timestamp(0, 0)
  }

  if (ns) query.ns = { $regex: regex(ns) }
  query.ts = { $gt: await time() }

  return (await coll.find(query, {
    "tailable": true,
    "awaitData": true,
    "oplogReplay": true,
    "noCursorTimeout": true,
    "maxTimeMS": 200,
    "numberOfRetries": Number.MAX_VALUE
  })).stream()

I am also trying the following, but do not have enough data points on how that will do yet.

  return (await coll.find(query, {
    "tailable": true,
    "awaitData": true,
    "oplogReplay": true,
    "maxTimeMS": 100,
    "numberOfRetries": Number.MAX_VALUE
  }).addCursorFlag('noCursorTimeout',true)).stream()

doktoroblivion avatar Jan 03 '18 20:01 doktoroblivion

So I chased this far enough to know that in the mongodb npm driver, collection.js checks for and translates timeout, ie.

  // Translate to new command option noCursorTimeout
  if(typeof newOptions.timeout == 'boolean') newOptions.noCursorTimeout = newOptions.timeout;

Therefore I made the following change to see if that would help at all.

    const doc = await coll
      .find({ }, { ts: 1, timeout: false })
      .sort({ $natural: -1 })
      .limit(1)
      .nextObject()

    //return doc ? doc.ts : Timestamp(0, (Date.now() / 1000 | 0))
    return doc ? doc.ts : Timestamp(0, 0)
  }

  if (ns) query.ns = { $regex: regex(ns) }
  query.ts = { $gt: await time() }

  return (await coll.find(query, {
    "tailable": true,
    "awaitData": true,
    "oplogReplay": true,
    "maxTimeMS": 100,
    "timeout": false,
    "numberOfRetries": Number.MAX_VALUE
  })).stream()

After some time it still throws the error, not on just one of my oplog streams, but all.

  startOplog 1] ------- End ------- +0ms
  startOplog-details +0ms] NO End object data
  startOplog 1] ------- End ------- +0ms
  mongo-oplog oplog error {"name":"MongoError","message":"operation exceeded time limit","ok":0,"errmsg":"operation exceeded time limit","code":50} +0ms
  startOplog 0] ------- Error ------- +0ms
  startOplog-details 0] Error object: { MongoError: operation exceeded time limit
    at Function.MongoError.create (/home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/error.js:31:11)
    at /home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:497:72
    at authenticateStragglers (/home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:443:16)
    at Connection.messageHandler (/home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:477:5)
    at TLSSocket.<anonymous> (/home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/connection.js:333:22)
    at emitOne (events.js:115:13)
    at TLSSocket.emit (events.js:210:7)
    at addChunk (_stream_readable.js:252:12)
    at readableAddChunk (_stream_readable.js:239:11)
    at TLSSocket.Readable.push (_stream_readable.js:197:10)
  name: 'MongoError',
  message: 'operation exceeded time limit',
  ok: 0,
  errmsg: 'operation exceeded time limit',
  code: 50 } +0ms

doktoroblivion avatar Jan 04 '18 22:01 doktoroblivion

Next steps I hacked mongodb npm module, lib/collection.js

  // Translate to new command option noCursorTimeout
  if(typeof newOptions.timeout == 'boolean') newOptions.noCursorTimeout = newOptions.timeout;
  newOptions.noCursorTimeout = false;

  // Merge in options to command
  for(var name in newOptions) {
    if(newOptions[name] != null) {
 console.log("coll.js opt[%s]: %s", name, newOptions[name]);
      findCommand[name] = newOptions[name];
    }
  }

I see its getting set, not sure about its use yet:

coll.js opt[readPreference]: [object Object]
coll.js opt[tailable]: true
coll.js opt[awaitData]: true
coll.js opt[oplogReplay]: true
coll.js opt[maxTimeMS]: 100
coll.js opt[timeout]: false
coll.js opt[numberOfRetries]: 1.7976931348623157e+308
coll.js opt[skip]: 0
coll.js opt[limit]: 0
coll.js opt[slaveOk]: true
coll.js opt[noCursorTimeout]: false

doktoroblivion avatar Jan 04 '18 22:01 doktoroblivion

I think I found the real issue, the index.js code is not restarting the tail on timeout. So I am testing the following.

  function onerror(err) {
    if (/cursor (killed or )?timed out/.test(err.message)) {
      debug('cursor timeout - re-tailing %j', err)
      tail()
    } else if (/exceeded time limit/.test(err.message)) {
      debug('cursor exceeded timeout - re-tailing %j', err)
      tail()
    } else {
      debug('oplog error %j', err)
      oplog.emit('error', err)
    }
  }

doktoroblivion avatar Jan 08 '18 13:01 doktoroblivion

The above appears to work, I will create a PR to contrib this back. Since I am using TLS connections across the wire this may have something to do with you missing this particular cursor error, as well as perhaps the versions of node and npm I am using.

doktoroblivion avatar Jan 10 '18 12:01 doktoroblivion

Hey, did you ever create a PR? Was this issue solved? Seeing some problems with oplog tailing here as well, and just want to make sure which path to chase down on.

heipei avatar May 09 '18 07:05 heipei

@heipei No, I never created a PR for the above, but the above does appear to work for us. Its strange since I am not sure what triggers the other message and have not been able to track down the cause, other than possible internal private network contstraints (we do not use external IPs with oplog).

doktoroblivion avatar May 14 '18 17:05 doktoroblivion