
Mongoose findOneAndUpdate hangs under heavy load

Open andreialecu opened this issue 1 year ago • 26 comments

Prerequisites

  • [X] I have written a descriptive issue title
  • [X] I have searched existing issues to ensure the bug has not already been reported

Mongoose version

8.5.1

Node.js version

20.17.0

MongoDB server version

7.0.12

Typescript version (if applicable)

No response

Description

We have a batch job that performs a large number of updates. It used to work fine, but recently started hanging on calls to findOneAndUpdate: after a while, the returned promise simply never resolves.

I have been able to create a repro for this, but it only seems to reproduce reliably against MongoDB Atlas (or probably other remote servers). On localhost it doesn't seem to reproduce.

Also if I bypass mongoose entirely and use the mongodb driver directly, it seems to work.

Steps to Reproduce

const mongoose = require('mongoose');
const Parallel = require('async-parallel');
const { setTimeout } = require('timers/promises');

const ObjectId = mongoose.Types.ObjectId;
const Schema = mongoose.Schema;

const testSchema = new Schema({
  fieldA: {
    type: Schema.Types.ObjectId,
    required: true,
  },
  fieldB: {
    type: String,
    required: true,
  },
  fieldC: {
    type: Number,
    default: 0,
  },
  fieldD: {
    subFieldA: [String],
    subFieldB: [String],
    subFieldC: [String],
  },
  fieldK: {
    type: Date,
    default: Date.now,
  },
});

const Test = mongoose.model('Test', testSchema);

async function createObject() {
  await setTimeout(100 + Math.random() * 100);
  return {
    fieldA: new ObjectId(),
    // random hex
    fieldB: new ObjectId().toHexString(),
    fieldC: 0,
    fieldD: {
      // make this big:
      subFieldA: Array.from({ length: 2000 }, () =>
        new ObjectId().toHexString(),
      ),
      subFieldB: Array.from({ length: 2000 }, () =>
        new ObjectId().toHexString(),
      ),
      subFieldC: Array.from({ length: 2000 }, () =>
        new ObjectId().toHexString(),
      ),
    },
  };
}

async function run() {
  await mongoose.connect(
    // 'mongodb://localhost:27017/testmongoose', // could not reproduce on localhost
    'mongodb+srv://....mongodb.net/testmongoose?ssl=true&authSource=admin', // replace with a mongodb atlas connection string
  );

  let tidx = 0;
  let total = 10000;
  await Parallel.each(
    Array.from({ length: total }),
    async (_, i) => {
      const obj = await createObject();
      const result = await Promise.race([
        Test.findOneAndUpdate(
          { fieldB: obj.fieldB, fieldK: { $lt: new Date() } },
          obj,
          {
            upsert: true,
          },
        ),
        // following is faster and doesn't have the issue
        // Test.db.collection('test123').findOneAndUpdate(
        //     { fieldB: obj.fieldB, fieldK: { $lt: new Date() } },
        //     { $set: obj },
        //     {
        //       upsert: true,
        //     },
        //   ),
        setTimeout(30000, new Error('timeout')),
      ]);
      if (result instanceof Error) {
        console.log(`Timeout after ${tidx}/${total}`);
        process.exit(1);
      }
      tidx++;
      console.log(`${tidx}/${total}`);
    },
    { concurrency: 20 },
  );

  console.log('done');
  mongoose.disconnect();
}

run();

Expected Behavior

This must've started happening after some recent update to either mongoose, the mongodb driver, or the MongoDB server version. It's unclear which; we haven't run this job in a while and just noticed it occurring now.

andreialecu avatar Sep 10 '24 13:09 andreialecu

Output of script:

....
1223/10000
1224/10000
Timeout after 1224/10000

If mongoose is bypassed and the commented-out driver code is swapped in instead, the issue doesn't seem to occur.

andreialecu avatar Sep 10 '24 13:09 andreialecu

It seems that if you make the schema bigger/more bloated with additional fields, even if they are not populated, the issue occurs earlier, after about 400 iterations. But this could be a fluke.

andreialecu avatar Sep 10 '24 13:09 andreialecu

I have been fighting this issue and have not been able to go above version 8.2.2. With any higher version I will eventually get hangs when doing around 50+ updates/second. We do have a rather large, complex schema defined. I have also had it hang on simple findOne calls. But I have never reproduced it outside of production.

What I have found in production is that a database connection is checked out from the pool and remains checked out. On the database side, the operation does not show up in db.currentOp. My count of "checked out" connections eventually grows to the limit, while the number of connections (in use or idle) on the db server remains very low.

I assume what I'm experiencing must be a mongodb driver issue, since everything from where the connection is checked out to where it is checked back in happens down in the mongodb driver.

hunkydoryrepair avatar Sep 11 '24 08:09 hunkydoryrepair

Here is a function I'm using to look at the mongodb internals and see whether connections have been checked out from the connection pool. I suspected a connection pool issue, but that proved to be wrong. Could you check whether you are leaking connections from the connection pool in your tests?

const connectionStats = () => {
  const client = mongoose.connection?.getClient();
  if (!client) return `Mongo Client Not Loaded`;
  try {
    // reach into the driver's private topology state to read pool counters
    const topology = (client as any).topology?.s;
    let s = `Mongo: ${topology?.state}`;
    topology?.servers?.forEach((server: any, key: string) => {
      s += `<br>${key}: `;
      // parentheses matter here: without them, `+` binds before `??`
      s += ' CO: ' + (server.pool?.currentCheckedOutCount ?? '?');
      s += ' P: ' + (server.pool?.pendingConnectionCount ?? '?');
      s += ' Q: ' + (server.pool?.waitQueueSize ?? '?');
      s += ' A: ' + (server.pool?.availableConnectionCount ?? '?');
      if (server.pool?.loadBalanced) {
        s += '[LOAD BALANCED]';
      }
    });
    return s;
  } catch (e) {
    console.log(`mongo pool hack didn't work`, e);
    return `Mongo: Error`;
  }
};
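For reference, a less invasive way to keep a similar count is the driver's documented CMAP pool monitoring events on the MongoClient, rather than private topology fields. A minimal sketch (the interval and log format are just illustrative):

const client = mongoose.connection.getClient();

let checkedOut = 0;
client.on('connectionCheckedOut', () => { checkedOut += 1; });
client.on('connectionCheckedIn', () => { checkedOut -= 1; });
client.on('connectionCheckOutFailed', (ev) => console.log('checkout failed:', ev.reason));

// log the running count every 5 seconds; a steadily growing number with an
// idle server suggests operations are holding connections without completing
setInterval(() => console.log(`connections checked out: ${checkedOut}`), 5000);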

hunkydoryrepair avatar Sep 11 '24 08:09 hunkydoryrepair

I have been fighting this issue and have not been able to go above version 8.2.2

Thank you for this!

I've tried running my repro on several mongoose versions, and here are my findings:

8.3.0 repros the issue
8.2.3 seems to work fine

The issue starts in 8.3.0 /cc @vkarpov15

andreialecu avatar Sep 11 '24 09:09 andreialecu

Here's what I found: the issue is still present in 8.6.1. 8.3.0 indeed fails, which is when the mongodb driver was updated to 6.5. 8.6.1 uses driver version 6.8.

If we are experiencing the same thing, setting socketTimeoutMS will cause it to eventually throw a MongoNetworkTimeoutError.
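A minimal sketch of that (connectionString is a placeholder, and the 60-second value is just an example): socketTimeoutMS is a standard driver option that mongoose.connect passes through, so a stuck operation eventually fails instead of hanging forever.

await mongoose.connect(connectionString, {
  socketTimeoutMS: 60000, // give up on sockets that stay silent for 60 seconds
});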

hunkydoryrepair avatar Sep 11 '24 09:09 hunkydoryrepair

Update: I used yarn resolutions to pin mongodb to 6.3.0 and it appears that the problem no longer reproduces with my test script.
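For anyone else hitting this, the pin looks roughly like the following in the application's package.json (a sketch; adjust the version as needed). With npm, the equivalent mechanism is the "overrides" field.

{
  "resolutions": {
    "mongodb": "6.3.0"
  }
}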

Indeed this may be an upstream issue in the mongodb nodejs driver.

andreialecu avatar Sep 11 '24 10:09 andreialecu

It appears that the problem started with mongodb 6.4.0 - I've been trying various resolutions. 6.3.0 works, 6.4.0 exhibits the problem.

andreialecu avatar Sep 11 '24 10:09 andreialecu

This may be the same issue as:

https://jira.mongodb.org/projects/NODE/issues/NODE-6370
https://jira.mongodb.org/browse/NODE-6166

andreialecu avatar Sep 11 '24 11:09 andreialecu

I can confirm that we resolved our issue with the workaround of pinning mongodb to 6.3.0 using yarn resolutions.

This is not a mongoose issue. mongoose 8.5.1 with mongodb 6.3.0 seems to work.

I'll keep this issue open since the mongodb node driver doesn't have a GitHub issue tracker, and perhaps someone from the mongodb team will see it here.

A temporary fix for mongoose might be to revert the mongodb driver bump back to 6.3.0.

andreialecu avatar Sep 11 '24 13:09 andreialecu

I think it's worth keeping this open because Mongoose should at least have some option for using the working 6.3.0.

hunkydoryrepair avatar Sep 11 '24 16:09 hunkydoryrepair

It seems it might be a good idea to have mongoose define a peer dependency on mongodb instead of a direct dependency. That way the driver version would be user-controllable.

Not sure why it's not already that way.
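For illustration only (this is not mongoose's actual manifest, and the range is a made-up example), the suggestion would amount to something like this in mongoose's package.json, so the application's own mongodb entry decides which driver version gets installed:

{
  "peerDependencies": {
    "mongodb": "^6.0.0"
  }
}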

andreialecu avatar Sep 11 '24 16:09 andreialecu

I'm unable to repro this on my end with Mongoose 8.6.2, although I had to set upsert to false to avoid maxing out my Atlas storage limits.

Mongoose has always pinned exact versions of the MongoDB Node driver because in the past Mongoose was more tightly coupled with the MongoDB Node driver and patch releases would cause breaking issues. However, that hasn't been the case in some time, so we've considered relaxing Mongoose's Node driver version requirements.

@andreialecu if we changed Mongoose's dependencies to "mongodb": "^6.3.0" would that help? Mongoose 8.6 tests do pass with mongodb@6.3, 6.4, 6.5, 6.6, and 6.7.

vkarpov15 avatar Sep 13 '24 16:09 vkarpov15

That would probably work, but I think making mongodb a peer dependency might still be a good idea.

Or a peer dependency with a default. See: https://github.com/yarnpkg/berry/issues/3710#issuecomment-963732582

andreialecu avatar Sep 13 '24 16:09 andreialecu

You can probably get it to repro if you use upsert and update the same document repeatedly.
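For example, a hypothetical tweak to the repro's createObject above that reuses a single fieldB, so every findOneAndUpdate upserts and then repeatedly replaces the same large document (untested):

// hypothetical variation: one shared key means the same document is rewritten on every iteration
const sharedFieldB = new ObjectId().toHexString();

async function createObject() {
  await setTimeout(100 + Math.random() * 100);
  return {
    fieldA: new ObjectId(),
    fieldB: sharedFieldB, // same value every time -> same document updated repeatedly
    fieldC: 0,
    fieldD: {
      subFieldA: Array.from({ length: 2000 }, () => new ObjectId().toHexString()),
      subFieldB: Array.from({ length: 2000 }, () => new ObjectId().toHexString()),
      subFieldC: Array.from({ length: 2000 }, () => new ObjectId().toHexString()),
    },
  };
}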

I haven't tried optimizing the repro.

I noticed it happens more easily with bigger documents. The linked issues in the mongodb issue tracker also mention this.

It may still happen with smaller documents, just at a lower frequency. We've had some weird issues for a while which I think we can attribute to this problem.

andreialecu avatar Sep 13 '24 17:09 andreialecu

I have a test case that seems pretty reliable, at least with Atlas. It might hit storage limits, but it is a simple Model.create([]) with a large array of data. I haven't cleaned out some confidential data (but no personal information of any kind, and not too big a deal if it gets out). I could provide it to you @vkarpov15

I'm not certain whether it requires the schema or not, but it's about 600 fairly large objects, and from what I can tell it would kick off 600 insertOne calls simultaneously, which should pretty much max out the connection pool.

hunkydoryrepair avatar Sep 15 '24 04:09 hunkydoryrepair

npm doesn't install peer dependencies by default AFAIK, which makes making mongodb a peer dep likely a non-starter.

@hunkydoryrepair can you please send me the script? You can just email me val at karpov dot io, if that's preferred to posting the script on GitHub.

vkarpov15 avatar Sep 16 '24 14:09 vkarpov15

npm doesn't install peer dependencies by default AFAIK, which makes making mongodb a peer dep likely a non-starter.

It does, since npm 7 (2020): https://github.blog/news-insights/product-news/presenting-v7-0-0-of-the-npm-cli/

This has actually been quite controversial; the Yarn maintainers were vehemently opposed to this default.

Yarn and I think pnpm will not install peer dependencies automatically, but they will install peer dependencies with defaults (see my previous comment).

andreialecu avatar Sep 16 '24 14:09 andreialecu

The case I thought was the same is a different bug, I think. I have a solid repro case, but it works reliably in 8.4.0 and consistently hangs in 8.4.1 or higher. I will open a separate bug.

hunkydoryrepair avatar Sep 20 '24 02:09 hunkydoryrepair

@andreialecu I just discovered a reproducible case that is a little more direct. I believe it has to do with the IDLE CONNECTION TIMEOUT from the server. On Atlas, it is 10 minutes, but running locally it can be much, much longer. I added a 10 minute pause in the middle of my test cases and reproduced it much more easily.
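A sketch of that approach, where runBatch is a hypothetical wrapper around the Parallel.each loop from the repro above: run one burst, sit idle past Atlas's 10-minute idle-connection timeout so the server silently closes the pooled sockets, then run a second burst, where the hang is expected to show up.

const { setTimeout: sleep } = require('timers/promises');

async function runWithIdleGap(runBatch) {
  await runBatch();                        // first burst: populates the pool
  await sleep(10 * 60 * 1000 + 30 * 1000); // idle a bit over 10 minutes
  await runBatch();                        // second burst: exercises the stale sockets
}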

hunkydoryrepair avatar Sep 20 '24 21:09 hunkydoryrepair

The test case I shared in my initial post was 100% reproducible for me.

You weren't able to reproduce it?

It's possible it could be related to latency to Atlas as I was running it locally and my ping to the cluster is 30-60ms.

andreialecu avatar Sep 20 '24 22:09 andreialecu

I wasn't able to consistently reproduce with it, even when connecting to Atlas. It timed out a few times for me, but it was pretty slow other times, so it wasn't super convincing. And I go through a VPN to connect to Atlas, so it is pretty slow, but I'm not sure it was reaching the idle timeouts. The complicated-schema cases seem to be a different bug from what I can tell, as it does some N! operations where N = document depth, so even getting 10 levels deep would seem to hang.

You also said you could not reproduce it locally, and I think this is why.

hunkydoryrepair avatar Sep 20 '24 22:09 hunkydoryrepair

I was only able to reproduce it with certain mongodb driver versions, and it was consistent.

In our production app, pinning to the driver version I mentioned resolved the issue.

andreialecu avatar Sep 20 '24 22:09 andreialecu

@andreialecu Thank you. You inspired me to dig into this again using your use case. The issue I ran into is that your repro stops producing the problem if it is run multiple times, but I added a Test.deleteMany({}) at the beginning, and it was suddenly reproducible again.

I spent the day tracking it down in the debugger and found a solution, which I added to this issue. The mongodb team doesn't attach the listener for the socket's 'data' event in time to catch all of the 'data' events, so it misses the returned value. This happens only with larger commands that exceed the kernel socket buffer: socket.write then returns a value indicating the socket needs to be drained, and the driver awaits the 'drain' event before listening for 'data'. So before the task awaiting 'drain' is processed, the 'data' event carrying the reply is emitted and lost. Anyway... here's hoping they accept my solution!

https://jira.mongodb.org/projects/NODE/issues/NODE-6370
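To illustrate the ordering problem in isolation, here is a simplified sketch with an EventEmitter standing in for the socket (this is not the driver's actual code, just the shape of the race it describes):

const { EventEmitter, once } = require('events');

const socket = new EventEmitter(); // stand-in for the TCP socket

async function sendCommand() {
  // pretend socket.write() returned false because the kernel buffer is full,
  // so we wait for 'drain' before listening for the reply...
  await once(socket, 'drain');
  // ...but by now the 'data' event may already have fired with no listener
  // attached, so this promise never resolves; the same shape as the hung findOneAndUpdate
  return once(socket, 'data');
}

sendCommand().then(([reply]) => console.log('reply:', reply)); // never logs

// the server's reply races the 'drain' event and wins
setImmediate(() => socket.emit('data', 'reply'));
setTimeout(() => socket.emit('drain'), 10);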

hunkydoryrepair avatar Sep 21 '24 19:09 hunkydoryrepair

Can we keep this open until a fix from mongodb is available and incorporated?

hunkydoryrepair avatar Sep 21 '24 19:09 hunkydoryrepair

The mongodb driver fix has been pulled into the 6.10 branch. The next release should have the fix.

hunkydoryrepair avatar Oct 01 '24 20:10 hunkydoryrepair