Slow dataset queries

Open forgetso opened this issue 10 months ago • 2 comments

We use $sample with a size of 2 when fetching captchas, which is causing slow queries on the nodes. We need to change this approach as follows:

  1. Create an index on { datasetId: 1, solved: 1 }

  2. Instead of $sample, use a random selection method to improve performance. For example:

  • Add a random field to each document at insertion time.
  • Index this field.
  • Query using $gte or $lte to efficiently retrieve random documents.

  3. Use $limit before $sample

Instead of sampling from the entire dataset, limit the query first:

db.captchas.aggregate([
  { $match: { datasetId: "0xe666b35451f302b9fccfbe783b1de9a6a4420b840abed071931d68a9ccc1c21d", solved: true } },
  { $limit: 1000 },  // Get a subset first
  { $sample: { size: 2 } },  // Then sample from that subset
  { $project: { datasetId: 1, datasetContentId: 1, captchaId: 1, captchaContentId: 1, items: 1, target: 1 } }
]);

This caps the input to $sample at 1,000 matching documents, so MongoDB does not have to scan every match.
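
For option 2, a minimal sketch of the random-field idea follows. It is an illustration under assumptions, not what the codebase does: the field name rand, the helper names, and the wrap-around fallback are all hypothetical. The helpers only build plain filter objects, and pickFromArray is an in-memory stand-in that shows the selection logic without a database.

```javascript
// Hypothetical field name: the issue does not say what the random field is called.
const RANDOM_FIELD = "rand";

// Insertion time: attach a uniform random value in [0, 1) to the document.
function withRandomField(doc, rng = Math.random) {
  return { ...doc, [RANDOM_FIELD]: rng() };
}

// Query time: build two find() specs around a random pivot.
// `primary` reads upward from the pivot; `fallback` wraps around to the
// lowest values when fewer than `size` documents sit at or above the pivot.
function buildRandomQueries(datasetId, size, pivot = Math.random()) {
  const base = { datasetId, solved: true };
  const sort = { [RANDOM_FIELD]: 1 };
  return {
    primary: { filter: { ...base, [RANDOM_FIELD]: { $gte: pivot } }, sort, limit: size },
    fallback: { filter: { ...base, [RANDOM_FIELD]: { $lt: pivot } }, sort, limit: size },
  };
}

// In-memory stand-in for running both queries, only to show the selection logic.
function pickFromArray(docs, datasetId, size, pivot) {
  const sorted = docs
    .filter((d) => d.datasetId === datasetId && d.solved === true)
    .sort((a, b) => a[RANDOM_FIELD] - b[RANDOM_FIELD]);
  const above = sorted.filter((d) => d[RANDOM_FIELD] >= pivot);
  const below = sorted.filter((d) => d[RANDOM_FIELD] < pivot);
  return [...above, ...below].slice(0, size);
}

// In mongosh the specs would be used roughly like this (assumed usage):
//   db.captchas.createIndex({ datasetId: 1, solved: 1, rand: 1 })
//   const q = buildRandomQueries(datasetId, 2)
//   db.captchas.find(q.primary.filter).sort(q.primary.sort).limit(q.primary.limit)
```

One trade-off to keep in mind: documents that are neighbours in rand order tend to be returned together, so the pairs are less independent than with $sample unless the random values are refreshed from time to time.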

forgetso, Feb 01 '25 09:02

aggregate has no ordering so you don't need the random field

goastler, Feb 03 '25 09:02

https://github.com/prosopo/captcha/pull/1705

forgetso, Mar 05 '25 15:03