
RequestQueue locks might get lost in specific scenarios

janbuchar opened this issue 10 months ago · 5 comments

Error scenario

  1. fetch and lock 25 requests in the RequestQueue, with a lock time of 60s
  2. each request takes 10s to handle, which is within the request handler timeout
  3. after a handful of requests (once roughly 60s of processing has elapsed), the locks on the remaining "locally dequeued" requests expire, even though the crawler still treats them as locked

This becomes worse if the user runs CPU-bound work that stalls the Node.js event loop.
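
For illustration, a minimal crawler setup that can hit this scenario (a sketch only: BasicCrawler, the 10-second handler, and maxConcurrency of 1 are assumptions chosen so that a prefetched batch of 25 requests takes far longer than the 60s lock):

import { BasicCrawler } from 'crawlee';
import { setTimeout } from 'timers/promises';

const crawler = new BasicCrawler({
    // Process the prefetched batch one request at a time, so 25 requests
    // take ~250s while each lock lasts only 60s.
    maxConcurrency: 1,
    requestHandlerTimeoutSecs: 30,
    async requestHandler({ request, log }) {
        await setTimeout(10_000); // ~10s of work, well within the handler timeout
        log.info(`Finished ${request.url}`);
    },
});

await crawler.run(
    Array.from({ length: 25 }, (_, i) => `https://example.com/${i}`),
);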

Possible solutions

  1. ignore it
  2. run a background loop that periodically prolongs all held locks (see the sketch after this list)
  3. prolong locks in fetchNextRequest - that should be called at least once per requestHandlerTimeout
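
A minimal sketch of option 2, assuming the storage client exposes prolongRequestLock (the request-locking client used by RequestQueueV2 does) and that the queue tracks a set of locally held request IDs (heldRequestIds is a made-up name here):

function startLockKeepAlive(queueClient, heldRequestIds, lockSecs = 60) {
    // Re-extend every held lock at half of its TTL, so a single missed tick is not fatal.
    const timer = setInterval(async () => {
        for (const id of heldRequestIds) {
            try {
                await queueClient.prolongRequestLock(id, { lockSecs });
            } catch {
                // The lock may already be gone; leave recovery to the caller.
            }
        }
    }, (lockSecs / 2) * 1000);

    return () => clearInterval(timer);
}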

janbuchar · Feb 17 '25

Hi @janbuchar, I've run several tests to try to reproduce the RequestQueue lock issue with different configurations:

  1. Initial test with 25 requests and 5 concurrent requests:

    • All requests processed successfully
    • No lock issues observed
    • Average request duration: ~12 seconds
  2. More aggressive test with 100 requests and 20 concurrent requests:

    • All requests processed successfully
    • No lock issues observed
    • Average request duration: ~4.7 seconds
    • Requests finished per minute: 61
  3. Most aggressive test with 200 requests and 50 concurrent requests (a rough sketch of this setup follows the list):

    • All requests processed successfully
    • No lock issues observed
    • Average request duration: ~2.8 seconds
    • Requests finished per minute: 52-54
    • Added random delays (500ms-3000ms) to simulate network latency
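
Not the actual test script, just a rough sketch of the kind of setup used for the most aggressive test (200 requests, concurrency 50, random 500-3000ms delays), assuming a BasicCrawler against the default request queue:

import { BasicCrawler } from 'crawlee';
import { setTimeout } from 'timers/promises';

const crawler = new BasicCrawler({
    minConcurrency: 50,
    maxConcurrency: 50,
    async requestHandler({ request, log }) {
        // Random 500-3000ms delay to simulate network latency.
        const delay = 500 + Math.floor(Math.random() * 2500);
        await setTimeout(delay);
        log.info(`Done ${request.url} after ${delay}ms`);
    },
});

await crawler.run(
    Array.from({ length: 200 }, (_, i) => `https://example.com/page-${i}`),
);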

In all test cases:

  • No failed requests
  • No duplicate processing
  • No lost requests
  • Proper ordering of request processing
  • System resources were well-managed (occasional event loop overload but system recovered)

The RequestQueue implementation appears to be handling concurrent access correctly:

  • Requests are properly locked while being processed
  • No race conditions observed
  • Queue maintains proper ordering
  • Failed requests are properly retried

Could you provide more details about:

  1. The specific conditions under which you're seeing the lock issues?
  2. Any error messages or symptoms you're observing?
  3. Your crawler configuration (concurrency, request handler timeout, etc.)?
  4. Whether you're seeing this in a specific environment (local, cloud, etc.)?

This would help us better understand and reproduce the issue you're experiencing.

CodeMan62 · Apr 05 '25

One more thing: I have been trying to reproduce this for the past hour, so please provide more details.

CodeMan62 · Apr 05 '25

Imo the minimal reproduction scenario is here:

import { setTimeout } from "timers/promises";
import { Configuration, RequestQueueV2 } from '@crawlee/core';
import { Worker, isMainThread } from 'worker_threads';

async function initializeRq() {
    const requestQueue = await RequestQueueV2.open(null);

    await requestQueue.addRequests([
        { url: 'https://example.com/0' },
        { url: 'https://example.com/1' },
        { url: 'https://example.com/2' },
    ]);
}

async function main() {
    const requestQueue = await RequestQueueV2.open(null, {
        config: new Configuration({
            purgeOnStart: false,
        })
    });

    // Shorten the lock TTL so the locks expire between the fetches below.
    requestQueue.requestLockSecs = 1;

    // Thread tag: isMainThread + 1 prints [2] for the main thread and [1] for the worker.
    console.log(`[${isMainThread + 1}] ${(await requestQueue.fetchNextRequest())?.url} [${Date.now()}]`);
    await setTimeout(1000);
    console.log(`[${isMainThread + 1}] ${(await requestQueue.fetchNextRequest())?.url} [${Date.now()}]`);
    await setTimeout(1000);
    console.log(`[${isMainThread + 1}] ${(await requestQueue.fetchNextRequest())?.url} [${Date.now()}]`);
}

if (isMainThread) {
    // Seed the queue, start a worker that runs main() right away,
    // then run main() in this thread too after a 2-second head start for the worker.
    await initializeRq();
    new Worker(new URL(import.meta.url));
    await setTimeout(2000);
}

main();

The main function fetches three requests from the default RequestQueueV2 instance and prints their URLs.

In the script, we run the main function twice: in the main thread and in the worker thread. The worker's lock on the batch it prefetched (fetchNextRequest locks up to 25 requests at once) elapses right when the main thread makes its first fetchNextRequest call, which causes both threads to access the same requests at the same time, each thinking it has them exclusively locked.

See the output of this script:

[1] https://example.com/0 [1743879224089]
[1] https://example.com/2 [1743879225094]
[2] https://example.com/1 [1743879225692]
[1] https://example.com/1 [1743879226106]
[2] undefined [1743879226697]
[2] https://example.com/0 [1743879227708]

Note that the worker thread ([1]) accessed https://example.com/1 at 1743879226106, i.e. 414 milliseconds after the main thread ([2]) fetched it. This violates the requestLockSecs setting we applied to both RequestQueue instances.
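
To make it concrete why this happens on the consumer side, here is a purely hypothetical sketch (not the actual crawlee internals; all names are made up) of a local head cache that would avoid handing out requests whose lock has already elapsed:

class LockAwareLocalHead {
    constructor(lockSecs) {
        this.lockSecs = lockSecs;
        this.entries = new Map(); // requestId -> { request, lockExpiresAt }
    }

    // Record a freshly locked batch together with its expected lock expiration.
    addBatch(requests) {
        const lockExpiresAt = Date.now() + this.lockSecs * 1000;
        for (const request of requests) {
            this.entries.set(request.id, { request, lockExpiresAt });
        }
    }

    // Skip entries whose lock has already elapsed, because another client
    // (here, the other thread) may have re-locked them in the meantime.
    takeNext() {
        for (const [id, entry] of this.entries) {
            this.entries.delete(id);
            if (entry.lockExpiresAt > Date.now()) return entry.request;
        }
        return null;
    }
}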

barjin · Apr 05 '25

Hey @barjin, thanks for the guide. I already have one PR open; once that is done, I will make a PR to fix this issue.

CodeMan62 · Apr 06 '25

I ran into a bug in RequestQueue that causes some Requests to be "successfully" handled multiple times.

I am not sure whether it is related to the SDK or to the Apify Platform. It happens on the Apify Platform with the default RequestQueue (V2 with locking) and no extra settings; it doesn't happen with this crawler option:

experiments: {
    requestLocking: false,
},

When this option is set, everything works as expected.
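
For context, this is roughly where the option is passed, assuming a CheerioCrawler (the crawler class and the handler body are placeholders):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    experiments: {
        requestLocking: false, // fall back to the non-locking RequestQueue
    },
    async requestHandler({ request, log }) {
        log.info(`Handling ${request.url}`);
    },
});

await crawler.run(['https://example.com']);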

Based on the log, there were 8 requests handled, but according to the RQ only 5 of them were made; some of them were handled multiple times.

JJetmar · Nov 05 '25