crawlee icon indicating copy to clipboard operation
crawlee copied to clipboard

feat: add `TandemRequestProvider` for combined `RequestList` and `RequestQueue` usage

Open CodeMan62 opened this issue 7 months ago • 22 comments

Overview

This PR introduces two new components to improve request management in Crawlee:

  • TandemRequestProvider: A provider that seamlessly combines RequestList and RequestQueue, allowing crawlers to use both sources efficiently.
  • RequestListAdapter: An adapter that makes RequestList compatible with the IRequestProvider interface.

Problem Solved

When using both RequestList and RequestQueue in a crawler, users often need to implement custom logic to ensure URLs aren't processed twice. This implementation standardizes and simplifies this pattern by:

  1. First processing requests from the RequestList
  2. Automatically transferring these requests to the RequestQueue in the background
  3. Ensuring each URL is processed exactly once
  4. Gracefully transitioning to RequestQueue after the list is exhausted

Implementation Details

  • The RequestListAdapter wraps a RequestList instance and adapts its interface to match IRequestProvider.
  • The TandemRequestProvider orchestrates the flow between the list and queue:
    • It implements the IRequestProvider interface
    • Handles request processing through the queue while ensuring list requests are enqueued first
    • Maintains proper request state tracking across both sources
    • Implements a background transfer mechanism to move requests efficiently

Integration

The implementation is already integrated into BasicCrawler._initializeRequestProviders(), making it immediately available to all crawler types without requiring additional configuration changes.

Backward Compatibility

This change is fully backward compatible:

  • Existing code using either RequestList or RequestQueue alone will continue to work without changes
  • This introduces an enhancement without modifying existing behavior

Testing

The implementation was tested to ensure it correctly:

  • Transfers requests from list to queue
  • Processes requests in the correct order
  • Handles errors appropriately
  • Maintains proper request state

Closes #2499

CodeMan62 avatar Apr 05 '25 18:04 CodeMan62

Please merge the latest master into your branch so we only get your changes in the diff. Cheers!

barjin avatar Apr 05 '25 18:04 barjin

@barjin Everything is done know you can review the PR thanks

CodeMan62 avatar Apr 05 '25 19:04 CodeMan62

Thank you for your initiative!

There are some things I'm not too sure about (see my comments). Also, I see that your branch is almost a year (~500 commits) behind our master. Could you please merge the latest apify/crawlee master into your branch, so we can test the changes later? Cheers!

Hii @barjin I am really too much confused about merging the latest master into my branch can you tell me how to do that please

CodeMan62 avatar Apr 07 '25 12:04 CodeMan62

Thank you for your initiative! There are some things I'm not too sure about (see my comments). Also, I see that your branch is almost a year (~500 commits) behind our master. Could you please merge the latest apify/crawlee master into your branch, so we can test the changes later? Cheers!

Hii @barjin I am really too much confused about merging the latest master into my branch can you tell me how to do that please

See the GitHub documentation at https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork?platform=linux . I see you might have already merged our master branch into your master branch - you want to sync our master into your feat/tandem-request-provider.

Try going here and find the "Sync fork" button.

barjin avatar Apr 07 '25 13:04 barjin

It's done now let me resolve the requested changes Thanks for guide

CodeMan62 avatar Apr 07 '25 14:04 CodeMan62

Hey @barjin Now the PR is ready to review/merge can you check and let me know if any changes required

CodeMan62 avatar Apr 08 '25 13:04 CodeMan62

Thanks @CodeMan62 for the contribution! I assume you're trying to fix https://github.com/apify/crawlee/issues/2499? If so, we should link it to this PR.

I haven't reviewed the code yet, but here are some points I want to make:

  1. Eventually, we want the tandem class to be called RequestManagerTandem so that it is consistent with the Python version of Crawlee. Can you rename IRequestProvider to IRequestManager as well? The RequestProvider and so on has to stay like that until we start working on a breaking release, so let's not rename that yet.

I still have some questions regarding the background transfer of the RequestList requests to the RequestQueue. I'm still not convinced this is the way to go here, especially since we've always done this lazily until now.

While issue https://github.com/apify/crawlee/issues/2499 mentions that we might want to transfer the request list independently of fetchNextRequest calls, I agree that we should tread carefully. Just extracting the tandem behavior as it is now should be sufficient. I made an initial draft a while back here https://github.com/apify/crawlee/pull/2898/files#diff-d631fbcd81c43142e84b9d2d01ead93451a95344001815f83c88f956ba14557a

janbuchar avatar Apr 09 '25 09:04 janbuchar

Hey @barjin Thanks for the thorough review! I've addressed both concerns:

  1. Changed RequestList -> RequestQueue transfer to be lazy and batched (25 requests at a time):

    • Only transfers when fetchNextRequest() needs more requests
    • Prevents memory spikes and unnecessary API calls
    • Allows graceful stopping without wasted operations
  2. Fixed the process hanging issue by removing background Promise:

    • No more silent operations after crawler.stop()
    • Process exits cleanly since transfers are now on-demand

The new implementation should handle large RequestLists efficiently while being safer to use. Let me know if you'd like any adjustments!

CodeMan62 avatar Apr 09 '25 14:04 CodeMan62

Hey @janbuchar Thanks for the feedback! I've made the following changes:

  1. Renamed TandemRequestProvider to RequestManagerTandem to align with Python Crawlee's naming
  2. Updated the implementation to use lazy loading with batched transfers (25 requests at a time) instead of background transfer
  3. Left IRequestProvider as is for now, waiting for the breaking release to rename it to IRequestManager

And yes, this PR is addressing #2499 . I've looked at your initial draft and implemented a similar approach with batched transfers that should be more efficient while maintaining the same functionality.

CodeMan62 avatar Apr 09 '25 15:04 CodeMan62

@janbuchar Regarding the rename to RequestManagerTandem - I've reverted this change for now. I think we should:

  1. Discuss the naming convention changes more broadly
  2. Plan this as part of a coordinated update with other renames
  3. Consider it for the next breaking release along with the IRequestProvider -> IRequestManager change

Let me know if you agree with this approach or if you'd prefer to handle the rename differently.

CodeMan62 avatar Apr 11 '25 12:04 CodeMan62

@barjin I have done the changes you have asked for take a look at the changes and let me know if any changes required

CodeMan62 avatar Apr 14 '25 11:04 CodeMan62

@janbuchar Regarding the rename to RequestManagerTandem - I've reverted this change for now. I think we should:

  1. Discuss the naming convention changes more broadly
  2. Plan this as part of a coordinated update with other renames
  3. Consider it for the next breaking release along with the IRequestProvider -> IRequestManager change

Let me know if you agree with this approach or if you'd prefer to handle the rename differently.

Hi @CodeMan62, i disagree. There is no good reason to introduce another set of names just to change them later. We should keep the current names backwards compatible for now, but where it's applicable, we should use the naming from the python port - that is the target state:

  • instead of TandemRequestProvider, there should be RequestManagerTandem
  • instead of IRequestProvider, we should have IRequestManager
  • RequestProvider should implement IRequestManager

janbuchar avatar Apr 14 '25 12:04 janbuchar

@janbuchar No worries, let me change the names as you want

CodeMan62 avatar Apr 14 '25 12:04 CodeMan62

Hey @janbuchar I have changed the names as you want can you please review it and tell me know if any code changes required

CodeMan62 avatar Apr 15 '25 18:04 CodeMan62

This is shaping up nicely!

  • please rename the files to match the classes in them
  • could you add some tests to verify the newly added functionality? Also, the existing tests must pass
  • please run the code formatter (yarn format) to get rid of whitespace-only changes

Working on it today, almost done with the commit

CodeMan62 avatar Apr 17 '25 08:04 CodeMan62

Hey @janbuchar, I have renamed the files to match the classes in them and added tests, and all the tests are passing too, and also yarn format is done. Let me know if any changes are required! :)

CodeMan62 avatar Apr 17 '25 09:04 CodeMan62

Hi @CodeMan62, I still see some test failures - could you look into that please?

janbuchar avatar Apr 25 '25 14:04 janbuchar

Hi @CodeMan62, I still see some test failures - could you look into that please?

Yes, I am trying to fix them :) EDIT:- Working on it

CodeMan62 avatar Apr 25 '25 14:04 CodeMan62

@janbuchar Can you help me fix this failing tests please?

CodeMan62 avatar Apr 29 '25 19:04 CodeMan62

@CodeMan62 I can try, but I'm stretched kinda thin right now. What did you already try? Is there anything you could tell me about the failing tests?

janbuchar avatar Apr 30 '25 12:04 janbuchar

@CodeMan62 I can try, but I'm stretched kinda thin right now. What did you already try? Is there anything you could tell me about the failing tests?

almost 26 tests are failing let me try more to fix as much as I can and then let you know my mind is completely messed with these test but no worries let me put some more efforts

CodeMan62 avatar Apr 30 '25 14:04 CodeMan62

@janbuchar i think i really need your help in it i am not able to fix the tests help me in it please :pray:

CodeMan62 avatar May 06 '25 08:05 CodeMan62

@barjin @B4nan I updated the PR and made sure it passes tests. It should be fully backwards compatible. Can you take another look?

janbuchar avatar Jul 28 '25 11:07 janbuchar