crawlee
crawlee copied to clipboard
feat: add `TandemRequestProvider` for combined `RequestList` and `RequestQueue` usage
Overview
This PR introduces two new components to improve request management in Crawlee:
TandemRequestProvider: A provider that seamlessly combinesRequestListandRequestQueue, allowing crawlers to use both sources efficiently.RequestListAdapter: An adapter that makesRequestListcompatible with theIRequestProviderinterface.
Problem Solved
When using both RequestList and RequestQueue in a crawler, users often need to implement custom logic to ensure URLs aren't processed twice. This implementation standardizes and simplifies this pattern by:
- First processing requests from the
RequestList - Automatically transferring these requests to the
RequestQueuein the background - Ensuring each URL is processed exactly once
- Gracefully transitioning to
RequestQueueafter the list is exhausted
Implementation Details
- The
RequestListAdapterwraps aRequestListinstance and adapts its interface to matchIRequestProvider. - The
TandemRequestProviderorchestrates the flow between the list and queue:- It implements the
IRequestProviderinterface - Handles request processing through the queue while ensuring list requests are enqueued first
- Maintains proper request state tracking across both sources
- Implements a background transfer mechanism to move requests efficiently
- It implements the
Integration
The implementation is already integrated into BasicCrawler._initializeRequestProviders(), making it immediately available to all crawler types without requiring additional configuration changes.
Backward Compatibility
This change is fully backward compatible:
- Existing code using either
RequestListorRequestQueuealone will continue to work without changes - This introduces an enhancement without modifying existing behavior
Testing
The implementation was tested to ensure it correctly:
- Transfers requests from list to queue
- Processes requests in the correct order
- Handles errors appropriately
- Maintains proper request state
Closes #2499
Please merge the latest master into your branch so we only get your changes in the diff. Cheers!
@barjin Everything is done know you can review the PR thanks
Thank you for your initiative!
There are some things I'm not too sure about (see my comments). Also, I see that your branch is almost a year (~500 commits) behind our master. Could you please merge the latest
apify/crawleemaster into your branch, so we can test the changes later? Cheers!
Hii @barjin I am really too much confused about merging the latest master into my branch can you tell me how to do that please
Thank you for your initiative! There are some things I'm not too sure about (see my comments). Also, I see that your branch is almost a year (~500 commits) behind our master. Could you please merge the latest
apify/crawleemaster into your branch, so we can test the changes later? Cheers!Hii @barjin I am really too much confused about merging the latest master into my branch can you tell me how to do that please
See the GitHub documentation at https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork?platform=linux . I see you might have already merged our master branch into your master branch - you want to sync our master into your feat/tandem-request-provider.
Try going here and find the "Sync fork" button.
It's done now let me resolve the requested changes Thanks for guide
Hey @barjin Now the PR is ready to review/merge can you check and let me know if any changes required
Thanks @CodeMan62 for the contribution! I assume you're trying to fix https://github.com/apify/crawlee/issues/2499? If so, we should link it to this PR.
I haven't reviewed the code yet, but here are some points I want to make:
-
Eventually, we want the tandem class to be called
RequestManagerTandemso that it is consistent with the Python version of Crawlee. Can you renameIRequestProvidertoIRequestManageras well? TheRequestProviderand so on has to stay like that until we start working on a breaking release, so let's not rename that yet.
I still have some questions regarding the background transfer of the RequestList requests to the RequestQueue. I'm still not convinced this is the way to go here, especially since we've always done this lazily until now.
While issue https://github.com/apify/crawlee/issues/2499 mentions that we might want to transfer the request list independently of fetchNextRequest calls, I agree that we should tread carefully. Just extracting the tandem behavior as it is now should be sufficient. I made an initial draft a while back here https://github.com/apify/crawlee/pull/2898/files#diff-d631fbcd81c43142e84b9d2d01ead93451a95344001815f83c88f956ba14557a
Hey @barjin Thanks for the thorough review! I've addressed both concerns:
-
Changed RequestList -> RequestQueue transfer to be lazy and batched (25 requests at a time):
- Only transfers when fetchNextRequest() needs more requests
- Prevents memory spikes and unnecessary API calls
- Allows graceful stopping without wasted operations
-
Fixed the process hanging issue by removing background Promise:
- No more silent operations after crawler.stop()
- Process exits cleanly since transfers are now on-demand
The new implementation should handle large RequestLists efficiently while being safer to use. Let me know if you'd like any adjustments!
Hey @janbuchar Thanks for the feedback! I've made the following changes:
- Renamed
TandemRequestProvidertoRequestManagerTandemto align with Python Crawlee's naming - Updated the implementation to use lazy loading with batched transfers (25 requests at a time) instead of background transfer
- Left
IRequestProvideras is for now, waiting for the breaking release to rename it toIRequestManager
And yes, this PR is addressing #2499 . I've looked at your initial draft and implemented a similar approach with batched transfers that should be more efficient while maintaining the same functionality.
@janbuchar Regarding the rename to RequestManagerTandem - I've reverted this change for now. I think we should:
- Discuss the naming convention changes more broadly
- Plan this as part of a coordinated update with other renames
- Consider it for the next breaking release along with the IRequestProvider -> IRequestManager change
Let me know if you agree with this approach or if you'd prefer to handle the rename differently.
@barjin I have done the changes you have asked for take a look at the changes and let me know if any changes required
@janbuchar Regarding the rename to RequestManagerTandem - I've reverted this change for now. I think we should:
- Discuss the naming convention changes more broadly
- Plan this as part of a coordinated update with other renames
- Consider it for the next breaking release along with the IRequestProvider -> IRequestManager change
Let me know if you agree with this approach or if you'd prefer to handle the rename differently.
Hi @CodeMan62, i disagree. There is no good reason to introduce another set of names just to change them later. We should keep the current names backwards compatible for now, but where it's applicable, we should use the naming from the python port - that is the target state:
- instead of
TandemRequestProvider, there should beRequestManagerTandem - instead of
IRequestProvider, we should haveIRequestManager RequestProvidershould implementIRequestManager
@janbuchar No worries, let me change the names as you want
Hey @janbuchar I have changed the names as you want can you please review it and tell me know if any code changes required
This is shaping up nicely!
- please rename the files to match the classes in them
- could you add some tests to verify the newly added functionality? Also, the existing tests must pass
- please run the code formatter (
yarn format) to get rid of whitespace-only changes
Working on it today, almost done with the commit
Hey @janbuchar, I have renamed the files to match the classes in them and added tests, and all the tests are passing too, and also yarn format is done. Let me know if any changes are required! :)
Hi @CodeMan62, I still see some test failures - could you look into that please?
Hi @CodeMan62, I still see some test failures - could you look into that please?
Yes, I am trying to fix them :) EDIT:- Working on it
@janbuchar Can you help me fix this failing tests please?
@CodeMan62 I can try, but I'm stretched kinda thin right now. What did you already try? Is there anything you could tell me about the failing tests?
@CodeMan62 I can try, but I'm stretched kinda thin right now. What did you already try? Is there anything you could tell me about the failing tests?
almost 26 tests are failing let me try more to fix as much as I can and then let you know my mind is completely messed with these test but no worries let me put some more efforts
@janbuchar i think i really need your help in it i am not able to fix the tests help me in it please :pray:
@barjin @B4nan I updated the PR and made sure it passes tests. It should be fully backwards compatible. Can you take another look?