puppeteer-cluster
Roadmap for v1.0
I'm thinking about what kind of functionality this library should provide before it is released as v1. I might edit the list in the future:
My goals:
- [ ] (#25) Make sure it's reliable and crawl more than 10 million pages with it (so far the maximum I crawled was ~800k pages)
- [x] (#9) ~~Improve `sameDomainDelay` and `skipDuplicateUrls`. Detection of domains should use TLD.js for example. Documentation should be better. And there should be a way to provide the URL without using `data` or `{ url: ... }`~~ Not a goal for 1.0 anymore
- [ ] (#28) Optimize the code, fix code smells
- [x] More tests, get code coverage up to > 90%
- [ ] More documentation on the concurrency types. Maybe make `CONCURRENCY_BROWSER` the default as it is more robust?
- [ ] More code snippets in the documentation page (for `Cluster.queue`, for example)
- [x] Provide a `cluster.execute` function which executes the job ~~with higher priority (does not queue it at the end)~~ and returns a Promise that is resolved when the job is finished (see the sketch after this list). Might also solve this confusion: https://github.com/thomasdondorf/puppeteer-cluster/issues/10#issuecomment-419324832
- [ ] Statistics API: how many jobs are in the queue, how many jobs have been processed, etc.
- [x] #41 Offer more functionality, maybe provide a way to use puppeteer-extra?
- [x] #36 ~~Sandbox~~ Offer a way to run code from users in a sandbox, maybe even Docker? => This can now be implemented via custom concurrency implementations (although there are no custom implementations right now)
- [x] #70 Improve types
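
For reference, a minimal sketch of how `cluster.execute` can be used (the URL and concurrency settings are placeholder values, not from the roadmap):

```js
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2, // placeholder value
  });

  // Unlike cluster.queue, cluster.execute returns a Promise that resolves
  // with the task's return value once this particular job has finished.
  const title = await cluster.execute(
    'https://example.com', // placeholder URL
    async ({ page, data: url }) => {
      await page.goto(url);
      return page.title();
    },
  );
  console.log(title);

  await cluster.idle();
  await cluster.close();
})();
```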
Maybe:
- [ ] Provide a simple but robust data store with the library
- [ ] Rename API: some parts of the API are rather unfortunate. `concurrency` should be `concurrencyType`, and `maxConcurrency` maybe `maxWorkers`?
- [ ] Provide a queue function to the task function for a more functional syntax, so that you don't need to access the cluster from inside the task (see the sketch after this list)
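
Purely to illustrate that last idea: this API does not exist yet, and the `queue` argument below is hypothetical, but the functional syntax could look roughly like this:

```js
// Hypothetical sketch -- not part of the current puppeteer-cluster API.
// Today the task has to close over the cluster instance to queue follow-up
// jobs; the idea above would hand a queue function to the task instead.
await cluster.task(async ({ page, data: url, queue }) => {
  await page.goto(url);
  const links = await page.$$eval('a', (anchors) => anchors.map((a) => a.href));
  links.forEach((link) => queue(link)); // hypothetical per-task queue function
});
```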
Not planned (for now):
- [x] ~~https://github.com/thomasdondorf/puppeteer-cluster/issues/8#issuecomment-421307994 Mixed concurrency models~~
  - Reason: It does not work well together with the idea of having a sandbox (which part of the browser/page/context stuff should be sandboxed then?)
I have a question: how many browsers can I spawn in parallel per processor core? Let's say my server has a processor with 4 cores. How many browsers can I spawn at one time for my tests to pass?
Next time, please open a separate issue if it has nothing to do with this issue.
Regarding your question: it depends on your use case. For simple DOM handling I was able to run ~10 workers on my machine (i5 quad core). Just give it a try with the `monitor: true` option and see how your machine handles the tasks.
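
For anyone trying this out, a minimal sketch of enabling the monitor (the concurrency model, worker count, and URL are placeholders to experiment with):

```js
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 10, // placeholder: raise or lower while watching the monitor
    monitor: true,      // prints CPU, memory and per-worker progress to the terminal
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    // simple DOM handling goes here
  });

  cluster.queue('https://example.com'); // placeholder URL

  await cluster.idle();
  await cluster.close();
})();
```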
- Add a mixed concurrency model, i.e. for the PAGE or CONTEXT concurrency model, have the option to distribute the jobs to more than one browser instance. That way a crash won't affect all jobs, and this offers a good balance between reliability and resource usage.
- Add an API that returns the length of the queue, the time when the oldest item in the queue was added, and the number of jobs processed in the last minute. For a continuously operating cluster, i.e. jobs being added continuously, this information is valuable.
Unfortunately, the current implementation of custom concurrency doesn't address the case where you need to provide custom puppeteer parameters to job instances. IMHO this would effectively solve #36 with puppeteer `args: ['--incognito', '--proxy-server=${proxyServer}']` and `await page.authenticate(credentials)`.
@thomasdondorf , what do you think about this?
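
As a reference point for that discussion: what is already possible is setting such parameters cluster-wide via `puppeteerOptions` and authenticating inside the task, as sketched below (the proxy address, credentials, and URL are placeholders). Per-job launch parameters, as suggested above, would still need support in the concurrency implementation.

```js
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const proxyServer = 'http://my-proxy:3128';                 // placeholder
  const credentials = { username: 'user', password: 'pass' }; // placeholder

  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_BROWSER,
    maxConcurrency: 4,
    // puppeteerOptions is passed to puppeteer.launch, so these args apply to
    // every browser the cluster starts: cluster-wide, not per job.
    puppeteerOptions: {
      args: ['--incognito', `--proxy-server=${proxyServer}`],
    },
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.authenticate(credentials); // proxy authentication per page
    await page.goto(url);
  });

  cluster.queue('https://example.com'); // placeholder URL
  await cluster.idle();
  await cluster.close();
})();
```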
I'm currently thinking about completely reworking the concurrency implementations. Then there would be no more "WorkerInstance" and "JobInstance", just one function that is called when a page is needed. The concurrency implementation would then have full flexibility over when a puppeteer instance is started and when one is reused.
Expect some code changes in the next two weeks ;)
Cool, glad to hear that. Feel free to ping me if you need any help)
+1 for Docker container support. https://github.com/skalfyfan/dockerized-puppeteer
Is there a way to connect the puppeteer-cluster to a remote instance of chromium? (“connect” instead of “launch”)
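
One possible workaround is a sketch like the one below, assuming a remote Chromium that exposes a WebSocket endpoint at the placeholder URL: hand the cluster a custom `puppeteer` object (via the `puppeteer` launch option) whose `launch` delegates to `puppeteer.connect`.

```js
const puppeteer = require('puppeteer');
const { Cluster } = require('puppeteer-cluster');

// Sketch of a workaround: puppeteer-cluster calls `puppeteer.launch(...)`
// internally, so a thin shim can redirect that call to `puppeteer.connect`.
const connectingPuppeteer = {
  launch: () =>
    puppeteer.connect({
      browserWSEndpoint: 'ws://remote-chromium:3000', // placeholder endpoint
    }),
};

(async () => {
  const cluster = await Cluster.launch({
    puppeteer: connectingPuppeteer, // use the shim instead of the bundled puppeteer
    concurrency: Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 4,
  });
  // cluster.task / cluster.queue as usual
})();
```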
Hello - just wanted to get a feel for how active this project is. I see puppeteer cluster as being useful for several projects I'd like to work on. However, I'm hesitant to use it if development will be abandoned. Is development still happening? Thanks!
> (Long-term runs of puppeteer-cluster #25) Make sure it's reliable and crawl more than 10 million pages with it (so far the maximum I crawled was ~800k pages)
I use k6 benchmarks in my CI tests for soketi, making sure all releases pass the benchmarks in most cases.
Would it be a good idea to set something similar up for you for page rendering testing?