Roadmap for v1.0

Open thomasdondorf opened this issue 7 years ago • 10 comments

I'm thinking about what kind of functionality this library should provide before it should be released as v1. I might edit the list in the future:

My goals:

  • [ ] (#25) Make sure it's reliable and crawl more than 10 million pages with it (so far the maximum I crawled was ~800k pages)
  • [x] (#9) ~~Improve sameDomainDelay and skipDuplicateUrls. Detection of domains should use TLD.js for example. Documentation should be better. And there should be a way to provide the URL without using data or { url: ... }~~ Not a goal for 1.0 anymore
  • [ ] (#28) Optimize the code, fix code smells
  • [x] More tests, get code coverage up to > 90%
  • [ ] More documentation on the concurrency types. Maybe make CONCURRENCY_BROWSER the default as it is more robust?
  • [ ] More code snippets in the documentation page (for Cluster.queue for example)
  • [x] Provide a cluster.execute function which executes the job ~with higher priority (does not queue it at the end)~ and returns a Promise which is resolved when the job is finished. Might also solve this confusion: https://github.com/thomasdondorf/puppeteer-cluster/issues/10#issuecomment-419324832
  • [ ] Statistics API: How many jobs in queue, how many jobs processed, etc.
  • [x] #41 Offer more functionality, maybe provide a way to use puppeteer-extra?
  • [x] #36 ~~Sandbox~~ Offer a way to run code from users in a sandbox, maybe even Docker? => This can now be implemented via custom concurrency implementations (although there are no custom implementations right now)
  • [x] #70 Improve types
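
To illustrate the cluster.execute goal above, here is a minimal sketch of the difference between queueing a job and executing one whose result is awaited. It is written against a local structural type (`ClusterLike`, an assumption made for this example) so that it runs without a browser; `FakeCluster` and `crawlAndInspect` are hypothetical names, and with the real library the cluster object would come from `Cluster.launch(...)`.

```typescript
// Structural stand-in for the two cluster methods discussed above.
// Assumption: this mirrors the shape of puppeteer-cluster's queue/execute.
interface ClusterLike<JobData, Result> {
  queue(data: JobData): Promise<void>;      // fire-and-forget: no result is returned
  execute(data: JobData): Promise<Result>;  // resolves once the job has finished
}

// Queue several URLs without waiting for their results,
// then run one job and wait for its return value.
async function crawlAndInspect(
  cluster: ClusterLike<string, string>,
  urls: string[],
  urgentUrl: string,
): Promise<string> {
  for (const url of urls) {
    await cluster.queue(url);
  }
  return cluster.execute(urgentUrl);
}

// In-memory fake so the sketch can be run without launching Chromium.
class FakeCluster implements ClusterLike<string, string> {
  readonly queued: string[] = [];

  async queue(url: string): Promise<void> {
    this.queued.push(url);
  }

  async execute(url: string): Promise<string> {
    return `title of ${url}`;
  }
}
```

The point of the design is that callers who need a result no longer have to smuggle it out of the task function through shared state; they simply await the returned Promise.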

Maybe:

  • [ ] Provide a simple but robust data store with the library
  • [ ] Rename API: Some parts of API are rather unfortunate
    • concurrency should be concurrencyType
    • maxConcurrency maybe maxWorkers?
  • [ ] Provide queue function to the task function for a more functional syntax (so that you don't need to access cluster from inside the task)

Not planned (for now):

  • [x] ~~https://github.com/thomasdondorf/puppeteer-cluster/issues/8#issuecomment-421307994 Mixed concurrency models~~
    • Reason: It does not work well together with the idea of having a sandbox (which part of the browser/page/context stuff should be sandboxed then?)

thomasdondorf avatar Sep 05 '18 18:09 thomasdondorf

I have a question. How many browsers can I spawn in parallel per processor core? Let's say my server has a processor with 4 cores. How many browsers can I spawn at one time for my tests to pass?

barpaw avatar Sep 11 '18 23:09 barpaw

Next time, please open a separate issue if it has nothing to do with this issue.

Regarding your question: It depends on your use case. For simple DOM handling I was able to run ~10 workers on my machine (i5 quad core). Just give it a try with the option (monitor: true) and see how your machine is handling the tasks.
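
A minimal sketch of such a sizing trial follows. The `SizingOptions` type and the `sizingOptions` helper are hypothetical and exist only for this example; the field names `maxConcurrency` and `monitor` are assumed to match the library's launch options, and the value 10 is just the figure mentioned above, not a recommendation.

```typescript
// Subset of launch options relevant to a sizing experiment.
// Assumption: field names follow puppeteer-cluster's launch options.
interface SizingOptions {
  maxConcurrency: number; // upper bound on parallel workers
  monitor: boolean;       // print live progress and resource usage to the console
}

// Build the options for one trial run with a given number of workers.
function sizingOptions(workers: number): SizingOptions {
  if (!Number.isInteger(workers) || workers < 1) {
    throw new RangeError('workers must be a positive integer');
  }
  return { maxConcurrency: workers, monitor: true };
}

// With the real library these would be passed along the lines of:
//   const cluster = await Cluster.launch({
//     concurrency: Cluster.CONCURRENCY_CONTEXT,
//     ...sizingOptions(10),
//   });
const trial = sizingOptions(10);
console.log(trial);
```

Repeating the run with a few different worker counts while watching the monitor output is one way to find the point where CPU or memory becomes the bottleneck.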

thomasdondorf avatar Sep 12 '18 18:09 thomasdondorf

  1. Add a mixed concurrency model, i.e., for the PAGE or CONTEXT concurrency model, have the option to distribute the jobs to more than one browser instance. That way a crash won't affect all jobs, which offers a good balance between reliability and resource usage.

  2. Add an API to return the length of the queue, the time when the oldest item in the queue was added, and the number of jobs processed in the last minute. For a continuously operating cluster, i.e., one where jobs are being added continuously, this information is valuable.
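
The three figures requested in item 2 could be tracked roughly as follows. This is a standalone sketch, not an existing puppeteer-cluster API: `QueueStatsTracker`, its method names, and the `QueueStats` shape are all hypothetical, and the sketch assumes jobs are started in the order they were queued.

```typescript
// Snapshot of the three figures requested above (hypothetical shape).
interface QueueStats {
  queueLength: number;            // jobs currently waiting
  oldestQueuedAt: number | null;  // epoch ms of the oldest waiting job, if any
  processedLastMinute: number;    // jobs finished within the last 60 seconds
}

class QueueStatsTracker {
  private readonly queuedAt: number[] = [];   // FIFO of enqueue timestamps
  private readonly finishedAt: number[] = []; // completion timestamps, oldest first
  private readonly now: () => number;

  // The clock is injectable so the tracker can be tested deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  jobQueued(): void {
    this.queuedAt.push(this.now());
  }

  // Assumes FIFO scheduling: the oldest waiting job is the one that starts.
  jobStarted(): void {
    this.queuedAt.shift();
  }

  jobFinished(): void {
    this.finishedAt.push(this.now());
  }

  snapshot(): QueueStats {
    const cutoff = this.now() - 60000;
    // Drop completions that fell out of the one-minute window.
    while (this.finishedAt.length > 0 && (this.finishedAt[0] as number) < cutoff) {
      this.finishedAt.shift();
    }
    return {
      queueLength: this.queuedAt.length,
      oldestQueuedAt: this.queuedAt.length > 0 ? (this.queuedAt[0] as number) : null,
      processedLastMinute: this.finishedAt.length,
    };
  }
}
```

Keeping only timestamps makes the tracker cheap enough to update on every job, and the sliding window is trimmed lazily when a snapshot is requested.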

j-manu avatar Sep 14 '18 10:09 j-manu

Unfortunately, the current implementation of custom concurrency doesn't address the case when you need to provide custom puppeteer parameters to jobInstances. IMHO this would effectively solve #36 with puppeteer args: [ '--incognito', '--proxy-server=${proxyServer}' ] and await page.authenticate(credentials).

@thomasdondorf , what do you think about this?
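
For reference, the parameters mentioned in this comment can be sketched as below. `ProxyConfig`, `launchArgsFor`, `credentialsFor`, and the example proxy values are hypothetical and introduced only for illustration. Note that the wiring shown in the trailing comment applies one set of arguments to the whole cluster, which is exactly the limitation described above: it does not vary them per job instance.

```typescript
// Hypothetical proxy settings; none of these values come from the thread.
interface ProxyConfig {
  server: string;   // e.g. "http://proxy.example:8080"
  username: string;
  password: string;
}

// Chromium flags from the comment above. A template literal (backticks) is
// needed for ${...} to interpolate; inside single quotes it stays literal.
function launchArgsFor(proxy: ProxyConfig): string[] {
  return ['--incognito', `--proxy-server=${proxy.server}`];
}

// Credentials object in the shape accepted by puppeteer's page.authenticate.
function credentialsFor(proxy: ProxyConfig): { username: string; password: string } {
  return { username: proxy.username, password: proxy.password };
}

// With the real libraries this would be wired up roughly as:
//   const cluster = await Cluster.launch({
//     concurrency: Cluster.CONCURRENCY_BROWSER,
//     puppeteerOptions: { args: launchArgsFor(proxy) },
//   });
//   await cluster.task(async ({ page, data }) => {
//     await page.authenticate(credentialsFor(proxy));
//     await page.goto(data);
//   });
const exampleProxy: ProxyConfig = {
  server: 'http://proxy.example:8080',
  username: 'user',
  password: 'secret',
};
console.log(launchArgsFor(exampleProxy));
```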

cyxou avatar Dec 21 '18 20:12 cyxou

I'm currently thinking about completely reworking the concurrency implementations. Then there would be no more "WorkerInstance" and "JobInstance". Just one function that is called when a page is needed. Then the concurrency implementation would have 100% flexibility when a puppeteer instance is started and when one is reused.

Expect some code changes in the next two weeks ;)

thomasdondorf avatar Dec 22 '18 12:12 thomasdondorf

Cool, glad to hear that. Feel free to ping me if you need any help)

cyxou avatar Dec 22 '18 13:12 cyxou

+1 for Docker container support. https://github.com/skalfyfan/dockerized-puppeteer

strarsis avatar Jun 29 '19 16:06 strarsis

Is there a way to connect the puppeteer-cluster to a remote instance of chromium? (“connect” instead of “launch”)

ermolaev1337 avatar Jul 23 '19 22:07 ermolaev1337

Hello - just wanted to get a feel for how active this project is. I see puppeteer cluster as being useful for several projects I'd like to work on. However, I'm hesitant to use it if development will be abandoned. Is development still happening? Thanks!

generic11 avatar Oct 15 '20 14:10 generic11

(Long-term runs of puppeteer-cluster #25) Make sure it's reliable and crawl more than 10 million pages with it (so far the maximum I crawled was ~800k pages)

I use k6 benchmarks in my CI tests for soketi to make sure that releases pass the benchmarks in most cases.

Would it be a good idea for me to set this up for you for page rendering tests?

rennokki avatar Jan 25 '22 12:01 rennokki