
Rate Limiting, Max Concurrency, Infinite Crawl & Additional Configurations

Open cpdata opened this issue 2 years ago • 8 comments

Initial Improvements

Main Additions

  • maxPagesToCrawl If set to 0, the crawler visits all matching URLs and shows progress as 1/∞.
  • maxConcurrency Sets the number of concurrent crawl requests. If it is undefined, the crawler opens as many parallel connections as it can, as the original did by default. It now defaults to 1 to avoid getting IP banned.
  • waitPerPageCrawlTimeoutRange Defaults to a range of 1 second to 1 second (a fixed 1-second delay), but can be set to produce a random delay between any two values in milliseconds, to avoid rate-limit rejections while crawling.
  • headless Is true by default, but can now be set in the config.ts file for situations that require it.
  • Improved README.md & config.ts documentation. (More to be done.)
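
As a rough illustration of how the options above could fit together, here is a minimal TypeScript sketch. It is not the code from this pull request. The `CrawlOptions` interface, the `{ min, max }` shape of the range, and the helpers `randomDelayMs` and `progressLabel` are hypothetical names introduced for this example. Only the option names and the stated defaults (maxConcurrency 1, a 1-second delay range, headless true) come from the description.

```typescript
// Hypothetical option shape. The field names follow the description above;
// the exact types used in the PR's config.ts are an assumption.
interface CrawlOptions {
  maxPagesToCrawl: number; // 0 means "crawl every matching URL"
  maxConcurrency?: number; // undefined means "no explicit cap"
  waitPerPageCrawlTimeoutRange: { min: number; max: number }; // milliseconds
  headless: boolean;
}

// Example values. maxConcurrency, the delay range, and headless mirror the
// defaults stated in the description; maxPagesToCrawl = 0 is just a demo value.
const exampleOptions: CrawlOptions = {
  maxPagesToCrawl: 0,
  maxConcurrency: 1,
  waitPerPageCrawlTimeoutRange: { min: 1000, max: 1000 },
  headless: true,
};

// Pick a whole-millisecond delay in [min, max], inclusive.
// `rng` is injectable so the function can be tested deterministically.
function randomDelayMs(
  range: { min: number; max: number },
  rng: () => number = Math.random,
): number {
  const lo = Math.min(range.min, range.max);
  const hi = Math.max(range.min, range.max);
  return Math.floor(lo + rng() * (hi - lo + 1));
}

// Format progress as "done/total", using ∞ when no page limit is set.
function progressLabel(done: number, maxPagesToCrawl: number): string {
  const total = maxPagesToCrawl === 0 ? "∞" : String(maxPagesToCrawl);
  return `${done}/${total}`;
}

const delay = randomDelayMs(exampleOptions.waitPerPageCrawlTimeoutRange);
const label = progressLabel(1, exampleOptions.maxPagesToCrawl);
```

With the default range of 1000 to 1000, `randomDelayMs` always yields the same one-second pause. Widening the range, for example to `{ min: 500, max: 1500 }`, spreads requests out less predictably.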

Full Summary

I would like to contribute to this project on a regular basis. I have a lot of experience with web scraping, AI/LLMs, CI/CD, and automation, and would like to discuss with the main collaborators where I can be of the most use.

cpdata avatar Dec 04 '23 13:12 cpdata

I applied Prettier formatting to the files that failed: README.md, src/config.ts, src/core.ts, and config.ts. I also added the jsdoc/typedoc formatting recommended by @marcelovicentegc in response to my original pull request #102. Additionally, I added a .prettierignore file.

cpdata avatar Dec 06 '23 23:12 cpdata

@marcelovicentegc this look good to you to merge?

steve8708 avatar Dec 22 '23 19:12 steve8708

> @marcelovicentegc this look good to you to merge?

Hey @steve8708! Happy new years! One rebase and a few nitpicks ☝️ and it occurs to me that we are good to go 🤗

marcelovicentegc avatar Jan 04 '24 14:01 marcelovicentegc

Please merge this branch ASAP!

Ademrobert avatar Jan 06 '24 19:01 Ademrobert