gpt-crawler
Rate Limiting, Max Concurrency, Infinite Crawl & Additional Configurations
Initial Improvements
Main Additions
- `maxPagesToCrawl`: if `0`, all matching URLs are crawled and progress displays as `1/∞`.
- `maxConcurrency`: sets the number of concurrent crawl requests. If left unset (`undefined`), it allows the maximum number of parallel connections, like the original default. It now defaults to `1` to avoid getting IP banned.
- `waitPerPageCrawlTimeoutRange`: defaults to a range of 1 second to 1 second, but can be set to create a random delay between any two numbers (in milliseconds) to avoid rate-limit rejection when crawling.
- `headless`: `true` by default, but can now be configured in the `config.ts` file for situations that require it.
- Improved `README.md` and `config.ts` documentation. (More to be done.)
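Taken together, the options above might look like this in `config.ts`. This is a minimal sketch: the `{ min, max }` shape for the delay range, the import path, and the other fields are assumptions based on the descriptions above, not the repo's exact `Config` type.

```ts
// Hypothetical config.ts sketch; everything beyond the option names
// described above (including the { min, max } range shape) is assumed.
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  maxPagesToCrawl: 0, // 0 = crawl every matching URL; progress shows 1/∞
  maxConcurrency: 1, // one request at a time to avoid getting IP banned
  waitPerPageCrawlTimeoutRange: { min: 1000, max: 3000 }, // random per-page delay in ms
  headless: true, // set to false when you need to watch the browser
};
```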
Full Summary
- Added `*.code-workspace` to `.gitignore` for VS Code workspaces saved in the root of the project. (commit: Add VSCode workspace file in .gitignore)
- Final output `.json` files go to the `outputs/` folder so they are not overwritten. (commit: Add outputs dir to .gitignore for final outputs)
- Dynamic domain + date-timestamp final output file name, e.g. `outputs/domain.com-2023-11-28-12:02:51.json` (see the filename sketch after this list). (commit: Add Dynamic OutputFileName based on date-timestamp)
- `maxPagesToCrawl`: if set to `0`, crawling continues for all matching URLs and progress displays the infinity symbol, e.g. `1/∞, 2/∞, 3/∞`, etc. (default: `50`). (commit: Allow maxPagesToCrawl to be optional and infinite by setting 0 which will display the infinity symbol)
- `maxConcurrency`: some sites automatically block connections to prevent DDoS attacks; this config sets how many concurrent requests run at a time (default: `1`). (commit: Added maxConcurrency config to set maximum concurrent parallel requests)
- Updates to `core.ts` to add config parameters for `maxPagesToCrawl`, `maxConcurrency`, `maxRequestsPerCrawl`, and `headless`.
- `waitPerPageCrawlTimeoutRange`: config added to set a random range in milliseconds between requests. Some sites automatically block connections, so this two-number object introduces a random delay between requests for rate-limit handling (default: `1000`). (commit: Update to core.ts for maxPagesToCrawl)
- `headless` is now a config option (default: `true`). (commit: Added headless mode as a configuration parameter)
- Random rate-limiting range with the `waitPerPageCrawlTimeoutRange` config (see the delay sketch after this list). (commit: Added waitPerPageCrawlTimeoutRange for a random range in milliseconds between page requests to help with rate limiting)
- One-line improvement to prevent a VS Code warning for a non-existent Docker container. (commit: Added ts-ignore for docker config.ts to prevent VSCode from declaring missing file that isn't created until the Docker is)
- Chunked data goes into the `storage` dir; final compiled JSON file outputs go into the new `outputs` directory. (commit: Added Output Directory for all outputFileName to go into so they aren't overwritten in storage)
- Added more variables to the `./config.ts` file for setting up the config in a more customized way, including the automatic naming convention `domain-timestamp.json`. (commit: Additions to dynamic url and match configurations in config.ts)
- Added details for `waitForSelectorTimeout` in the `README.md` file. (commit: Added waitForSelectorTimeout to README.md)
- Added additional Markdown and TypeScript formatting to the `config.ts` and `README.md` files. (commit: Adding details to README.md and config.ts as well as extra formatting)
- 13 commits, which hopefully makes review a little easier.
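For the rate-limit handling mentioned above, here is a minimal sketch of how a random delay in the configured range could be applied between page crawls, assuming the range is a `{ min, max }` object in milliseconds; the helper names are illustrative, not the repo's actual code.

```ts
// Hypothetical helpers; the { min, max } shape and names are assumptions.
function randomDelayMs(range: { min: number; max: number }): number {
  // Uniformly pick a delay within [min, max).
  return Math.floor(range.min + Math.random() * (range.max - range.min));
}

async function waitBetweenPageCrawls(range?: { min: number; max: number }) {
  if (!range) return; // no range configured: no extra delay between requests
  await new Promise((resolve) => setTimeout(resolve, randomDelayMs(range)));
}
```

Drawing a fresh random delay for each page makes the request timing less predictable than a fixed interval, which is the point of configuring a range rather than a single number.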
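And for the dynamic domain + date-timestamp output name, a hypothetical sketch that produces names like `outputs/domain.com-2023-11-28-12:02:51.json`; the repo's actual formatting code may differ.

```ts
// Illustrative only: derive an output path from the start URL's hostname
// plus the current date and time.
function outputFileNameFor(startUrl: string): string {
  const domain = new URL(startUrl).hostname;
  const now = new Date();
  const pad = (n: number) => String(n).padStart(2, "0");
  const stamp =
    `${now.getFullYear()}-${pad(now.getMonth() + 1)}-${pad(now.getDate())}` +
    `-${pad(now.getHours())}:${pad(now.getMinutes())}:${pad(now.getSeconds())}`;
  return `outputs/${domain}-${stamp}.json`;
}
```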
I would like to contribute to this project on a regular basis. I have a lot of web-scraping, AI/LLM, CI/CD, and automation experience, and would like to discuss with the main collaborators where I can be of the most use.
I updated with Prettier formatting for the files that failed: README.md, src/config.ts, src/core.ts, and config.ts. I also added the formatting for jsdoc/typedoc as recommended by @marcelovicentegc in response to my original pull request #102. Additionally, I added a .prettierignore file.
@marcelovicentegc does this look good to you to merge?
Hey @steve8708! Happy new year! One rebase and a few nitpicks ☝️ and it looks to me like we are good to go 🤗
Please merge this branch ASAP!