tiny-web-crawler

Feature: Add an option to crawl only a given list of URLs

Open indrajithi opened this issue 1 year ago • 5 comments

  • Accept an argument from the user, something like url_list.
  • Crawl only the URLs provided by the user in that argument, and nothing else.

indrajithi avatar Jun 15 '24 03:06 indrajithi

Wouldn't that be set by the Spider.max_links value?

lodenrogue avatar Jun 15 '24 03:06 lodenrogue

@lodenrogue max_links is basically the maximum number of hops the crawler will make. Say we start from github.com as the root URL. In the first crawl we fetch all the links on github.com, and then recursively crawl each of the fetched links until the max_links count is reached.

E.g., say we find three links from the root URL: [URL1, URL2, URL3]. If max_links is set to 2, we will only crawl [URL1, URL2] and fetch the links on those pages.
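The hop-limited behavior described above could be sketched roughly like this. This is a hedged illustration, not the actual tiny-web-crawler implementation: the `LINK_GRAPH` dict is a hard-coded stand-in for real page fetching, and the exact counting semantics of `max_links` (whether the root counts toward the cap) may differ in the real library.

```python
from collections import deque

# Stand-in for pages and the links found on them (no real network calls).
LINK_GRAPH = {
    "https://github.com": ["URL1", "URL2", "URL3"],
    "URL1": ["URL4"],
    "URL2": [],
    "URL3": ["URL5"],
}

def crawl(root: str, max_links: int) -> list[str]:
    """Breadth-first crawl from root, stopping once max_links pages
    (beyond the root itself) have been visited."""
    crawled: list[str] = []
    queue = deque([root])
    seen = {root}
    while queue and len(crawled) < max_links + 1:  # +1 for the root
        url = queue.popleft()
        crawled.append(url)
        for link in LINK_GRAPH.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled

print(crawl("https://github.com", max_links=2))
# → ['https://github.com', 'URL1', 'URL2']
```

With max_links=2, only URL1 and URL2 get crawled after the root, matching the example above; URL3 is discovered but never fetched.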

For this feature, we expect the crawler to fetch only the URLs provided by the user and nothing more. The list of URLs to crawl will be a custom set supplied by the user as input; there will be no root URL and no recursive hops.
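A minimal sketch of the requested mode might look like the following. Note this is only an illustration of the intent, not a proposed implementation: `fetch_links` is a hypothetical stand-in for the crawler's real fetcher (here backed by a hard-coded dict instead of HTTP requests), and the function and parameter names are assumptions.

```python
def fetch_links(url: str) -> list[str]:
    """Placeholder fetcher: a real implementation would download the page
    and extract its <a href> targets."""
    fake_pages = {"URL1": ["URL4"], "URL2": [], "URL3": ["URL5"]}
    return fake_pages.get(url, [])

def crawl_url_list(url_list: list[str]) -> dict[str, list[str]]:
    """Crawl exactly the URLs given: record each page's outgoing links
    but never follow them (no root URL, no hops)."""
    return {url: fetch_links(url) for url in url_list}

print(crawl_url_list(["URL1", "URL2", "URL3"]))
# → {'URL1': ['URL4'], 'URL2': [], 'URL3': ['URL5']}
```

The key difference from the existing behavior is that discovered links are reported but never enqueued for further crawling.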

indrajithi avatar Jun 15 '24 03:06 indrajithi

For example, with url_list = [URL1, URL2, URL3], we would loop through url_list and fetch each link, but there would be no root URL. If I am understanding it correctly, I would love to solve it; please assign this issue to me.

faisalalisayyed avatar Jun 16 '24 08:06 faisalalisayyed

Hi @C0DE-SLAYER.

Please let us know if you are working on this.

indrajithi avatar Jun 17 '24 21:06 indrajithi

@indrajithi yes, I will open a PR today.

faisalalisayyed avatar Jun 18 '24 05:06 faisalalisayyed