tiny-web-crawler

Feature: Add an option to crawl only a given list of URLs

Open indrajithi opened this issue 1 year ago • 5 comments

  • Accept an argument from the user, something like url_list.
  • Crawl only the URLs provided by the user in that argument, and nothing else.

indrajithi avatar Jun 15 '24 03:06 indrajithi

Wouldn't that be set by the Spider.max_links value?

lodenrogue avatar Jun 15 '24 03:06 lodenrogue

@lodenrogue max_links is basically the maximum number of hops the crawler will make. Say we start from github.com as the root URL. In the first crawl we fetch all the links on github.com, and then recursively crawl each of the fetched links until the max_links count is reached.

E.g., say we find three links from the root URL: [URL1, URL2, URL3]. If max_links is set to 2, we will only crawl [URL1, URL2] and fetch the links on those pages.
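The hop-limited behavior described above could be sketched roughly like this. This is a hedged illustration, not the actual tiny-web-crawler implementation: the `LINK_GRAPH` dict is a hard-coded stand-in for real page fetching, and the exact counting semantics of `max_links` (whether the root counts toward the cap) may differ in the real library.

```python
from collections import deque

# Stand-in for pages and the links found on them (no real network calls).
LINK_GRAPH = {
    "https://github.com": ["URL1", "URL2", "URL3"],
    "URL1": ["URL4"],
    "URL2": [],
    "URL3": ["URL5"],
}

def crawl(root: str, max_links: int) -> list[str]:
    """Breadth-first crawl from root, stopping once max_links pages
    (beyond the root itself) have been visited."""
    crawled: list[str] = []
    queue = deque([root])
    seen = {root}
    while queue and len(crawled) < max_links + 1:  # +1 for the root
        url = queue.popleft()
        crawled.append(url)
        for link in LINK_GRAPH.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled

print(crawl("https://github.com", max_links=2))
# → ['https://github.com', 'URL1', 'URL2']
```

With max_links=2, only URL1 and URL2 get crawled after the root, matching the example above; URL3 is discovered but never fetched.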

For this feature, we expect the crawler to fetch only the URLs provided by the user and nothing more. The list of URLs to crawl will be a custom set supplied by the user as input; there will be no root URL and no recursive hops.
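A minimal sketch of the requested mode might look like the following. Note this is only an illustration of the intent, not a proposed implementation: `fetch_links` is a hypothetical stand-in for the crawler's real fetcher (here backed by a hard-coded dict instead of HTTP requests), and the function and parameter names are assumptions.

```python
def fetch_links(url: str) -> list[str]:
    """Placeholder fetcher: a real implementation would download the page
    and extract its <a href> targets."""
    fake_pages = {"URL1": ["URL4"], "URL2": [], "URL3": ["URL5"]}
    return fake_pages.get(url, [])

def crawl_url_list(url_list: list[str]) -> dict[str, list[str]]:
    """Crawl exactly the URLs given: record each page's outgoing links
    but never follow them (no root URL, no hops)."""
    return {url: fetch_links(url) for url in url_list}

print(crawl_url_list(["URL1", "URL2", "URL3"]))
# → {'URL1': ['URL4'], 'URL2': [], 'URL3': ['URL5']}
```

The key difference from the existing behavior is that discovered links are reported but never enqueued for further crawling.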

indrajithi avatar Jun 15 '24 03:06 indrajithi

For example, with url_list = [URL1, URL2, URL3], we would loop through url_list and fetch each link, but there would be no root URL. If I am understanding it correctly, I would love to solve it; please assign this issue to me.

faisalalisayyed avatar Jun 16 '24 08:06 faisalalisayyed

Hi @C0DE-SLAYER.

Please let us know if you are working on this.

indrajithi avatar Jun 17 '24 21:06 indrajithi

@indrajithi yes, I will open a PR today.

faisalalisayyed avatar Jun 18 '24 05:06 faisalalisayyed