Follow all links within a specific container

Open cloud1250x4 opened this issue 9 years ago • 1 comments

Hi,

Your project is really interesting! I was wondering whether it is possible to make it follow links within a specific container in the first loaded page? Or to manually select which links to follow...

First loaded page
----> Container with links that we want to scrape (we scrape the main info here)
----------> Container in the page that contains links that we want to scrape, different from the first page (we scrape the rest of the info here)

Since the links don't follow any particular pattern (domain.com/abc1, domain.com/cba2), is it possible to use your project in such a manner?

Thank you!

cloud1250x4 · May 05 '16 19:05

hi, sorry for the late reply. you could manually handle the queuing of urls by removing the recurse extension.

doing so would mean that the crawler would only crawl a single url unless you added more urls manually. you can use the links extension to extract links from pages and then use spider.queue.add() to add them to the queue.

since you have control there you can decide which urls get added.

for a complete example, see the recurse extension source: https://github.com/missinglink/huntsman/blob/master/lib/extension/recurse.js
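
as a rough sketch of the manual queuing described above (not huntsman's own code): the helper below takes the hrefs you pulled out of the container, resolves them against the page url, drops duplicates, and only queues the ones your own filter accepts. `queueContainerLinks`, `allow` and `stubQueue` are hypothetical names made up for illustration; the only thing assumed about the queue is that it has an `add()` method, as `spider.queue.add()` does. how you get the hrefs out of the container (for instance via the links extension) is left out here.

```javascript
// Hypothetical helper, not part of huntsman: decides which links get queued.
// `queue`   - any object with an add(url) method (e.g. spider.queue)
// `pageUrl` - url of the page the hrefs were found on
// `hrefs`   - raw href values taken from the container you care about
// `allow`   - your own filter; receives a URL object, returns true to queue it
function queueContainerLinks(queue, pageUrl, hrefs, allow) {
  const seen = new Set();
  const added = [];

  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, pageUrl); // resolve relative links against the page
    } catch (e) {
      continue; // skip hrefs that cannot be parsed
    }

    url.hash = ''; // treat /abc1 and /abc1#top as the same page
    const absolute = url.href;

    if (seen.has(absolute)) continue; // already handled on this page
    seen.add(absolute);

    if (!allow(url)) continue; // this is where you decide what gets crawled

    queue.add(absolute);
    added.push(absolute);
  }

  return added;
}

// Usage with a stand-in queue; in a real crawl you would pass spider.queue.
const stubQueue = {
  items: [],
  add(url) { this.items.push(url); }
};

const added = queueContainerLinks(
  stubQueue,
  'http://domain.com/start',
  ['/abc1', 'cba2', '/abc1#top', 'http://other.example/x'],
  (url) => url.hostname === 'domain.com' // only stay on the same host
);

console.log(added);
```

because the filter is just a function, links with no common pattern (domain.com/abc1, domain.com/cba2) are fine: what matters is which container they came from, not what they look like.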

missinglink · Aug 15 '16 13:08