Add support for robots.txt

Open pinscript opened this issue 11 years ago • 4 comments

We should have built-in features to handle robots.txt.

pinscript, Feb 25 '14 07:02

Is it still open for contribution?

adeelkhalid992, May 25 '17 12:05

Hi,

It is, but I do not use this project anymore. I'd recommend sjdirect/abot or another crawler that uses the new async HttpClient.

pinscript, May 26 '17 08:05

It would be much appreciated if you could tell me which pattern you used for handling threads and for fetching links from websites, so that it is easier for me to dig into this project and update it with the latest C# concepts.

adeelkhalid992, May 31 '17 07:05

Sorry for the late reply, been away.

There is not much to it really. I am using the Asynchronous Programming Model, APM (https://msdn.microsoft.com/en-us/library/ms228963(v=vs.110).aspx), in https://github.com/alexandernyquist/Harvest/blob/master/Harvest/Downloader.cs.

Then there are two threads that control all the Downloaders, in https://github.com/alexandernyquist/Harvest/blob/master/Harvest/DownloaderQueue.cs.

Links are scraped using HtmlAgilityPack in the Page class (https://github.com/alexandernyquist/Harvest/blob/master/Harvest/Page.cs).

But please, do not build on this code. It is much better to use modern techniques such as async/await with System.Net.Http.HttpClient.

I can spike out an example for you if you'd like?
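
For a rough idea of that direction, here is a minimal sketch, not code from Harvest: pages are downloaded with async/await and a shared HttpClient, and links are pulled out with HtmlAgilityPack. The names `CrawlSketch`, `ExtractLinks`, `CrawlAsync` and `maxParallel` are made up for illustration, and the sketch assumes the HtmlAgilityPack NuGet package is referenced. Error handling, retries and robots.txt checks are deliberately left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack; // NuGet package (assumed to be installed)

// Hypothetical helper class; nothing here exists in the Harvest repository.
public static class CrawlSketch
{
    // One HttpClient shared by all requests instead of one per download.
    private static readonly HttpClient Client = new HttpClient();

    // Pure function: returns the absolute http/https links found in an HTML string.
    public static List<Uri> ExtractLinks(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<Uri>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (var anchor in anchors)
        {
            var href = anchor.GetAttributeValue("href", "");

            // Resolve relative links against the page URL and skip mailto:, javascript:, etc.
            if (Uri.TryCreate(baseUri, href, out var absolute) &&
                (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                links.Add(absolute);
            }
        }
        return links;
    }

    // Downloads the given pages with at most maxParallel requests in flight
    // and maps each page URL to the links found on it.
    // A failed request (HttpRequestException) propagates to the caller.
    public static async Task<Dictionary<Uri, List<Uri>>> CrawlAsync(
        IEnumerable<Uri> urls, int maxParallel)
    {
        using (var gate = new SemaphoreSlim(maxParallel))
        {
            var tasks = urls.Distinct().Select(async url =>
            {
                await gate.WaitAsync();
                try
                {
                    var html = await Client.GetStringAsync(url);
                    return new KeyValuePair<Uri, List<Uri>>(url, ExtractLinks(html, url));
                }
                finally
                {
                    gate.Release();
                }
            }).ToList();

            var results = await Task.WhenAll(tasks);
            return results.ToDictionary(r => r.Key, r => r.Value);
        }
    }
}
```

Calling `await CrawlSketch.CrawlAsync(new[] { new Uri("http://example.com/") }, 4)` would fetch one page and return its links. In this sketch the `SemaphoreSlim` takes over the role of the controller threads: it caps concurrency, and no thread is blocked while a request is waiting on the network.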

pinscript, Jun 05 '17 05:06