WebReaper

Enhancement - End Engine's task once it's done scraping and reached all the target pages available.

Open bogdan799 opened this issue 1 year ago • 2 comments

Hello,

First of all - I very much appreciate the work you did for this project. I've just tried the library and man it's cool, works like magic and is very configurable.

As far as I understand, the main use case is a long-running engine that collects lots of data from a website and stores it somewhere. However, it might also be very useful when there's a finite amount of data to retrieve and this has to be done in finite time - say, navigate a few pages, parse some data and return it. And as far as I can tell, that's very hard to achieve here.

Two possible ways I found to get data and parse it into an object are either using Subscribe() and deserializing the JObject into my type, or implementing my own IScraperSink and storing the data there for further use. I'm fine with both solutions and I've tested them - they both work perfectly. However, when we start the engine, it never stops even when there's nothing left to parse, because no one closes the Channel - it stays open forever, so the AsyncEnumerable never ends.
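
For illustration, here is a rough sketch of the sink approach (the exact member name and parameters of IScraperSink are my assumption here - check the actual interface before copying this):

using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

// Collects every scraped JObject so the caller can read the results after the run.
// EmitAsync(JObject, CancellationToken) is an assumed shape of the sink callback.
public class CollectingSink : IScraperSink
{
    public ConcurrentBag<JObject> Items { get; } = new();

    public Task EmitAsync(JObject scrapedData, CancellationToken cancellationToken = default)
    {
        Items.Add(scrapedData);
        return Task.CompletedTask;
    }
}

The problem is that even once Items contains everything, the engine's RunAsync never returns, so there is no natural point at which to stop and read the results.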

Therefore, I propose a change where the engine stores the current parse status in the form of a tree, and once we reach the state where all the leaf pages are of the TargetPage PageCategory, we close the channel and let Parallel.ForEachAsync stop its execution, returning control from the engine and allowing us to actually await the engine's run before retrieving the results (see the sketch below). It might not be the perfect approach - I'm sure it isn't - it's just the first thing that came to mind that could work; maybe you have different ideas.
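
To illustrate the mechanism with plain .NET Channels (this is not WebReaper code, just the general pattern I have in mind): once the producer completes the writer, ReadAllAsync() ends and Parallel.ForEachAsync finishes, so the caller's await returns.

using System.Threading.Channels;

var channel = Channel.CreateUnbounded<string>();

var producer = Task.Run(async () =>
{
    foreach (var url in new[] { "/page-1", "/page-2", "/page-3" })
        await channel.Writer.WriteAsync(url);

    // The key step: completing the writer ends ReadAllAsync below.
    // Without it, the consumer waits forever - which is what the engine does today.
    channel.Writer.Complete();
});

await Parallel.ForEachAsync(channel.Reader.ReadAllAsync(), async (url, ct) =>
{
    await Task.Yield(); // stand-in for "download and parse the page"
    Console.WriteLine($"processed {url}");
});

await producer; // everything is done here; results can be returned to the caller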

Please let me know what you think about this and whether you have plans and time for this enhancement.

Thank you, Bogdan

bogdan799 · Jul 11 '23 21:07

Hi Bogdan,

Appreciate your feedback and suggestions! It's been quite intense at work, so I had little time for improvements.

I like your idea and plan to implement it one way or another. At the moment the only way to stop the engine is to specify the page crawl limit beforehand:

var engine = await new ScraperEngineBuilder()
    ...
    .PageCrawlLimit(100)
    .BuildAsync();

pavlovtech · Aug 11 '23 00:08

This should be documented in the Readme. I think it's a common use case to run this in the background somewhere, and you'd expect it to stop on its own when it's done.

Another option: if you know you don't have that many links to scrape, you can use a CancellationToken to timebox it.

var cts = new CancellationTokenSource();

cts.CancelAfter(TimeSpan.FromMinutes(5));

try
{
    await engine.RunAsync(cts.Token);
}
catch (OperationCanceledException)
{
    // expected: the run was cancelled by the 5-minute timeout
}
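
And if the run also has to react to an external shutdown signal, the timeout token can be linked with another token. hostShutdownToken below is a hypothetical stand-in for whatever your host exposes (for example IHostApplicationLifetime.ApplicationStopping):

using var timeout = new CancellationTokenSource(TimeSpan.FromMinutes(5));
// hostShutdownToken is hypothetical - substitute the token your host actually provides
using var linked = CancellationTokenSource.CreateLinkedTokenSource(timeout.Token, hostShutdownToken);

try
{
    await engine.RunAsync(linked.Token);
}
catch (OperationCanceledException)
{
    // fired by either the 5-minute timeout or the host shutdown
}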

Marcel0024 · Jun 16 '24 15:06