
HTTPCollector duplicates listeners when multiple crawlers have set them

Open · dutsuwak opened this issue 2 years ago · 1 comment

Hello!

I have been running some tests with multiple crawlers, each set up with a listener for a crawl event. When the HttpCrawlerConfigs are added to the HttpCollector, the listeners get duplicated, so the logic in my program is invoked multiple times.

Simplified example:

HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
List<HttpCrawlerConfig> httpCrawlerConfigs = new ArrayList<>();

for (int i = 0; i < urlsList.length; i++) {
    var httpCrawlerConfig = new HttpCrawlerConfig();
    httpCrawlerConfig.setEventListeners(new CrawlEventListener());

    httpCrawlerConfigs.add(httpCrawlerConfig);
}

HttpCrawlerConfig[] crawlerConfigs =
        httpCrawlerConfigs.toArray(new HttpCrawlerConfig[0]);
collectorConfig.setCrawlerConfigs(crawlerConfigs);

// From the debugging I did, it seems to happen when the collector scans the
// crawler configs here and duplicates the listeners in the event manager.
var collector = new HttpCollector(collectorConfig);
collector.start();
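
For reference, CrawlEventListener is my own class. Stripped down, it is roughly this (a simplified sketch, not my exact code; the interface and package names are how I recall them from Norconex Commons Lang, so treat them as assumptions):

import com.norconex.commons.lang.event.Event;
import com.norconex.commons.lang.event.IEventListener;

// Simplified sketch of the listener: with N crawler configs registered,
// the body below ends up running N times for every single event.
public class CrawlEventListener implements IEventListener<Event> {
    @Override
    public void accept(Event event) {
        // My application logic; it runs once per registered listener
        // instead of once per event of "its" crawler.
        System.out.println("Crawl event: " + event.getName());
    }
}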

As a workaround I set the listener on the first HttpCrawlerConfig only, but I think it should be possible to use a separate listener for each crawler.

Regards, Fabian

dutsuwak · May 02 '22 17:05

Hello Fabian!

Technically, the listeners are not duplicated. Rather, each listener is invoked for ALL events fired within the collector that match the type of your listener's "accept" method argument, and that includes events from other crawlers.

This is by design, as there can be legitimate cases where one crawler wants to know what is happening in another crawler. I understand it is not the most intuitive behaviour, though.
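
In the meantime, you can approximate per-crawler isolation yourself by having the listener ignore events that do not come from the crawler it was registered for. A rough sketch, untested and written from memory, so double-check the class names and the assumption that crawler events carry the crawler as their source against your version:

import com.norconex.collector.core.crawler.Crawler;
import com.norconex.commons.lang.event.Event;
import com.norconex.commons.lang.event.IEventListener;

// Sketch only: reacts to events from a single, named crawler and
// silently ignores everything fired by other crawlers in the collector.
public class ScopedCrawlEventListener implements IEventListener<Event> {

    private final String crawlerId;

    public ScopedCrawlEventListener(String crawlerId) {
        this.crawlerId = crawlerId;
    }

    @Override
    public void accept(Event event) {
        // Crawler events have the originating crawler as their source;
        // skip anything that belongs to a different crawler.
        if (event.getSource() instanceof Crawler
                && !crawlerId.equals(((Crawler) event.getSource()).getId())) {
            return;
        }
        // ... per-crawler logic here ...
    }
}

Each crawler config would then get its own instance, e.g. new ScopedCrawlEventListener(httpCrawlerConfig.getId()), assuming you give each config a unique id.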

Since event listeners can be configured at both the collector level and the crawler level, it would make sense to imply an event hierarchy there and to isolate a listener from other crawlers' events when it is registered on a specific crawler only.

Since there are valid use cases for both approaches, I think we need to make this more flexible: offer an easy way to adjust the listening scope, and maybe change the default behaviour to the most intuitive one.
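
To illustrate the direction I have in mind, it could eventually look something like this (purely hypothetical, nothing like this exists today):

// Hypothetical only: an explicit listening scope when registering a
// listener on a crawler config, defaulting to per-crawler isolation.
public enum EventListenerScope {
    CRAWLER,   // only events fired by the crawler the listener is set on
    COLLECTOR  // current behaviour: every event fired within the collector
}

// e.g. (again, hypothetical):
// httpCrawlerConfig.setEventListeners(
//         EventListenerScope.CRAWLER, new CrawlEventListener());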

I will mark this as a feature request.

essiembre · May 12 '22 05:05