HTTP Collector and Hadoop
Will HTTP Collector work with Hadoop in the near future?
Not on the radar... yet. I thought this question would come earlier, but you are the first one to ask! :-)
Our focus has been maximum flexibility/extensibility over maximum quantity. In other words, "how many things can you do with one instance" over "how many docs can you process with multiple instances". It matches what we felt was needed the most (and matches the requests we get).
You can still have many different instances of the collector running in parallel with different start URLs or filters, to crawl many millions of pages.
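To give a rough idea of that approach, here is a minimal sketch of what one such instance's configuration might look like, written in the collector's XML configuration format. The site URL, directory paths, ids, and filter class shown are hypothetical placeholders, and element and class names can differ between collector versions, so treat this as an illustration rather than a reference:

<httpcollector id="site-a-collector">
  <!-- Keep progress, log, and work directories separate per instance
       so parallel instances do not step on each other. -->
  <progressDir>./progress/site-a</progressDir>
  <logsDir>./logs/site-a</logsDir>
  <crawlers>
    <crawler id="site-a-crawler">
      <workDir>./work/site-a</workDir>
      <!-- Each instance gets its own start URLs... -->
      <startURLs>
        <url>http://site-a.example.com/</url>
      </startURLs>
      <!-- ...and its own filters to keep it within its slice of the web. -->
      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
                onMatch="include">
          http://site-a\.example\.com/.*
        </filter>
      </referenceFilters>
    </crawler>
  </crawlers>
</httpcollector>

A second instance would follow the same layout with its own id, directories, start URLs, and filters, and the two can simply be launched as separate processes.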
For cases where you want to crawl the whole internet or just truly massive sites, a distributed crawl environment would indeed be more practical.
Maybe it is time we start thinking about this. Do you want to make this a feature request?
It should not be that hard to modify the collector code to have instances runnable in a Hadoop cluster and share the processing of tons of URLs. Is that what you envision, or do you have something else in mind? Would you like the collector to take care of setting up the cluster itself and spawning instances according to whatever configuration you provide?
Maybe it is time we start thinking about this.
Please do. Although not everybody needs it, I think there is a niche for it on Windows.
Would you like the collector to take care of setting up the cluster itself and spawning instances according to whatever configuration you provide?
You are the expert. Everything you propose sounds good to me. Just keep applying the same "flexibility/extensibility" mindset, now to "maximum quantity".
Do you want to make this a feature request?
Yes, I do.
I am marking the integration with Hadoop as a feature request with no set release in mind. I'll pay attention to the demand. Anybody else reading this can chime in if they have a need for it as well.
Thank you