
Multithreading

Open ethicalhack3r opened this issue 13 years ago • 13 comments

Hi there,

I was wondering if it would be possible to multithread the spidr gem? I don't know much about multithreading in Ruby, but I believe only Ruby 1.9.x supports it?

I had a look through the source but couldn't find where the spidr gem makes its http requests.

Maybe something like Typhoeus could be used? (http://rubygems.org/gems/typhoeus)

Thanks, Ryan

ethicalhack3r avatar Jun 06 '11 17:06 ethicalhack3r

This is possible, but difficult. The main problem is a race condition between the url/page callbacks and the requesting of pages: the callbacks could modify the filtering rules while another thread is requesting a page that has suddenly become unwanted. The second problem is that Spidr currently uses persistent HTTP connections, so I'm unsure how much multi-threading would improve performance. We've been looking at alternative HTTP libraries, but they all have various pros/cons.

postmodern avatar Jun 06 '11 20:06 postmodern

Thanks for the quick response. I don't know too much about multi-threading; maybe X persistent HTTP connections could be opened?

Either way, it seems like a difficult task to achieve.

ethicalhack3r avatar Jun 07 '11 10:06 ethicalhack3r

If you decide to go with it, I'd give Celluloid a look. Alas, it is Ruby 1.9 only due to its use of fibers. But it's a pretty nice library.

nirvdrum avatar May 08 '12 02:05 nirvdrum

I'm considering switching to net-http-persistent, with a thread pool for requests and mutexes around adding filters.
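For illustration only (this is not Spidr's actual code, and the class and method names here are hypothetical), the "mutexes around adding filters" idea could look like a shared rule set guarded by a Mutex, so callbacks on one thread can add rules while worker threads consult them:

```ruby
# Hypothetical sketch: shared filter rules guarded by a Mutex, so that
# url/page callbacks can add rules while worker threads test URLs.
class FilterRules
  def initialize
    @mutex = Mutex.new
    @ignore_patterns = []
  end

  # Called from callbacks, possibly on another thread.
  def ignore(pattern)
    @mutex.synchronize { @ignore_patterns << pattern }
  end

  # Called by worker threads before requesting a URL.
  def wanted?(url)
    @mutex.synchronize { @ignore_patterns.none? { |p| url.match?(p) } }
  end
end

rules = FilterRules.new
rules.ignore(/logout/)
puts rules.wanted?("http://example.com/page")   # true
puts rules.wanted?("http://example.com/logout") # false
```

The Mutex serializes both reads and writes, which avoids the race but adds lock contention on every URL check.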

postmodern avatar May 08 '12 03:05 postmodern

+1, this seems to be the best spider/crawling library out, and this would be a great feature.

grrowl avatar Oct 14 '13 09:10 grrowl

What happened with this request?

dadamschi avatar Dec 01 '15 20:12 dadamschi

I don't have the time currently to work on such a large feature.

postmodern avatar Dec 01 '15 21:12 postmodern

It's been a year; any chance you have time to work on such a feature now? :-)

ZeroChaos- avatar Jan 04 '17 15:01 ZeroChaos-

I've written more than one crawler in my career, and if you didn't make it multi-threaded from the start, it is damn hard to retrofit. That said, I think the overall goal here is throughput rather than threads. If the discovered URLs can be surfaced to an external queue (Redis or SQS), that would change the equation: rather than threads, you simply run more instances (or containers) of Spidr and let the queue handle distribution of work across N copies.

Thoughts?

fuzzygroup avatar Apr 07 '17 20:04 fuzzygroup

A distributed Spidr is a little out of scope, or at least further down the road.

Multi-threading here is mainly to address blocking I/O while waiting on responses from the HTTP sessions. Luckily, net-http-persistent is already thread-aware. We'd just need to replace the spidering loop with a producer/consumer thread pool. Each thread would have its own session cache via net-http-persistent, dequeue URLs, and enqueue the responses/Pages. All additional logic with headers and HTML parsing would still be done in the main thread, to avoid additional Mutex complexity. There's probably other hidden work and locking issues in the details.
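A minimal sketch of that producer/consumer shape, with a stubbed fetch lambda standing in for a per-thread net-http-persistent session (the queue wiring and names are illustrative, not Spidr's actual API):

```ruby
# Hypothetical sketch: the main thread enqueues URLs, a pool of worker
# threads dequeues them, performs the (stubbed) request, and pushes the
# responses back to the main thread via a second queue.
POOL_SIZE = 4

url_queue      = Queue.new
response_queue = Queue.new

# Stub standing in for a per-thread net-http-persistent session.
fetch = ->(url) { "response for #{url}" }

workers = POOL_SIZE.times.map do
  Thread.new do
    # Queue#pop blocks until work arrives; a nil sentinel ends the loop.
    while (url = url_queue.pop)
      response_queue << [url, fetch.call(url)]
    end
  end
end

urls = %w[http://example.com/a http://example.com/b http://example.com/c]
urls.each { |u| url_queue << u }

# Main thread consumes responses; header logic and HTML parsing
# would happen here, keeping them out of the worker threads.
responses = urls.size.times.map { response_queue.pop }

POOL_SIZE.times { url_queue << nil } # signal workers to exit
workers.each(&:join)

responses.each { |url, body| puts "#{url} -> #{body}" }
```

Because only the workers block on I/O and only the main thread parses pages, the filtering rules never need to be touched from more than one thread in this arrangement.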

postmodern avatar Apr 08 '17 02:04 postmodern

+1, a producer/consumer for the requests would be awesome! I really like the interface of your library by the way.

vwochnik avatar Sep 24 '17 14:09 vwochnik

I don't understand "a producer/consumer for the requests"...


dadamschi avatar Sep 24 '17 16:09 dadamschi

I mean a producer/consumer pattern where a pool of worker threads that do the requesting is connected to the main thread with queues, like an assembly line.

The main thread puts every request it wants resolved into a queue; any worker thread can pick a task from that queue, perform the request, and put the result into a finished-responses queue that is read by the main thread. This way the main thread does no requesting (i.e., blocking activity) itself, which leads to a speedup.

vwochnik avatar Sep 24 '17 16:09 vwochnik