spidr
SSL session reuse may fail
I've just run into a situation where reusing an SSL session raised an exception and Spidr subsequently skipped the page. The exception is currently swallowed silently, so I modified the code to capture the following trace:
    EOFError (end of file reached):
    /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/openssl/buffering.rb:174:in `sysread_nonblock'
    /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/openssl/buffering.rb:174:in `read_nonblock'
    /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:141:in `rbuf_fill'
    /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:122:in `readuntil'
    /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:132:in `readline'
    /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:2562:in `read_status_line'
    /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:2551:in `read_new'
    /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1319:in `block in transport_request'
    /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1316:in `catch'
    /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1316:in `transport_request'
    /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1293:in `request'
    rest-client (1.6.7) lib/restclient/net_http_ext.rb:51:in `request'
    /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1026:in `get'
    spidr (0.4.1) lib/spidr/agent.rb:513:in `block in get_page'
    spidr (0.4.1) lib/spidr/agent.rb:684:in `prepare_request'
    spidr (0.4.1) lib/spidr/agent.rb:512:in `get_page'
    app/models/cookie_login_option.rb:150:in `fetch_remote_form'
    app/models/cookie_login_option.rb:158:in `block in fetch_remote_form'
    spidr (0.4.1) lib/spidr/agent.rb:518:in `block in get_page'
    spidr (0.4.1) lib/spidr/agent.rb:684:in `prepare_request'
    spidr (0.4.1) lib/spidr/agent.rb:512:in `get_page'
    app/models/cookie_login_option.rb:150:in `fetch_remote_form'
    app/models/cookie_login_option.rb:158:in `block in fetch_remote_form'
    spidr (0.4.1) lib/spidr/agent.rb:518:in `block in get_page'
    spidr (0.4.1) lib/spidr/agent.rb:684:in `prepare_request'
    spidr (0.4.1) lib/spidr/agent.rb:512:in `get_page'
If I modify the code to remove the session cache, I am able to fetch the page okay. It might be good to catch EOFError and retry with a new session in the event this happens. Catching the error all over the place could be messy though.
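To make the retry idea concrete, here is a rough sketch of how the rescue could live in one place instead of around every request. The helper name `with_fresh_session_retry` and the `reset_session` callable are hypothetical and not part of Spidr 0.4.1. The stubbed `sessions` hash only stands in for whatever cache actually holds the connection.

```ruby
# Hypothetical helper (not Spidr API): run the request block and, if the
# connection hits EOF, discard the cached session and try again.
# reset_session is any callable that drops the stale session for the host.
def with_fresh_session_retry(reset_session, max_retries: 1)
  attempts = 0
  begin
    yield
  rescue EOFError
    # Give up and re-raise once the retry budget is spent.
    raise if attempts >= max_retries
    attempts += 1
    reset_session.call
    retry
  end
end

# Stand-in for a session cache keyed by host (an assumption for this demo).
sessions = { 'example.com' => :stale }

page = with_fresh_session_retry(-> { sessions.delete('example.com') }) do
  # Simulate the failure: a stale cached session produces EOFError.
  raise EOFError, 'end of file reached' if sessions.key?('example.com')
  :page
end

puts page.inspect
```

With something along these lines, only the code path that issues requests would need to know about the retry, and any EOFError that persists after the fresh session still propagates to the caller.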
Could this be a version issue? I had something similar happen with a simple spider that printed the URLs from a site. Under ree it would fail, while under Ruby 2.0.0 it would work fine.