wombat
400 Bad Request on some websites.
Hello, I noticed some strange behaviour in Wombat. I want to crawl two websites. At first I was using Typhoeus and regular expressions, but one website constantly returned a 302. Then I found Wombat, and it works perfectly for that website. When I try Wombat on the other website, however, I get this error:
/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:308:in `fetch': 400 => Net::HTTPBadRequest for "THE_WEBSITE_URL" -- unhandled response (Mechanize::ResponseCodeError)
The URL is correct: I tried it in the browser and it worked. Can somebody help me with this? Also, I don't have puts in front of Wombat.crawl do ..., because I saw that mentioned as a possible problem too. Thank you in advance, and sorry for my English!
Can you share the exact URL that is causing the problem? Under the hood, Wombat uses Mechanize to request the page, so it could be either a Mechanize bug or a misconfiguration.
So here is the full response:
/Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:308:in `fetch': 400 => Net::HTTPBadRequest for *the_url* -- unhandled response (Mechanize::ResponseCodeError)
from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:976:in `response_redirect'
from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:300:in `fetch'
from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize.rb:440:in `get'
from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/wombat-2.3.0/lib/wombat/processing/parser.rb:47:in `parser_for'
from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/wombat-2.3.0/lib/wombat/processing/parser.rb:33:in `parse'
from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/wombat-2.3.0/lib/wombat/crawler.rb:30:in `crawl'
from websites/net-a-porter/link_crawler.rb:78:in `<main>'
And here is my code:
class LinksCrawler
  include Wombat::Crawler

  base_url website_base_url
  path category_path
  links({:xpath => '//div[@class="description"]/a[contains(@href, "product")]/@href'}, :list)
end

link_crawler = LinksCrawler.new
link_crawler.crawl
I don't want to share the exact URL for security reasons, but I can tell you that it definitely works if you paste it into the browser.
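One detail in the backtrace may help narrow this down without sharing the URL: the failing `fetch` is called from `response_redirect`, which suggests that the first request was redirected and the 400 came back from the redirect target. A minimal sketch for checking that, using only Ruby's standard library, is below. The helper name `trace_redirects` and its `fetch:` parameter are hypothetical, not part of Wombat or Mechanize. Net::HTTP also sends different headers than a browser or Mechanize, so the statuses it sees may not match exactly.

```ruby
require 'net/http'
require 'uri'

# Follows redirects by hand and records [url, status_code] for every hop.
# `fetch` is a hypothetical injection point (not a Wombat or Mechanize API);
# by default it performs a plain GET with Net::HTTP.
def trace_redirects(url, limit: 5, fetch: ->(uri) { Net::HTTP.get_response(uri) })
  uri = URI(url)
  hops = []
  limit.times do
    response = fetch.call(uri)
    hops << [uri.to_s, response.code]
    location = response['location']
    # Stop as soon as the response is not a redirect with a Location header.
    break unless response.is_a?(Net::HTTPRedirection) && location
    # Resolve relative Location values against the current URL.
    uri = uri.merge(location)
  end
  hops
end

# Example usage (replace the placeholder with the real URL):
# trace_redirects('THE_WEBSITE_URL').each { |hop_url, code| puts "#{code} #{hop_url}" }
```

If the last hop in the list is the one returning 400, comparing that URL with what the browser ends up on (and which cookies or headers the browser sends along) would be a reasonable next step.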