headless-chrome-crawler

Crawl stops on non-www URLs

Open cosmiXs opened this issue 6 years ago • 5 comments

What is the current behavior? If I specify a domain with "www.", e.g. "http://www.domainname.com/", but the server's preferred domain setting is without "www.", the crawling process stops.

The reverse unfortunately also holds: if the domain uses "www." but I specify it without, e.g. "http://domainname.com/", the crawling also stops.

If the current behavior is a bug, please provide the steps to reproduce

What is the expected behavior? Normally I expect it to recognize the domain name without "www"

What is the motivation / use case for changing the behavior?

Please tell us about your environment:

  • Version: latest
  • Platform / OS version: mac OS X
  • Node.js version: latest

cosmiXs avatar Jan 09 '19 17:01 cosmiXs

Sorry, but I can't understand what you mean. What are you really trying to do? If a site doesn't use www on its domain, you have no reason to crawl it using www. I don't think that's a bug. Please provide more details.

matheuschimelli avatar Jan 20 '19 19:01 matheuschimelli

I do not know in advance whether a domain explicitly requires www. or non-www. I have a list of domains that I want to crawl; I've placed them in a file and I'm reading them from there. By default I'm putting www. in front of all the domains, but when the crawler reaches a domain that explicitly does not use www. (as forced by the server's preferred domain setting), the crawler only accesses the home page and then exits.

cosmiXs avatar Jan 24 '19 08:01 cosmiXs

Hey cosmiXs, I made an npm package to fix this. It's called 'redirect-chain'. You give it your entry point URL and it gives you the domain redirect chain. Then use that array as allowedDomains.

https://www.npmjs.com/package/redirect-chain

simlevesque avatar Mar 07 '19 21:03 simlevesque
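For readers who only need to cover the www / non-www case discussed above, a minimal sketch follows. It assumes that listing both hostname variants in allowedDomains is enough for the crawler to keep following links after such a redirect, which this thread does not confirm. The helper name wwwVariants is hypothetical and is not part of headless-chrome-crawler or redirect-chain.

```javascript
// Hypothetical helper (not part of headless-chrome-crawler or redirect-chain):
// returns the hostname of `url` both without and with a leading "www.".
function wwwVariants(url) {
  // The WHATWG URL class is a global in Node.js; it lowercases the hostname.
  const { hostname } = new URL(url);
  const bare = hostname.replace(/^www\./, '');
  return [bare, `www.${bare}`];
}

// Assumed usage, mirroring the allowedDomains option mentioned in this thread:
//   await crawler.queue({ url, allowedDomains: wwwVariants(url) });
console.log(wwwVariants('http://www.domainname.com/'));
```

Unlike following the actual redirect chain, this only guesses the two most common hostname variants, so it would not help when a site redirects to a different domain altogether.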

I'm having the same problem. I get a timeout when visiting a non-www URL. I can visit it in the browser just fine.

I tried using @simlevesque's solution but I still get the same problem.

await crawler.queue({
   url,
   allowedDomains: await redirectChain.domains(url),
});

Still no luck. I'm getting an Error: Navigation Timeout Exceeded: 30000ms exceeded

yvhr avatar Mar 29 '19 03:03 yvhr

@cosmiXs @vycoder could you provide a full code example to reproduce the issue?

kulikalov avatar Oct 17 '20 07:10 kulikalov