headless-chrome-crawler
Crawl stops on non-www URLs
What is the current behavior? If I specify a domain with "www.", e.g. "http://www.domainname.com/", but the server's preferred-domain setting is without "www.", the crawling process stops.
Unfortunately the reverse also holds: if the domain uses "www." but I specify it without, e.g. "http://domainname.com/", the crawling also stops.
If the current behavior is a bug, please provide the steps to reproduce
What is the expected behavior? I expect the crawler to recognize the domain name with or without "www."
What is the motivation / use case for changing the behavior?
Please tell us about your environment:
- Version: latest
- Platform / OS version: mac OS X
- Node.js version: latest
Sorry, but I cannot understand what you are trying to say. What do you really want to do? If a site doesn't use www on its domain, you have no reason to crawl it using www. I don't think that's a bug. Please provide more details.
I do not know in advance whether a domain explicitly requires www. or non-www. I have a list of domains that I want to crawl; I've placed them in a file and I'm reading them from there. By default I put www. in front of every domain, but when the crawler reaches a domain that explicitly does not use www. (this is forced by the server's preferred-domain setting), the crawler only accesses the home page and then exits.
Hey cosmiXs, I made an npm package to fix this. It's called 'redirect-chain'. You give it your entry-point URL and it gives you the domain redirect chain. Then use this array as allowedDomains.
https://www.npmjs.com/package/redirect-chain
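A rough sketch of the idea behind this suggestion, not the package's actual implementation: given the URLs a request is redirected through, collect the unique hostnames and use them as allowedDomains. The helper name hostnamesFromChain and the example chain are hypothetical, and the chain is hard-coded so the snippet runs without network access.

```javascript
// Hypothetical helper (not part of redirect-chain or headless-chrome-crawler):
// collect the unique hostnames that appear in a list of redirect URLs.
function hostnamesFromChain(urls) {
  const seen = new Set();
  for (const u of urls) {
    // The WHATWG URL class is a global in current Node.js versions.
    seen.add(new URL(u).hostname);
  }
  return [...seen];
}

// Assumed example chain: a www entry point that redirects to the bare domain.
const chain = [
  'http://www.domainname.com/',
  'http://domainname.com/',
  'https://domainname.com/',
];

const allowedDomains = hostnamesFromChain(chain);
console.log(allowedDomains);
```

The resulting array covers both the www and non-www hosts, so it could be passed as allowedDomains when queueing the entry-point URL.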
I'm having the same problem. I get a timeout when visiting a non-www URL. I can visit it in the browser just fine.
I tried using @simlevesque's solution but I still get the same problem.
await crawler.queue({
  url,
  allowedDomains: await redirectChain.domains(url),
});
Still no luck. I'm getting this error: Error: Navigation Timeout Exceeded: 30000ms exceeded
@cosmiXs @vycoder could you provide a full code example to reproduce the issue?