headless-chrome-crawler

Distributed crawler powered by Headless Chrome

Results 33 headless-chrome-crawler issues

I want to create a general-purpose crawler with this project. By general purpose I mean: if the URL leads to a PDF, I want it to render the PDF,...

bug

**What is the current behavior?** It looks like the crawler doesn't call `preRequest` for links with an empty `href`. **If the current behavior is a bug, please provide the steps to reproduce** ```...

I ran into an error on a page containing frames where, for some reason, `document` was being passed in as null. This check fixed the error.

**What is the current behavior?** The `preRequest` function drops a lot of links when URLs are filtered with a regexp. **If the current behavior is a bug, please provide the steps to...

**Background** LOVE this project! I tried to write my own BaseCache instance to use LevelDB and have some general feedback. **What is the current behavior?** The difference between `get(key)`, `set(key,...

chore

**What is the current behavior?** The `robots.txt` is re-parsed for every request, but those files can be big. Today Google only reads the first [500 KB](https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?csw=1#file-format) and ignores the rest....

feature
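One way to address the `robots.txt` issue above is to parse the file once per origin and reuse the result. The following is only a minimal sketch under stated assumptions: `RobotsCache`, `MAX_ROBOTS_BYTES`, `parse`, and `fetchText` are hypothetical names introduced for illustration and are not part of headless-chrome-crawler.

```javascript
// Hypothetical sketch: names here are illustrative, not headless-chrome-crawler API.
const MAX_ROBOTS_BYTES = 500 * 1024; // size cap mentioned in the issue

class RobotsCache {
  // `parse` is an assumed function: (robotsUrl, text) => parsed rules object.
  constructor(parse) {
    this._parse = parse;
    this._byOrigin = new Map(); // origin -> Promise of parsed rules
  }

  // Returns the parsed rules for the origin of `pageUrl`, parsing at most once per origin.
  // `fetchText` is an assumed async function: (robotsUrl) => robots.txt body as a string.
  get(pageUrl, fetchText) {
    const origin = new URL(pageUrl).origin;
    if (!this._byOrigin.has(origin)) {
      const robotsUrl = `${origin}/robots.txt`;
      // Cache the promise so concurrent callers share one fetch and one parse.
      const pending = fetchText(robotsUrl).then(text => {
        // Truncate by bytes; a multi-byte character split at the cut becomes U+FFFD.
        const truncated = Buffer.from(text, 'utf8')
          .subarray(0, MAX_ROBOTS_BYTES)
          .toString('utf8');
        return this._parse(robotsUrl, truncated);
      });
      // Do not keep failed fetches in the cache, so a later call can retry.
      pending.catch(() => this._byOrigin.delete(origin));
      this._byOrigin.set(origin, pending);
    }
    return this._byOrigin.get(origin);
  }
}
```

A caller would `await cache.get(url, fetchText)` before each request. The cache here lives in one process and never expires entries; a distributed setup would likely need a shared store and a time-to-live.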

**What is the current behavior?** The `Crawl-Delay` is ignored. **What is the expected behavior?** The `Crawl-Delay` should be honored; it can be retrieved using `getCrawlDelay()` on the robots parser. **What...

feature
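To illustrate what honoring `Crawl-Delay` could look like, here is a rough sketch of a per-host throttle. It assumes the delay in seconds has already been obtained (for example from the `getCrawlDelay()` call named in the issue); `HostThrottle` and `reserve` are hypothetical names, not existing crawler API.

```javascript
// Hypothetical sketch: tracks the earliest time each host may be requested again.
class HostThrottle {
  constructor() {
    this._nextAllowed = new Map(); // host -> timestamp in milliseconds
  }

  // Reserves the next request slot for the URL's host and returns how many
  // milliseconds the caller should wait before sending the request.
  // `crawlDelaySeconds` is the value reported by a robots parser (undefined if absent).
  // `now` is injectable so the logic can be checked without real timers.
  reserve(url, crawlDelaySeconds, now = Date.now()) {
    const host = new URL(url).host;
    const delayMs = (crawlDelaySeconds || 0) * 1000;
    const startAt = Math.max(now, this._nextAllowed.get(host) || 0);
    this._nextAllowed.set(host, startAt + delayMs);
    return startAt - now;
  }
}
```

A caller could wait with `await new Promise(resolve => setTimeout(resolve, waitMs))` before issuing the request. Because the map lives in a single process, this sketch does not coordinate delays across multiple distributed workers.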

**What is the current behavior?** I don't believe the crawler handles sitemaps that are split into multiple sitemaps. This is common on large sites since sitemaps are limited to 50k...

feature
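The split-sitemap case described above usually means a sitemap index file whose `<loc>` entries point at further sitemaps. A minimal sketch of following such an index is shown below; it uses a regex rather than a real XML parser (so CDATA sections and most entities are not handled), and `readSitemap`, `collectUrls`, and `fetchText` are hypothetical names introduced for illustration.

```javascript
// Hypothetical sketch: a regex-based reader, not a full XML parser.
function readSitemap(xml) {
  const isIndex = /<sitemapindex[\s>]/i.test(xml);
  const locs = [];
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  let match;
  while ((match = locPattern.exec(xml)) !== null) {
    // Only the `&amp;` entity is decoded here.
    locs.push(match[1].replace(/&amp;/g, '&'));
  }
  return isIndex
    ? { type: 'index', sitemaps: locs }
    : { type: 'urlset', urls: locs };
}

// Follows nested sitemap indexes and returns all page URLs found.
// `fetchText` is an assumed async function: (url) => XML body as a string.
async function collectUrls(sitemapUrl, fetchText, seen = new Set()) {
  if (seen.has(sitemapUrl)) return []; // guard against index loops
  seen.add(sitemapUrl);
  const parsed = readSitemap(await fetchText(sitemapUrl));
  if (parsed.type === 'urlset') return parsed.urls;
  const urls = [];
  for (const child of parsed.sitemaps) {
    for (const url of await collectUrls(child, fetchText, seen)) {
      urls.push(url);
    }
  }
  return urls;
}
```

Child sitemaps are fetched one at a time to keep the sketch simple; a real crawler would probably queue them alongside ordinary requests.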

**What is the current behavior?** Today the project automatically resolves the `robots.txt`. **What is the expected behavior?** It would be useful to be able to provide the `robots.txt` instead to...

feature

**What is the current behavior?** `_collectLinks` only keeps the `href` of URLs. **What is the expected behavior?** It would be nice to also have, or be able to request: - The...

feature