headless-chrome-crawler issues

Question: Handling of pdf files

1

I want to create a general-purpose crawler with this project. By general purpose I mean: if the URL leads to a PDF, I want it to render the PDF,...

eladbitton

bug

Are links with empty href ignored? (button links handled by page js)

1

**What is the current behavior?** Looks like crawler doesn't call preRequest for links with empty href? **If the current behavior is a bug, please provide the steps to reproduce** ```...

YuriGor

Check for a non-null document before looking for links

I ran into an error on a page containing frames where, for some reason, `document` was being passed in as null. This check fixed the error.

dancodes
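
The null check described in this pull request can be illustrated with a minimal sketch. `collectHrefs` is a hypothetical helper written for this example, not the project's actual link-collection code, and it assumes the page hands over a DOM-like object exposing `querySelectorAll`.

```javascript
// Hypothetical helper, not the project's real implementation: returns the
// href of every anchor in a document, or an empty list when the document is
// missing (as reported for some pages containing frames).
function collectHrefs(doc) {
  if (!doc) return []; // guard: nothing to scan when the document is null
  return Array.from(doc.querySelectorAll('a[href]'), (a) => a.href);
}
```

Returning an empty list instead of throwing lets the crawl continue with the remaining pages.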

preRequest function cuts the entire branch instead of a single page

**What is the current behavior?** The preRequest function cuts off a lot of links when filtering URLs by regexp **If the current behavior is a bug, please provide the steps to...

AleksandrBorkun

Suggestion: BaseCache api is confusing and not efficient.

2

**Background** LOVE this project! I tried to write my own BaseCache instance to use LevelDB and have some general feedback. **What is the current behavior?** The difference between `get(key)`, `set(key,...

tomnielsen

chore

Suggestion: robots.txt shouldn't be reparsed every time

**What is the current behavior?** The `robots.txt` is re-parsed for every request, but those files can be big. Today Google only reads the first [500 Kb](https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?csw=1#file-format) and ignores the rest....

panthony

feature
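
One way to avoid the repeated parsing described in this suggestion is to memoize the parsed result per origin. The sketch below is only an illustration: `RobotsCache`, its `get` method, and the injected `parse` function are assumptions made for this example and are not part of headless-chrome-crawler. The size cap counts string characters, which only approximates the byte limit mentioned in the issue.

```javascript
// Sketch: parse each origin's robots.txt once and reuse the parsed object.
// `parse` is a hypothetical injected function (for example a thin wrapper
// around a robots.txt parsing library).
class RobotsCache {
  constructor(parse, maxChars = 500 * 1024) {
    this.parse = parse;
    this.maxChars = maxChars; // characters, an approximation of a byte cap
    this.parsed = new Map(); // origin -> parsed robots.txt
  }

  // Returns the parsed robots.txt for the origin of `pageUrl`.
  // `robotsTxt` is the raw text; it is only parsed on the first call.
  get(pageUrl, robotsTxt) {
    const origin = new URL(pageUrl).origin;
    if (!this.parsed.has(origin)) {
      const text = String(robotsTxt).slice(0, this.maxChars);
      this.parsed.set(origin, this.parse(text));
    }
    return this.parsed.get(origin);
  }
}
```

Keying on the origin rather than the full URL means every page of a site shares one parsed object.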

Crawler should honor the Crawl-Delay if obeyRobotsTxt:true

2

**What is the current behavior?** The `Crawl-Delay` is ignored. **What is the expected behavior?** The `Crawl-Delay` should be honored; it can be retrieved using `getCrawlDelay()` on the robots parser. **What...

panthony

feature
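
A rough sketch of how a crawl delay could be enforced per host, assuming the delay in seconds has already been read from the robots parser (the issue mentions `getCrawlDelay()`). `HostThrottle` and `reserve` are hypothetical names invented for this example; the crawler's real request queue may work quite differently.

```javascript
// Sketch: space out requests to the same host by a crawl delay in seconds.
class HostThrottle {
  constructor() {
    this.nextAllowed = new Map(); // host -> earliest timestamp (ms) for the next request
  }

  // Returns how many milliseconds the caller should wait before requesting
  // `url`, and reserves that slot so later callers queue up behind it.
  // A missing or invalid delay is treated as zero.
  reserve(url, delaySeconds, now = Date.now()) {
    const host = new URL(url).host;
    const delayMs = Math.max(0, Number(delaySeconds) || 0) * 1000;
    const start = Math.max(now, this.nextAllowed.get(host) || 0);
    this.nextAllowed.set(host, start + delayMs);
    return start - now;
  }
}
```

A caller would wait for the returned number of milliseconds (for instance with `setTimeout`) before issuing the request. Hosts are tracked independently, so a slow site does not hold back the others.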

[Feature Request] Add support for multiple sitemaps

6

**What is the current behavior?** I don't believe the crawler is handling sitemaps broken out into multiple sitemaps. This is common in large sites since sitemaps are limited to 50k...

NickStees

feature
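
To illustrate the request, here is a minimal sketch of walking a sitemap index and gathering page URLs from its child sitemaps. Everything in it is an assumption made for the example: `parseSitemap`, `collectPageUrls`, and the synchronous `loadXml` callback are hypothetical, and the regex-based extraction ignores CDATA sections, XML entities, and compressed sitemaps.

```javascript
// Sketch: extract <loc> values and detect whether the file is a sitemap index.
function parseSitemap(xml) {
  const isIndex = /<sitemapindex[\s>]/i.test(xml);
  const locs = [];
  const re = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  let match;
  while ((match = re.exec(xml)) !== null) locs.push(match[1]);
  return { isIndex, locs };
}

// Recursively follows sitemap indexes and returns the page URLs found.
// `loadXml(url)` is a hypothetical callback returning the XML text for a URL;
// `seen` prevents loops when sitemaps reference each other.
function collectPageUrls(rootUrl, loadXml, seen = new Set()) {
  if (seen.has(rootUrl)) return [];
  seen.add(rootUrl);
  const { isIndex, locs } = parseSitemap(loadXml(rootUrl));
  if (!isIndex) return locs;
  const pages = [];
  for (const child of locs) {
    pages.push(...collectPageUrls(child, loadXml, seen));
  }
  return pages;
}
```

In a real crawler the loader would be asynchronous and a proper XML parser would be preferable to a regex.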

You should be able to provide the robots.txt

**What is the current behavior?** Today the project automatically resolves the robots.txt. **What is the expected behavior?** It would be useful to be able to provide the robots.txt instead to...

panthony

feature

Suggestion: Collect links should be extendable and/or have more infos than their URL

1

**What is the current behavior?** `_collectLinks` only keeps the href of URLs. **What is the expected behavior?** Would be nice to have, or be able to request also: - The...

panthony

feature
