headless-chrome-crawler issues

Get current URL in customCrawl()

3

**What is the current behavior?** No information about the current URL is available in customCrawl(). **What is the motivation / use case for changing the behavior?** I want to skip the request, but add...

popstas

feature

Proxy not working --proxy-server

Hello, Puppeteer supports proxies, but with Headless Chrome Crawler the proxy doesn't work.
```
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    args: ['--ignore-certificate-errors', '--proxy-server=127.0.0.1:8080', '--no-sandbox'],
...
```

byposeidon

Crawling site with maxDepth > 2 causes hang

3

I'm crawling a small site with maxDepth === 2, and things crawl fine. As soon as I up it to 3 or more, the crawler hangs. I don't see onError...

TheTFo

Pages with 403 errors not throwing errors

1

**What is the current behavior?** When you crawl a page that throws a 403 unauthorized error the crawler just hangs and stays there indefinitely. It ignores all timeouts and doesn't...

mrispoli24

feature

Queueing same url on multiple workers in cluster with Redis cache results in duplicates

4

**What is the current behavior?** Using a Redis cache for the queue and a cluster of processes crawling, the crawler is repeating requests. **If the current behavior is a bug,...

lioreshai

Duplicated URLs are crawled twice

6

**What is the current behavior?** Duplicated URLs are not skipped. The same URL is crawled twice. **If the current behavior is a bug, please provide the steps to reproduce** ```...

Minyar2004

bug
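
For context, here is a minimal sketch of the kind of de-duplication the report above expects, assuming URLs are compared after light normalization. `normalizeUrl` and `SeenUrls` are hypothetical names made up for illustration; they are not part of headless-chrome-crawler, and the library's own cache handling may work differently.

```javascript
// Hypothetical helper: reduce a URL to a canonical key for comparison.
function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = '';            // fragments do not change the fetched page
  url.searchParams.sort();  // make query parameter order irrelevant
  return url.toString();
}

// Hypothetical in-memory record of URLs that were already queued.
class SeenUrls {
  constructor() {
    this.keys = new Set();
  }

  // Returns true the first time a URL is offered, false for repeats.
  add(rawUrl) {
    const key = normalizeUrl(rawUrl);
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }
}
```

A caller would queue a URL only when `add()` returns true. With several worker processes, the set would have to live in a shared store and the check-and-add would have to be atomic, which this in-memory sketch does not attempt.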

How can I make customCrawl click on specific elements?

1

I want to make my customCrawl click on elements. They don't have an href, but a JS onclick event. Is this possible, and how and where in the code can...

michaelpapesch
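
One possible approach, sketched under assumptions: customCrawl receives the Puppeteer `page` and a `crawl` function, so an element can be clicked through Puppeteer once the default crawl has loaded the page. `makeClickingCustomCrawl` and the selector passed to it are hypothetical; `page.$()`, `ElementHandle.click()`, and `page.content()` are standard Puppeteer calls.

```javascript
// Hypothetical factory: builds a customCrawl callback that clicks one element.
// `page` is the Puppeteer Page; `crawl` runs the crawler's default crawl.
function makeClickingCustomCrawl(selector) {
  return async (page, crawl) => {
    const result = await crawl();           // load the page first
    const target = await page.$(selector);  // resolves to null if nothing matches
    if (target) {
      await target.click();                 // fires the element's onclick handler
      result.content = await page.content(); // capture the HTML after the click
    }
    return result;
  };
}
```

It would be passed as `customCrawl: makeClickingCustomCrawl('.some-button')` in the launch options. If the click triggers asynchronous work, an explicit wait before reading the content is probably needed.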

Is there a way to scroll?

4

**What is the current behavior?** No documented way of scrolling **What is the expected behavior?** Being able to scroll **What is the motivation / use case for changing the behavior?**...

wemow
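
A possible workaround, sketched under assumptions: since customCrawl exposes the Puppeteer `page`, scrolling can be done with `page.evaluate()`. `scrollDown` and `scrollingCustomCrawl` are hypothetical names, and the step count and pause are arbitrary values chosen for illustration.

```javascript
// Hypothetical helper: scroll a Puppeteer page down one viewport at a time.
async function scrollDown(page, steps = 5, pauseMs = 250) {
  for (let i = 0; i < steps; i += 1) {
    // This callback runs in the browser context.
    await page.evaluate(() => window.scrollBy(0, window.innerHeight));
    // Give lazy-loaded content a moment to arrive.
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
  }
}

// Sketch of a customCrawl callback that scrolls after the default crawl.
async function scrollingCustomCrawl(page, crawl) {
  const result = await crawl();
  await scrollDown(page);
  result.content = await page.content(); // HTML as it stands after scrolling
  return result;
}
```

Anything the default crawl collected before the scroll is presumably not refreshed by this, so content that appears only after scrolling would have to be read from the page explicitly.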

I get a JSHandle@node string instead of an ElementHandle object

**What is the current behavior?** `page.$$()` method just returns a "**JSHandle@node**" string instead of an **ElementHandle** object. **If the current behavior is a bug, please provide the steps to reproduce**...

jriffs

subdomain crawl with "allowedDomains" parameter crawls top domain, too

For the domain "test.domain.com", result.response.url includes URLs from "domain.com", too. I tried it with the subdomain name and with a regexp. I don't understand why. Shouldn't the "allowedDomains" parameter prevent scanning from URLs...

michaelpapesch
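
To make the expectation in this report concrete, here is a sketch of a strict host check. `isAllowedHost` is a hypothetical function and not the library's implementation, which may match domains differently. It also shows one plausible pitfall: an unanchored regular expression for the parent domain matches the top domain as well as the subdomain.

```javascript
// Hypothetical check: a URL passes only if its hostname equals a listed
// string exactly or matches a listed RegExp.
function isAllowedHost(rawUrl, allowedDomains) {
  const { hostname } = new URL(rawUrl);
  return allowedDomains.some((domain) =>
    (domain instanceof RegExp ? domain.test(hostname) : domain === hostname));
}

// Exact string: only the subdomain passes.
const strict = ['test.domain.com'];
// Unanchored pattern: also matches "domain.com" itself.
const loose = [/domain\.com/];
// Anchored pattern: behaves like the exact string.
const anchored = [/^test\.domain\.com$/];
```

Under this sketch, `isAllowedHost('https://domain.com/', loose)` is true while the same call with `strict` or `anchored` is false.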
