Jan Čurn
Currently, pages that return a 5xx status are not considered failed and thus are not retried; Cheerio Scraper, on the other hand, does retry them. We should probably treat 5xx errors as failures and retry them.
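A minimal sketch of what the fix could look like. The helper names (`shouldRetryResponse`, `checkResponse`) are made up for illustration; in the crawler, throwing from the page handler is what marks the request as failed and triggers the built-in retry logic.

```javascript
// Hypothetical helper: decide whether a response status warrants a retry.
function shouldRetryResponse(statusCode) {
  // Retry on any server error (500-599); client errors (4xx) are usually
  // permanent, so retrying them would just waste requests.
  return statusCode >= 500 && statusCode < 600;
}

// Example of how it might be wired into a page handler: throwing makes
// the crawler count the request as failed and schedule a retry.
function checkResponse(statusCode, url) {
  if (shouldRetryResponse(statusCode)) {
    throw new Error(`Request to ${url} failed with status ${statusCode}`);
  }
}
```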
Most login pages have just a username/email field, a password field, and a submit button. We could write a function like `Apify.utils.puppeteer.login()` that would try to find these fields, fill them with the provided values and...
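A sketch of the field-detection heuristic such a helper could use. Everything here is hypothetical: in the real helper the descriptors would come from something like `page.$$eval('input', ...)` and the matched fields would then be filled via `page.type()`; this version only classifies plain descriptor objects.

```javascript
// Hypothetical heuristic: given descriptors of the page's <input> elements,
// pick the ones that look like the username/email and password fields.
function classifyLoginFields(inputs) {
  const result = { username: null, password: null };
  for (const input of inputs) {
    const hint = `${input.type || ''} ${input.name || ''}`.toLowerCase();
    if (!result.password && input.type === 'password') {
      result.password = input;
    } else if (
      !result.username &&
      (input.type === 'email' || /user|email|login/.test(hint))
    ) {
      result.username = input;
    }
  }
  return result;
}
```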
It could be called `Apify.utils.puppeteer.extractMicrodata` and look something like this: https://help.apify.com/en/articles/6988663-scraping-data-from-websites-using-schema-org-microdata but ideally, it wouldn't use jQuery.
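A jQuery-free sketch of the core extraction step. The function only relies on a few DOM methods (`getAttribute`, `querySelectorAll`, `textContent`), so it could run inside `page.evaluate()`; it deliberately ignores nested `itemscope` elements to stay short, which the real helper would have to handle.

```javascript
// Hypothetical sketch: extract schema.org microdata from one itemscope
// element using plain DOM APIs instead of jQuery.
function extractItem(scopeEl) {
  const item = {
    type: scopeEl.getAttribute('itemtype') || null,
    properties: {},
  };
  for (const el of scopeEl.querySelectorAll('[itemprop]')) {
    const name = el.getAttribute('itemprop');
    // Prefer the content attribute (used e.g. on <meta> tags),
    // falling back to the element's visible text.
    const value = el.getAttribute('content') || el.textContent.trim();
    item.properties[name] = value;
  }
  return item;
}
```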
It doesn't work with local directory storage. Also, the documentation of this and related functions is not great, e.g. how does the format work together with the `Dataset.forEach` function?
This shouldn't cause any problems and can greatly improve performance. See the TODO at https://github.com/apifytech/apify-js/blob/master/src/request_queue.js#L276
Basically, Puppeteer can only take screenshots with a width or height of at most 16,384 px (this is a hard-coded Chrome limit, see https://github.com/GoogleChrome/puppeteer/issues/359). However, for one customer project, we need screenshots of...
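One way around the limit would be stitching: split the full page height into horizontal strips that each stay under the Chrome maximum, take one clipped screenshot per strip, and join the images afterwards. Only the strip computation is sketched below; the actual `page.screenshot({ clip })` calls and the image stitching are left out.

```javascript
// Hard-coded Chrome limit on a single screenshot dimension, in pixels.
const MAX_CHROME_DIMENSION = 16384;

// Compute clip rectangles covering a tall page, each at most maxHeight px
// high, suitable for passing to page.screenshot({ clip }) one by one.
function computeScreenshotClips(pageWidth, pageHeight, maxHeight = MAX_CHROME_DIMENSION) {
  const clips = [];
  for (let y = 0; y < pageHeight; y += maxHeight) {
    clips.push({
      x: 0,
      y,
      width: pageWidth,
      height: Math.min(maxHeight, pageHeight - y),
    });
  }
  return clips;
}
```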
Cheerio is quite CPU-intensive, so at higher crawler concurrency the CPU chokes. We should explore whether it's possible to run the Cheerio download and parsing in a separate...
This will be similar to `handledRequestsCount`, but it will indicate how many requests are yet to be processed. The users of `BasicCrawler` can then use this field to determine whether...
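A hypothetical sketch of how such a pending count could be derived from counters the queue already tracks. The class and method names are made up; a real implementation would live inside the request queue itself.

```javascript
// Hypothetical counter: pending = total enqueued minus handled.
class RequestCounter {
  constructor() {
    this.totalRequestsCount = 0;
    this.handledRequestsCount = 0;
  }
  addRequest() {
    this.totalRequestsCount += 1;
  }
  markHandled() {
    this.handledRequestsCount += 1;
  }
  // Requests that are enqueued but not yet processed.
  get pendingRequestsCount() {
    return this.totalRequestsCount - this.handledRequestsCount;
  }
}
```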
It only supports WebSockets via the HTTP CONNECT method (used e.g. with SSL). The unit tests (`testWsCall()`) only test for that too. We should add full support for the HTTP UPGRADE...
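A small sketch of the request classification a proxy needs for full WebSocket support: distinguishing plain-HTTP connections upgraded via the `Upgrade`/`Connection` headers from CONNECT tunnels. Header handling is simplified; a real proxy would hook Node's `http.Server` `'connect'` and `'upgrade'` events instead of inspecting headers by hand.

```javascript
// Hypothetical predicate: does this request initiate a WebSocket upgrade
// (RFC 6455 handshake) rather than a CONNECT tunnel or ordinary request?
function isWebSocketUpgrade(method, headers) {
  const connection = (headers.connection || '').toLowerCase();
  const upgrade = (headers.upgrade || '').toLowerCase();
  return (
    method === 'GET' &&
    // Connection may be a list, e.g. "keep-alive, Upgrade".
    connection.split(/,\s*/).includes('upgrade') &&
    upgrade === 'websocket'
  );
}
```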
It would be awesome if there was an option to keep the internal properties of the HAR entries, such as `__requestId`. This would allow us to make extensions to the...