crawlee-python issues

Final doc & readme polishing

These items are currently blocked but should be resolved before the public launch (8. 7.).

### TODO

- [x] Replace all occurrences of `apify.github.io/crawlee-python` with `crawlee.dev/python` in `README.md` once the...

vdusek

documentation

t-tooling

Add doc section guides

- Only markdown content.
- Inspiration: https://crawlee.dev/docs/guides.
- Some content from the old readme could be copied in: https://github.com/apify/crawlee-python/blob/v0.0.7/README.md.

vdusek

documentation

t-tooling

Adaptive playwright crawler

janbuchar

enhancement

t-tooling

Sitemap-based request provider

Similar to what we're implementing in JS Crawlee.
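A minimal sketch of the core idea, assuming a standard sitemap in the sitemaps.org 0.9 format. The function name `iter_sitemap_urls` and the sample document are hypothetical and are not part of Crawlee's API; fetching the sitemap, following sitemap index files, and feeding the URLs into a request queue are left out.

```python
import xml.etree.ElementTree as ET
from typing import Iterator

# Namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def iter_sitemap_urls(xml_text: str) -> Iterator[str]:
    """Yield every non-empty <loc> URL from a <urlset> sitemap document.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            yield loc.text.strip()


# Made-up sample sitemap used only to exercise the helper.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/about </loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in iter_sitemap_urls(SAMPLE):
        print(url)
```

Yielding URLs lazily means a provider built on this could hand requests to the crawler one at a time instead of loading the whole sitemap into the queue up front.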

janbuchar

enhancement

t-tooling

fix: byte size serialization in MemoryInfo

janbuchar

t-tooling

adhoc

BasicCrawler status logging

- configurable interval
- configurable status message callback (constructor parameter, property or decorator?)
- we periodically set the crawler status via storage client
- in javascript crawlee, this does nothing...
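One way the first two points could look, as a rough sketch using only `asyncio`. The `StatusLogger` class, its parameters, and the shape of the stats dictionary are all hypothetical and do not reflect the actual `BasicCrawler` design; here the callback is passed as a constructor parameter, which is only one of the options listed above. Setting the status via the storage client is not covered.

```python
import asyncio
from typing import Callable, Optional


def default_status_message(stats: dict) -> str:
    """Default message format (hypothetical stats shape)."""
    return f"Crawled {stats['finished']}/{stats['total']} requests"


class StatusLogger:
    """Hypothetical helper that periodically emits a status message."""

    def __init__(
        self,
        get_stats: Callable[[], dict],
        interval_secs: float = 10.0,
        message_callback: Optional[Callable[[dict], str]] = None,
        emit: Callable[[str], None] = print,
    ) -> None:
        self._get_stats = get_stats
        self._interval_secs = interval_secs
        self._message_callback = message_callback or default_status_message
        self._emit = emit

    async def run(self, stop_event: asyncio.Event) -> None:
        """Emit a status message every interval until stop_event is set."""
        while not stop_event.is_set():
            self._emit(self._message_callback(self._get_stats()))
            try:
                # Sleep for one interval, but wake up early when stopped.
                await asyncio.wait_for(stop_event.wait(), timeout=self._interval_secs)
            except asyncio.TimeoutError:
                pass


async def demo() -> list:
    """Simulate a short crawl and collect the emitted messages."""
    stats = {"finished": 0, "total": 3}
    messages: list = []
    stop_event = asyncio.Event()
    logger = StatusLogger(lambda: dict(stats), interval_secs=0.01, emit=messages.append)
    task = asyncio.create_task(logger.run(stop_event))
    for _ in range(3):
        await asyncio.sleep(0.02)  # stand-in for handling one request
        stats["finished"] += 1
    stop_event.set()
    await task
    return messages


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Waiting on the stop event with a timeout, rather than a plain `asyncio.sleep`, lets the logger shut down promptly when the crawl finishes instead of waiting out the remaining interval.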

janbuchar

t-tooling

fix: request order on resumed crawl

This change makes https://github.com/apify/apify-sdk-python/blob/162ce1080d024fe2cf399534e8f960a584524232/tests/unit/actor/test_actor_memory_storage_e2e.py#L54 pass again. The PR is a draft; it exists mostly so that I don't lose or forget this.

janbuchar

t-tooling

adhoc

Implement `Snapshotter._snapshot_client()`

1

- Currently, there is only a dummy version of `Snapshotter._snapshot_client()` without a real measurement.
- Once `StorageClient` is implemented, use it there to measure the real values.
- Check TypeScript...

vdusek

enhancement

t-tooling

Request fetching from `RequestQueue` is sometimes very slow

1

- Fetching requests from `RequestQueue` is sometimes very slow and can get stuck for a while.
- I turned on logging and reproduced the issue with the following code: ```python...

vdusek

bug

t-tooling

Check for selectors that indicate blocking in `PlaywrightCrawler`

https://github.com/apify/crawlee-python/blob/896501edb44f801409fec95cb3e5f2bcfcb4188d/src/crawlee/beautifulsoup_crawler/beautifulsoup_crawler.py#L86 can be used as a reference.

janbuchar

enhancement

t-tooling
