crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...
# Crawlee for Python Hacktoberfest 2024 [Starting Oct 1, 2024]

# Prizes 🏆

- 1-2 Accepted Pull Requests: Crawlee Exclusive Sticker Sheet.
- 2 or more Accepted...
This adds a unified `crawler` template. The original `playwright` and `beautifulsoup` templates are kept for compatibility with older versions of the CLI. The user is now prompted for package manager...
e.g. with an `--apify` flag - this should add SDK to requirements and activate the `Actor` context manager in the main function
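As a rough illustration of what such a flag could do, the sketch below renders a project's `requirements.txt` and `main.py` with or without Apify support. The function `render_project`, the placeholder `CRAWLER_BODY`, and the file layout are hypothetical and not part of the Crawlee CLI; only the `from apify import Actor` import and the `async with Actor:` idiom come from the Apify SDK.

```python
from textwrap import indent

# Hypothetical placeholder; a real template would build and run a crawler here.
CRAWLER_BODY = 'print("crawler would run here")'


def render_project(with_apify: bool) -> dict[str, str]:
    """Return hypothetical file contents for a generated project."""
    requirements = ["crawlee"]
    imports = ["import asyncio"]

    if with_apify:
        # The flag adds the SDK to the requirements...
        requirements.append("apify")
        imports.append("from apify import Actor")
        # ...and wraps the crawler body in the Actor context manager.
        body = "async with Actor:\n" + indent(CRAWLER_BODY, "    ")
    else:
        body = CRAWLER_BODY

    main_py = (
        "\n".join(imports)
        + "\n\n\nasync def main() -> None:\n"
        + indent(body, "    ")
        + '\n\n\nif __name__ == "__main__":\n    asyncio.run(main())\n'
    )
    return {
        "requirements.txt": "\n".join(requirements) + "\n",
        "main.py": main_py,
    }
```

Calling `render_project(True)["main.py"]` yields a `main` coroutine whose body runs inside `async with Actor:`, while `render_project(False)` leaves the Apify import and context manager out.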
More details in https://github.com/apify/crawlee-python/pull/466#issuecomment-2312331905
- https://github.com/giampaolo/psutil/issues/1011
- https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L53
- JS version has special cases for AWS Lambda and Docker
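A minimal sketch of how such special cases might look in Python, using only the standard library. It is not a port of the JS implementation: the function name `detect_memory_limit_bytes` and its injectable `env` and `read_text` parameters are assumptions made for illustration. It relies on the `AWS_LAMBDA_FUNCTION_MEMORY_SIZE` environment variable (in MB) that AWS Lambda sets, and on the standard cgroup v2 and v1 memory limit files that are typically visible inside Docker containers.

```python
import os
from pathlib import Path
from typing import Callable, Mapping, Optional

# Standard cgroup locations for the container memory limit.
CGROUP_V2_LIMIT = "/sys/fs/cgroup/memory.max"
CGROUP_V1_LIMIT = "/sys/fs/cgroup/memory/memory.limit_in_bytes"


def _read_text(path: str) -> Optional[str]:
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def detect_memory_limit_bytes(
    env: Mapping[str, str] = os.environ,
    read_text: Callable[[str], Optional[str]] = _read_text,
) -> Optional[int]:
    """Return an environment-imposed memory limit in bytes, or None if unknown."""
    # AWS Lambda exposes the configured memory size in megabytes.
    lambda_mb = env.get("AWS_LAMBDA_FUNCTION_MEMORY_SIZE")
    if lambda_mb and lambda_mb.isdigit():
        return int(lambda_mb) * 1024 * 1024

    # In containers, try the cgroup v2 file first, then cgroup v1.
    # cgroup v2 reports the literal string "max" when no limit is set.
    for path in (CGROUP_V2_LIMIT, CGROUP_V1_LIMIT):
        raw = read_text(path)
        if raw and raw.isdigit():
            return int(raw)

    # No special case applied; the caller should fall back to host memory.
    return None
```

When the function returns `None`, the caller could fall back to the host total, for example `psutil.virtual_memory().total`. Since cgroup v1 reports a very large number when no limit is set, taking the minimum of the detected limit and the host total is a reasonable safeguard.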
Using the `CurlImpersonateHttpClient` adds this warning message upon program start, and adding the command it asks for doesn't seem to fix it:
```
asyncio.set_event_loop_policy(WindowsSelectorEventLoopPolicy())
await crawler.run(["https://www.mtggoldfish.com/metagame/modern#paper"])...
```
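One possible cause, offered as an assumption rather than a diagnosis of this report: an event loop policy only affects loops created after it is set, so calling `asyncio.set_event_loop_policy` from inside a coroutine that is already running has no effect on the current loop. The sketch below sets the policy before `asyncio.run` creates the loop. The names `main` and `run` are illustrative, and the crawler itself is left out.

```python
import asyncio
import sys


async def main() -> str:
    # A real program would create the crawler and await crawler.run(...) here.
    loop = asyncio.get_running_loop()
    return type(loop).__name__


def run() -> str:
    # Set the policy BEFORE the loop is created; asyncio.run() creates the loop.
    # WindowsSelectorEventLoopPolicy only exists on Windows, hence the guard.
    if sys.platform == "win32":
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    return asyncio.run(main())


if __name__ == "__main__":
    print(run())
```

Printing the loop class name is a quick way to check which loop implementation is actually in use.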
- We could create a new documentation guide for all crawling-related features we provide.
- The guide should include the following:
  - `enqueue_links` helper function,
  - Crawling limitations and controls:...