crawl4ai
Bot issue
crawl4ai version
Crawl4AI 0.5.0.post8
Expected Behavior
Hi, I'm new to Crawl4AI and I'm facing some issues that need clarification.
I'm trying to scrape data from sites like PitchBook and CrunchBase, but I'm encountering human verification screens. As a result, I'm getting the verification page content instead of the actual page content unless I manually verify on an open browser tab.
My questions are:
- How can I bypass human verification or scrape website content without opening a browser tab?
- How can I scrape inner pages or multiple pages of a website?
- How can I deploy this with API calls or a similar approach?
Current Behavior
(crawl4ai-env) wiizbusiness@WiiZs-Laptop web-crawler % python test.py
[INIT].... → Crawl4AI 0.5.0.post8
[FETCH]... ↓ https://pitchbook.com/profiles/company/106751-98... | Status: True | Time: 1.61s
[SCRAPE].. ◆ https://pitchbook.com/profiles/company/106751-98... | Time: 0.004s
[COMPLETE] ● https://pitchbook.com/profiles/company/106751-98... | Status: True | Total: 1.61s
Was success: True
pitchbook.com
Verifying you are human. This may take a few seconds.
pitchbook.com needs to review the security of your connection before proceeding.
Verification successful
Waiting for pitchbook.com to respond...
Ray ID: 932b2267be43179a
Performance & security by Cloudflare
(crawl4ai-env) wiizbusiness@WiiZs-Laptop web-crawler %
Is this reproducible?
Yes
Inputs Causing the Bug
url= 'https://pitchbook.com/profiles/company/106751-98',
Steps to Reproduce
Code snippets
OS
macOS
Python version
Python 3.13.2
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response
@unclecode I am facing the same issue too.
@rashidwiizb There are multiple approaches you can try here. One is "identity-based" crawling: log in to your account in a real browser to build a human identity, then point the crawler at that "user profile data directory" so each new browser session reuses it, and start crawling. I have demonstrated this in one of my videos. We will also discuss it at tomorrow's meetup and release the video afterward, so please check it out.
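For readers who want to try the profile-reuse approach described in this comment, here is a minimal sketch. It assumes the `use_managed_browser` and `user_data_dir` options of `BrowserConfig` as described in the Crawl4AI identity-based crawling docs; the profile path and the helper names `profile_browser_kwargs` and `crawl_with_profile` are hypothetical, and the profile must first be created by logging in manually once.

```python
import asyncio


def profile_browser_kwargs(profile_dir: str, headless: bool = True) -> dict:
    """Build BrowserConfig keyword arguments that reuse a saved browser profile."""
    return {
        "browser_type": "chromium",
        "headless": headless,
        "use_managed_browser": True,   # attach to a persistent, profile-backed browser
        "user_data_dir": profile_dir,  # directory holding the logged-in session
    }


async def crawl_with_profile(url: str, profile_dir: str) -> str:
    """Crawl one URL with the saved profile; return markdown, or "" on failure."""
    # Imported lazily so the helper above works even without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, BrowserConfig

    browser_config = BrowserConfig(**profile_browser_kwargs(profile_dir))
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url)
        return str(result.markdown) if result.success else ""


# Example call (hypothetical profile path):
# text = asyncio.run(crawl_with_profile(
#     "https://pitchbook.com/profiles/company/106751-98",
#     "/path/to/my_chrome_profile",
# ))
```

Whether a reused profile is enough depends on the site; as the follow-up comment notes, it worked for PitchBook but not for Crunchbase.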
Hi @unclecode, I tried the approach above on PitchBook and it works. However, the same bot verification appears on Crunchbase, and there I can't get past it this way: the page keeps refreshing and still shows the bot verification screen.
Also, I need to handle this for all websites and for whatever kind of bot verification each site uses.
Reproducible URL:
$ crwl https://japanworld.it/en/preordini/25559-furyu-tenitol-spriggan-yu-ominae-4580736406933.html -o markdown
# japanworld.it
Verifying you are human. This may take a few seconds.
japanworld.it needs to review the security of your connection before proceeding.
Verification successful
Waiting for japanworld.it to respond...
Ray ID: `95c0626e9b9d0cea`
Performance & security by [Cloudflare](https://www.cloudflare.com?utm_source=challenge&utm_campaign=m)
Any suggestions about this issue?
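Because the crawl reports success even when it only captured the challenge page, it can help to check the returned text before trusting it. The following is a rough heuristic sketch, not part of Crawl4AI: the marker phrases are taken from the outputs quoted in this thread, and the function name `looks_like_challenge` and the `min_hits` threshold are assumptions for illustration.

```python
# Phrases seen in the Cloudflare interstitial output quoted in this thread.
CHALLENGE_MARKERS = (
    "verifying you are human",
    "needs to review the security of your connection",
    "performance & security by",
)


def looks_like_challenge(text: str, min_hits: int = 2) -> bool:
    """Return True if the text contains at least `min_hits` challenge phrases."""
    lowered = text.lower()
    hits = sum(marker in lowered for marker in CHALLENGE_MARKERS)
    return hits >= min_hits
```

A caller could run this on the crawl result's markdown and retry with a different strategy when it returns True. Other bot-protection vendors use different wording, so the marker list would need extending per site.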
Hey! If you’re trying to get past bot protection, you’ve got a couple of solid options with Crawl4AI:
- Use Stealth Mode: We ship a built‑in stealth mode that randomizes fingerprints and reduces detectable automation signals. You can enable it in your config or from the CLI. Full docs with examples: https://docs.crawl4ai.com/advanced/undetected-browser/
- Add CAPTCHA Solving (e.g. CapSolver): For sites that still challenge you with a CAPTCHA, you can plug in CapSolver or another provider to solve it automatically. We have a ready-made example and walkthrough here: https://github.com/unclecode/crawl4ai/tree/main/docs/examples/capsolver_captcha_solver
Try stealth first, and layer in CapSolver if the site still blocks you. Thanks!
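As a starting point for the "stealth first" suggestion, here is a minimal sketch. It assumes a release in which `BrowserConfig` accepts `enable_stealth`, as the linked stealth docs describe; that option may not exist in older versions such as 0.5.0.post8, so check the docs for your installed version. The helper names `stealth_browser_kwargs` and `crawl_with_stealth` are hypothetical.

```python
import asyncio


def stealth_browser_kwargs(headless: bool = False) -> dict:
    """Build BrowserConfig keyword arguments for a stealth-mode session."""
    return {
        "enable_stealth": True,  # assumption: supported by the installed release
        "headless": headless,    # a visible browser is harder to flag than headless
    }


async def crawl_with_stealth(url: str) -> str:
    """Crawl one URL with stealth mode on; return markdown, or "" on failure."""
    # Imported lazily so the helper above works even without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, BrowserConfig

    browser_config = BrowserConfig(**stealth_browser_kwargs())
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url)
        return str(result.markdown) if result.success else ""


# Example call:
# text = asyncio.run(crawl_with_stealth("https://example.com"))
```

If the returned text is still a verification page, the CapSolver example linked above is the next layer to try.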