AutoGPT
Add reCAPTCHA and 'I am not a robot' check to web scraper. Related to issues #2293
Background
This pull request adds a check for CAPTCHA or "I am not a robot" prompts to the Selenium web scraper, to handle sites that block automated access. Previously, the scraper would keep running when it hit such a check and scrape the challenge page instead of the real content, leading to inaccurate results. With this change, the scraper can collect accurate data from these websites.
Changes
The scrape_text_with_selenium function in the selenium_scraping.py module now checks for CAPTCHA or "I am not a robot" prompts by searching for those strings in the page source. If either string is found, the code pauses execution and prompts the user to complete the check before continuing. This lets the scraper collect accurate data from websites that have bot detection in place.
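A minimal sketch of the check described above, assuming the detection is a plain case-insensitive substring search. The names `BLOCK_MARKERS`, `needs_human_check`, and `pause_for_human` are hypothetical and are not taken from the PR's diff; in the scraper, the string passed in would be `driver.page_source`.

```python
# Hypothetical helpers illustrating the check; names are not from the PR diff.
BLOCK_MARKERS = ("captcha", "i am not a robot")


def needs_human_check(page_source: str) -> bool:
    """Return True if the page source mentions a CAPTCHA or robot check."""
    lowered = page_source.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def pause_for_human(page_source: str, prompt=input) -> bool:
    """Prompt the user if a check is detected; return whether we paused."""
    if not needs_human_check(page_source):
        return False
    # `prompt` defaults to input(), which blocks until the user presses Enter.
    prompt(
        "Please complete the CAPTCHA or 'I am not a robot' check, "
        "then press Enter to continue..."
    )
    return True
```

Because this is a substring match, a page that merely mentions the word "captcha" in its text or scripts would also trigger the pause.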
Documentation
The changes are documented in the code using comments that describe the purpose and functionality of the added check. Additionally, this pull request includes documentation in the form of a README file that explains how to use the web scraper and how to contribute to the project.
Test Plan
To test this functionality, we ran the web scraper on several websites that have CAPTCHA or "I am not a robot" checks in place. In each case, the scraper paused execution and prompted the user to complete the check before continuing. We also tested the scraper on websites without such checks to ensure that the functionality was not impacted by the added check.
PR Quality Checklist
- [x] My pull request is atomic and focuses on a single change.
- [x] I have thoroughly tested my changes with multiple different prompts.
- [x] I have considered potential risks and mitigations for my changes.
- [x] I have documented my changes clearly and comprehensively.
- [x] I have not snuck in any "extra" small tweaks or changes.
Also stupid Cookie Notices need to be ignored too...
@hugo4711 Can you try this and let me know if it works? I am at work so I can't test it.
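```python
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

page_source = driver.page_source.lower()
if "captcha" in page_source or "i am not a robot" in page_source or "cookie" in page_source:
    try:
        # find_element_by_xpath was removed in Selenium 4.3; use find_element(By.XPATH, ...) instead.
        accept_cookies_button = driver.find_element(By.XPATH, "//button[contains(text(), 'Accept') or contains(text(), 'I accept') or contains(text(), 'Agree') or contains(text(), 'OK')]")
        accept_cookies_button.click()
    except WebDriverException:
        # No matching button, or it could not be clicked; fall through to the manual prompt.
        pass
    input("Please complete the CAPTCHA, 'I am not a robot' check, or accept the cookies and press Enter to continue...")
```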
@nanaofosu Could you point me to where I need to put that code?
This is a mass message from the AutoGPT core team. Our apologies for the ongoing delay in processing PRs. This is because we are re-architecting the AutoGPT core!
For more details (and for info on joining our Discord), please refer to: https://github.com/Significant-Gravitas/Auto-GPT/wiki/Architecting
A better alternative would be to use Selenium's waits ("awaiters") instead. That way, you don't have to press Enter to continue.
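A rough sketch of the waiting idea, assuming "awaiters" refers to polling until the check disappears rather than blocking on `input()`. The function `wait_until_unblocked` and its parameters are hypothetical and use only the standard library; Selenium's own `WebDriverWait(driver, timeout).until(...)` offers the same polling pattern and would be the natural choice inside the scraper.

```python
import time
from typing import Callable

# Same assumed markers as the substring check in the PR description.
BLOCK_MARKERS = ("captcha", "i am not a robot")


def wait_until_unblocked(
    get_page_source: Callable[[], str],
    timeout: float = 120.0,
    poll_interval: float = 0.5,
) -> bool:
    """Poll until no block marker remains in the page source.

    Returns True once the page is clear, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = get_page_source().lower()
        if not any(marker in source for marker in BLOCK_MARKERS):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

With a live driver this could be called as `wait_until_unblocked(lambda: driver.page_source)`. Someone still has to solve the check in the browser window, so this removes the Enter keypress but not the human.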
Hey, thanks for the great work! We apologize for not getting to this sooner. Unfortunately, this is now a bit outdated, as the browser now starts in headless mode by default. Could you think of a new implementation?
@nanaofosu do you plan on updating this as requested by @gravelBridge?
blocked: Needs to be updated to support headless mode
Deployment failed with the following error:
Resource is limited - try again in 9 hours (more than 100, code: "api-deployments-free-per-day").
Codecov Report
Patch coverage has no change and project coverage change: -0.04 :warning:
Comparison is base (463dc54) 69.71% compared to head (fe6e835) 69.67%.
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2299      +/-   ##
==========================================
- Coverage   69.71%   69.67%   -0.04%
==========================================
  Files          72       72
  Lines        3560     3562       +2
  Branches      569      570       +1
==========================================
  Hits         2482     2482
- Misses        889      890       +1
- Partials      189      190       +1
Impacted Files | Coverage Δ | |
---|---|---|
autogpt/commands/web_selenium.py | 81.81% <0.00%> (-1.52%) | :arrow_down: |
This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.