Add reCAPTCHA and 'I am not a robot' check to web scraper. Related to issues #2293

Open nanaofosu opened this issue 1 year ago • 6 comments

Background

This pull request adds a check for CAPTCHA or "I am not a robot" prompts to the Selenium web scraper, to handle sites that block automated visitors with a human-verification step. Previously, the scraper would continue to run despite such checks, leading to inaccurate results. With this change, the scraper can collect accurate data from those websites.

Changes

The scrape_text_with_selenium function in the selenium_scraping.py module was modified to check for CAPTCHA or "I am not a robot" prompts by searching for those strings in the page source. If either string is found, the code pauses execution and prompts the user to complete the check before continuing. This change lets the scraper collect accurate data from websites that use human-verification checks.
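
As a rough sketch of the check described above (not the actual diff from this pull request), the string search and pause could look like the following. The helper names `needs_human_check` and `pause_for_human_check` and the `MARKERS` tuple are hypothetical, and the case-insensitive matching is an assumption; only the two search strings come from the description.

```python
from typing import Callable

# Strings the description says are searched for in the page source.
MARKERS = ("captcha", "i am not a robot")


def needs_human_check(page_source: str) -> bool:
    """Return True if the page source mentions a CAPTCHA or robot check."""
    lowered = page_source.lower()
    return any(marker in lowered for marker in MARKERS)


def pause_for_human_check(page_source: str, prompt: Callable[[str], str] = input) -> bool:
    """Block until the user confirms the check is done; return True if a pause happened."""
    if not needs_human_check(page_source):
        return False
    prompt("Please complete the CAPTCHA or 'I am not a robot' check and press Enter to continue...")
    return True
```

Inside the scraper this would presumably be called as `pause_for_human_check(driver.page_source)` right after the page loads. A plain substring match can also fire on pages that merely mention the word "captcha", so it may pause more often than strictly needed.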

Documentation

The changes are documented in the code using comments that describe the purpose and functionality of the added check. Additionally, this pull request includes documentation in the form of a README file that explains how to use the web scraper and how to contribute to the project.

Test Plan

To test this functionality, we ran the web scraper on several websites that have CAPTCHA or "I am not a robot" checks in place. In each case, the scraper paused execution and prompted the user to complete the check before continuing. We also tested the scraper on websites without such checks to ensure that the functionality was not impacted by the added check.

PR Quality Checklist

  • [x] My pull request is atomic and focuses on a single change.
  • [x] I have thoroughly tested my changes with multiple different prompts.
  • [x] I have considered potential risks and mitigations for my changes.
  • [x] I have documented my changes clearly and comprehensively.
  • [x] I have not snuck in any "extra" small tweaks or changes.

nanaofosu avatar Apr 18 '23 03:04 nanaofosu

Also stupid Cookie Notices need to be ignored too...

hugo4711 avatar Apr 18 '23 17:04 hugo4711

Also stupid Cookie Notices need to be ignored too...

@hugo4711 Can you try this and let me know if it works? I am at work so I can't test it.

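    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.common.by import By

    page_source = driver.page_source.lower()
    if any(marker in page_source for marker in ("captcha", "i am not a robot", "cookie")):
        try:
            # Selenium 4.3+ removed find_element_by_xpath; use find_element(By.XPATH, ...).
            accept_cookies_button = driver.find_element(By.XPATH, "//button[contains(text(), 'Accept') or contains(text(), 'I accept') or contains(text(), 'Agree') or contains(text(), 'OK')]")
            accept_cookies_button.click()
        except WebDriverException:
            # No matching button, or it could not be clicked; fall through to the manual prompt.
            pass
        input("Please complete the CAPTCHA, 'I am not a robot' check, or accept the cookies and press Enter to continue...")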

nanaofosu avatar Apr 18 '23 19:04 nanaofosu

@nanaofosu Could you point me to where I need to put that code?

hugo4711 avatar Apr 19 '23 11:04 hugo4711

This is a mass message from the AutoGPT core team. Our apologies for the ongoing delay in processing PRs. This is because we are re-architecting the AutoGPT core!

For more details (and for info on joining our Discord), please refer to: https://github.com/Significant-Gravitas/Auto-GPT/wiki/Architecting

p-i- avatar May 05 '23 00:05 p-i-

A better alternative to this would be to use Selenium Awaiters instead. That way, you don't have to press "enter" to continue
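
Assuming "Selenium Awaiters" refers to Selenium's explicit waits (`WebDriverWait`), a minimal sketch might poll until the check markers disappear from the page source instead of blocking on `input()`. The function names and the 120-second timeout are illustrative assumptions, and a person still needs a visible (non-headless) browser window to solve the check.

```python
MARKERS = ("captcha", "i am not a robot")


def human_check_cleared(driver) -> bool:
    """Wait condition: True once no check marker remains in the page source."""
    page_source = driver.page_source.lower()
    return not any(marker in page_source for marker in MARKERS)


def wait_for_human_check(driver, timeout: float = 120.0) -> None:
    """Poll the page until the check is gone; raises TimeoutException on timeout."""
    # Imported here so human_check_cleared can be exercised without Selenium installed.
    from selenium.webdriver.support.ui import WebDriverWait

    WebDriverWait(driver, timeout, poll_frequency=1.0).until(human_check_cleared)
```

A page that keeps a reCAPTCHA script tag in its source after the check is solved would never satisfy this condition, so a real implementation would probably wait on a specific element rather than on the whole page source.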

anonhostpi avatar May 05 '23 04:05 anonhostpi

Hey, thanks for the great work! We apologize for not getting to this sooner. Unfortunately, this is now a bit outdated, since the browser now starts headless by default. Could you think of a new implementation?

gravelBridge avatar May 16 '23 16:05 gravelBridge

@nanaofosu do you plan on updating this as requested by @gravelBridge ?

ntindle avatar Jun 07 '23 04:06 ntindle

blocked: Needs to be updated to support headless mode

ntindle avatar Jun 07 '23 04:06 ntindle

Deployment failed with the following error:

Resource is limited - try again in 9 hours (more than 100, code: "api-deployments-free-per-day").

vercel[bot] avatar Jun 07 '23 04:06 vercel[bot]

Codecov Report

Patch coverage has no change and project coverage change: -0.04 :warning:

Comparison is base (463dc54) 69.71% compared to head (fe6e835) 69.67%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2299      +/-   ##
==========================================
- Coverage   69.71%   69.67%   -0.04%     
==========================================
  Files          72       72              
  Lines        3560     3562       +2     
  Branches      569      570       +1     
==========================================
  Hits         2482     2482              
- Misses        889      890       +1     
- Partials      189      190       +1     
Impacted Files Coverage Δ
autogpt/commands/web_selenium.py 81.81% <0.00%> (-1.52%) :arrow_down:


codecov[bot] avatar Jun 07 '23 05:06 codecov[bot]

This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.

github-actions[bot] avatar Aug 19 '23 15:08 github-actions[bot]