
WebSurfer Updated (Selenium, Playwright, and support for many filetypes)

Open afourney opened this issue 1 year ago • 11 comments

Why are these changes needed?

This PR adds Selenium and Playwright variants of the Markdown Web Browser used by WebSurfer. It also adds support for many additional content types and for alternate search engines.

All MarkdownBrowser variants work on the same principle: 1. Fetch a page, 2. Convert it to Markdown, 3. Operate on the Markdown.

Such browsers are simple and suited to read-only agentic use; they cannot be used to interact with complex web applications. Nevertheless, they are a great stopgap, and they are especially useful for browsing local files (file:///user/afourney/repos/autogen, etc.) because they handle many different file formats (Office docs, PDFs, etc.) and provide a common interface for Q&A, summarization, passage extraction, and similar tasks.
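The fetch, convert, operate principle above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the code in this PR: `fetch_html`, `html_to_markdown`, and `TinyMarkdownBrowser` are hypothetical names, the converter handles only headings, paragraphs, and links, and the "operate" step is reduced to paging through a fixed-size viewport and searching for text.

```python
import re
import urllib.request
from html.parser import HTMLParser
from typing import Callable, List, Optional


def fetch_html(url: str) -> str:
    """Step 1: fetch a page. Plain urllib here; http(s):// and file:// URLs both work."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class _MarkdownExtractor(HTMLParser):
    """Deliberately tiny converter: headings, paragraphs, and links only."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append(f"]({self._href or ''})")
            self._href = None

    def handle_data(self, data):
        # Collapse runs of whitespace, as an HTML renderer would.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    """Step 2: convert the fetched HTML to Markdown."""
    parser = _MarkdownExtractor()
    parser.feed(html)
    parser.close()
    blocks = [block.strip() for block in "".join(parser.parts).split("\n\n")]
    return "\n\n".join(block for block in blocks if block)


class TinyMarkdownBrowser:
    """Step 3: operate on the Markdown (viewport paging and text search)."""

    def __init__(self, fetch: Callable[[str], str] = fetch_html, viewport_size: int = 1024):
        self._fetch = fetch  # swap in a Selenium- or Playwright-backed fetcher here
        self.viewport_size = viewport_size
        self.page = ""
        self._offset = 0

    def visit(self, url: str) -> str:
        self.page = html_to_markdown(self._fetch(url))
        self._offset = 0
        return self.viewport

    @property
    def viewport(self) -> str:
        return self.page[self._offset:self._offset + self.viewport_size]

    def page_down(self) -> str:
        if self._offset + self.viewport_size < len(self.page):
            self._offset += self.viewport_size
        return self.viewport

    def find(self, text: str) -> List[str]:
        return [line for line in self.page.splitlines() if text.lower() in line.lower()]
```

Because the fetcher is just a callable that returns HTML, the same read-only interface applies whether the page comes from a plain HTTP request, a rendered browser session, or a local file. For example, `TinyMarkdownBrowser(fetch=my_fetcher).visit(url)` would work with any hypothetical `my_fetcher` that returns an HTML string.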

Instructions

When installing AutoGen, use the [websurfer] optional dependencies.

If using Selenium, you must also run pip install selenium.

If using Playwright, you must run two commands: pip install playwright, then playwright install --with-deps chromium.

Related issue number

#1481, #1534, #1733, #1832

afourney avatar Mar 09 '24 07:03 afourney

Codecov Report

Attention: Patch coverage is 60.62133%, with 469 lines in your changes missing coverage. Please review.

Project coverage is 50.75%. Comparing base (c3193f8) to head (6ba05c9).

| Files | Patch % | Lines |
|---|---|---|
| autogen/browser_utils/mdconvert.py | 71.69% | 133 Missing and 34 partials :warning: |
| autogen/browser_utils/markdown_search.py | 22.98% | 120 Missing and 4 partials :warning: |
| autogen/browser_utils/requests_markdown_browser.py | 71.84% | 51 Missing and 16 partials :warning: |
| ...togen/browser_utils/playwright_markdown_browser.py | 27.41% | 45 Missing :warning: |
| autogen/agentchat/contrib/web_surfer.py | 46.15% | 23 Missing and 5 partials :warning: |
| autogen/browser_utils/selenium_markdown_browser.py | 35.71% | 27 Missing :warning: |
| autogen/browser_utils/abstract_markdown_browser.py | 71.79% | 11 Missing :warning: |

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1929       +/-   ##
===========================================
+ Coverage   37.94%   50.75%   +12.80%     
===========================================
  Files          77       83        +6     
  Lines        7784     8776      +992     
  Branches     1667     2040      +373     
===========================================
+ Hits         2954     4454     +1500     
+ Misses       4580     3946      -634     
- Partials      250      376      +126     

| Flag | Coverage Δ |
|---|---|
| unittest | 12.75% <0.08%> (?) |
| unittests | 49.80% <60.62%> (+11.86%) :arrow_up: |


codecov-commenter avatar Mar 09 '24 07:03 codecov-commenter

@signalprime @vijaykramesh @INF800

With this PR, I tried to combine your Selenium browser PRs in one place. Even if it doesn't show in the commit history, I used and learned a lot from each of your contributions, and I welcome your further comments and contributions here. Once this is ready, the final PR will credit each of you, and we can perhaps co-author a blog post.

Further, I believe @INF800's and @vijaykramesh's PRs used Selenium to call Bing search, which is clever in that it simplifies the requirements for getting up and running (you don't need to register for an API key). However, I opted to leave this out in favor of the API because it is a better fit for our automated use. Bing actively discourages scraping, and supporting that approach long term would involve actively evading bot detection. I am open to adding further modularity and configurability to support other search engines that don't require an API key, perhaps DuckDuckGo, arXiv, etc.
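One way to picture the modularity discussed here is a small search interface that any engine can implement, with results rendered to Markdown so the browser can page through them like any other document. The sketch below is an assumption for illustration only and does not reflect the actual API of markdown_search.py: `SearchResult`, `SearchEngine`, `StaticSearch`, and `results_to_markdown` are hypothetical names, and `StaticSearch` is an offline stand-in rather than a real Bing or DuckDuckGo client.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class SearchResult:
    title: str
    url: str
    snippet: str


class SearchEngine(Protocol):
    """Anything with a search(query) method can be plugged in (hypothetical interface)."""

    def search(self, query: str) -> List[SearchResult]: ...


class StaticSearch:
    """Offline stand-in engine: matches query terms against a fixed corpus."""

    def __init__(self, corpus: List[SearchResult]) -> None:
        self._corpus = corpus

    def search(self, query: str) -> List[SearchResult]:
        terms = query.lower().split()
        return [
            result
            for result in self._corpus
            if all(term in f"{result.title} {result.snippet}".lower() for term in terms)
        ]


def results_to_markdown(query: str, engine: SearchEngine) -> str:
    """Render search results as a Markdown page the browser can operate on."""
    results = engine.search(query)
    lines = [f"# Search results for '{query}'", ""]
    for index, result in enumerate(results, start=1):
        lines.append(f"{index}. [{result.title}]({result.url})")
        lines.append(f"   {result.snippet}")
    if not results:
        lines.append("No results found.")
    return "\n".join(lines)
```

Under this design, an API-backed engine and a key-free engine would differ only in how `search` obtains its results; the Markdown rendering and everything downstream of it stay the same.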

afourney avatar Mar 09 '24 17:03 afourney

This is great! We appreciate the credits and would love to co-author a blog post about it. A few weeks back I worked on building DuckDuckGo search as an ability/skill that could be attached to agents as needed. I'll need to review the project's current direction to make sure I'm following the agreed-upon path, and I'm glad to help wherever else I can be useful to the project. Thanks @afourney, and nice work!

signalprime avatar Mar 29 '24 04:03 signalprime

@signalprime DuckDuckGo would make a great addition and would be a good test of whether the search mechanism is as easy to extend as I hope.

afourney avatar Mar 29 '24 04:03 afourney

I'm really excited about this one, folks. It looks like there's still a lot to do, for YouTube and more.

Josephrp avatar Apr 01 '24 18:04 Josephrp

@afourney following up on our previous discussion, I'm curious about your plans for making the audio transcription logic in mdconvert.py reusable. Given that I'm currently working on audio capabilities for agents https://github.com/microsoft/autogen/pull/2098, do you have any thoughts on how we could develop a shared audio module?

WaelKarkoub avatar Apr 02 '24 05:04 WaelKarkoub

Thanks for the feedback @davorrunje Super helpful! Will address issues in subsequent commits, asap.

afourney avatar Apr 02 '24 21:04 afourney

> Thanks for the feedback @davorrunje Super helpful! Will address issues in subsequent commits, asap.

@afourney these are minor details. Great work, I am looking forward to testing it in production!

davorrunje avatar Apr 03 '24 05:04 davorrunje

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 10404662 | Triggered | Generic CLI Secret | 802f099588bedf1d022b2bba5fb534635df8e6f1 | .github/workflows/dotnet-release.yml |
| 10404662 | Triggered | Generic CLI Secret | 8a6ebe1cf8749fd9c501fe0949da824c8262fc84 | .github/workflows/dotnet-release.yml |

🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


gitguardian[bot] avatar May 20 '24 20:05 gitguardian[bot]

@afourney looks like the last thing to fix is the Git LFS. Can you install git lfs?

ekzhu avatar May 23 '24 17:05 ekzhu

@afourney before you merge, please check https://github.com/microsoft/autogen/pull/1929/files#r1574300565. It should support a custom Bing Search API URL.

Mai0313 avatar May 29 '24 03:05 Mai0313