crewAI ScrapeWebsiteTool cannot handle cookies

Hi,

I (almost) finished your course on DeepLearning.AI yesterday and I was impressed ;) My first attempt did not succeed, though.

I am just trying to get an agent to analyze a job posting and give a structured output of the requirements, like in L7_job_application_crew.ipynb from the presentation. I just copied the agent and task:

from crewai import Agent, Task, Crew
from crewai_tools import ScrapeWebsiteTool

scrape_tool = ScrapeWebsiteTool()
researcher = Agent(
    role="Tech Job Researcher",
    goal="Make sure to do amazing analysis on "
         "job posting to help job applicants",
    tools = [scrape_tool],
    verbose=True,
    backstory=(
        "As a Job Researcher, your prowess in "
        "navigating and extracting critical "
        "information from job postings is unmatched. "
        "Your skills help pinpoint the necessary "
        "qualifications and skills sought "
        "by employers, forming the foundation for "
        "effective application tailoring."
    )
)

analyze_task = Task(
    description=(
        "Analyze the job posting URL provided ({job_posting_url}) "
        "to extract key skills, experiences, and qualifications "
        "required. Use the tools to gather content and identify "
        "and categorize the requirements."
    ),
    expected_output=(
        "A structured list of job requirements, including necessary "
        "skills, qualifications, and experiences."
    ),
    agent=researcher,
    # async_execution=True
)

req_crew = Crew(
    agents = [researcher],
    tasks = [analyze_task],
    verbose = True,
    full_output = True
)

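inputs = {
    'job_posting_url': 'https://hu.indeed.com/viewjob?jk=44678430abbc6f69&tk=1hufoopq6ojdt85p&from=serp&vjs=3',
}

# Run the crew; the URL is substituted into the task description
result = req_crew.kickoff(inputs=inputs)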

But when running the crew, the output is:

> Entering new CrewAgentExecutor chain...
I should start by extracting the content of the job posting from the provided URL to analyze the key skills, experiences, and qualifications required.

Action: Read website content
Action Input: {"website_url": "https://hu.indeed.com/viewjob?jk=44678430abbc6f69&tk=1hufoopq6ojdt85p&from=serp&vjs=3"} 

Just a moment...Enable JavaScript and cookies to continue

Final Answer: Just a moment...Enable JavaScript and cookies to continue

> Finished chain.

Did it get stuck when asked to enable JavaScript and cookies?

May 22 '24 14:05 gkzsolt

Looking at the code of ScrapeWebsiteTool, it does indeed get stuck: it is a simple requests.get call. By the way, the content of the site in question can be obtained even without enabling cookies, but there are other problems: the URL redirects (301) and the site has some primitive but effective scraping protection.
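To make the diagnosis above concrete, here is a minimal sketch of how one might inspect what a plain requests.get call got back: the redirect chain, the final status, and whether the body looks like an anti-bot page. The helper name describe_response and the marker list are my own assumptions (the phrases are copied from the output quoted in this thread), and the fake response, including its 403 status, is invented so the snippet runs without network access.

```python
from types import SimpleNamespace

# Phrases copied from the tool output quoted in this thread.
# Assumption: this list is illustrative and certainly not exhaustive.
CHALLENGE_MARKERS = (
    "Just a moment...",
    "Enable JavaScript and cookies to continue",
)


def describe_response(response):
    """Summarise a requests-style response: status, redirects, challenge page."""
    # response.history holds the intermediate responses of any redirects
    redirects = [r.status_code for r in response.history]
    body = response.text.lower()
    blocked = any(marker.lower() in body for marker in CHALLENGE_MARKERS)
    return {
        "status": response.status_code,
        "redirects": redirects,
        "blocked": blocked,
    }


# Offline stand-in for a real response; the status codes here are made up.
fake = SimpleNamespace(
    status_code=403,
    history=[SimpleNamespace(status_code=301)],
    text="Just a moment...Enable JavaScript and cookies to continue",
)
print(describe_response(fake))
```

With the requests package installed, the same helper should accept a real response, e.g. describe_response(requests.get(url, timeout=10)), since requests.Response exposes status_code, history and text.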

May 26 '24 19:05 gkzsolt

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Aug 17 '24 12:08 github-actions[bot]

I wanted to follow up on my previous issue as there hasn’t been any response for a while. I understand that the community may be busy, but it’s possible that the ScrapeWebsiteTool works as intended, or perhaps it hasn't been widely used yet.

I would appreciate any insights or guidance you could offer, as I haven’t received much help so far. Thank you for your attention to this matter—I hope to hear from you soon!

Aug 17 '24 19:08 gkzsolt

Many users use the ScrapeWebsiteTool without problems. Can you clarify what is happening?

You are using the ScrapeWebsiteTool to try and scrape https://hu.indeed.com/viewjob?jk=44678430abbc6f69&tk=1hufoopq6ojdt85p&from=serp&vjs=3 but you are seeing an issue with "Enable JavaScript and cookies to continue"?

Is that URL the one used in the course?

Aug 19 '24 02:08 theCyberTech

You are using the ScrapeWebsiteTool to try and scrape https://hu.indeed.com/viewjob?jk=44678430abbc6f69&tk=1hufoopq6ojdt85p&from=serp&vjs=3 but you are seeing an issue with "Enable JavaScript and cookies to continue"?

Exactly. The URL above is outdated now; a current one is https://hu.indeed.com/cmp/Interactive-Brokers?from=mobviewjob&tk=1i5kqbm34jtgm80m&fromjk=5938e909fc3c1ed6&attributionid=mobvjcmp

The crew's answer to this is:

> Entering new CrewAgentExecutor chain...
I should start by reading the content of the job posting URL to extract key skills, experiences, and qualifications required.

Action: Read website content
Action Input: {"website_url": "https://hu.indeed.com/cmp/Interactive-Brokers?from=mobviewjob&tk=1i5kqbm34jtgm80m&fromjk=5938e909fc3c1ed6&attributionid=mobvjcmp"} 

Security Check - Indeed.com
 Find jobs Company reviews Find salaries Sign in Upload your resume Sign in Employers Post Job Find jobs Company reviews Find salaries Additional Verification Required Please turn JavaScript on and reload the page.Please enable Cookies and reload the page. Your Ray ID for this request is 8b58abe5df1f68b5 Need more help? Contact us

Final Answer: Unfortunately, the tool was unable to access the content of the job posting URL provided. As a result, I was unable to extract the key skills, experiences, and qualifications required for the job.

There is an "Accept or reject cookies" layer on the web page, which confuses the ScrapeWebsiteTool, even when the content is still accessible (without acting on the cookies).

Is that URL the one used in the course?

Of course not. I assume the ScrapeWebsiteTool can scrape other URLs as well :-)
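On the cookie side of this, a rough sketch of what "handling cookies" could mean at the HTTP level, using only the standard library: an opener that stores cookies set by the server and sends them back on later requests, plus a browser-like User-Agent header. The function name build_cookie_opener and the user agent string are placeholders of mine, not part of crewAI. Note the caveats: this does not execute JavaScript, so an "Additional Verification Required" page like the one above will most likely still appear, and an "Accept or reject cookies" banner is a page overlay rather than an HTTP cookie problem.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener(user_agent):
    """Return (opener, jar): an opener that stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Replace the default "Python-urllib/x.y" header with a browser-like one
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


# Placeholder user agent string, chosen only for illustration
opener, jar = build_cookie_opener("Mozilla/5.0 (example user agent)")

# A real fetch would look like this (left commented out: needs network access)
# html = opener.open("https://example.com", timeout=10).read().decode("utf-8", "replace")

print(len(jar))  # number of cookies stored so far; none before any request
```

Every request made through the same opener shares the jar, so a cookie set on a redirect or on a first visit is sent again on the next call.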

Aug 19 '24 08:08 gkzsolt

@theCyberTech, have you had a chance to confirm the issue? Is there any additional information I can provide?

Aug 24 '24 17:08 gkzsolt

Dear Expert, Please help me solve this problem.

from crewai_tools import ScrapeWebsiteTool

# To enable scraping any website it finds during its execution
tool = ScrapeWebsiteTool()

# Initialize the tool with the website URL, so the agent can only scrape the content of the specified website
tool = ScrapeWebsiteTool(website_url='https://pitchbook.com/profiles/person/313483-33P')

# Extract the text from the site
text = tool.run()
print(text)

Using Tool: Read website content Just a moment...Enable JavaScript and cookies to continue

from crewai_tools import SeleniumScrapingTool

# Example 1: Initialize the tool without any parameters to scrape the current page it navigates to
tool = SeleniumScrapingTool()

# Example 2: Scrape the entire webpage of a given URL
tool = SeleniumScrapingTool(website_url='https://pitchbook.com/profiles/person/313483-33P')

text = tool.run()
print(text)

Using Tool: Read a website content pitchbook.com Verify you are human by completing the action below. pitchbook.com needs to review the security of your connection before proceeding. Ray ID: 8c57332edd63d1d1 Performance & security by Cloudflare

import requests
from bs4 import BeautifulSoup

# URL of the website
url = 'https://pitchbook.com/profiles/person/313483-33P'

# Send a GET request to the specified URL
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')  

# Extract the text from the site
text = soup.get_text()
print(text)

Just a moment...Enable JavaScript and cookies to continue
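All three attempts above return a verification page rather than the profile. One way to avoid handing such a page to the agent as if it were content is a small guard that tries several fetchers in order and fails loudly when every one of them is blocked. This is only a sketch: is_challenge, first_real_content and the marker phrases are hypothetical names and assumptions of mine, and the fetchers below are stand-ins that return the outputs quoted in this post instead of calling the real tools.

```python
# Phrases taken from the outputs quoted in this thread (assumed, not exhaustive)
MARKERS = (
    "just a moment",
    "enable javascript and cookies",
    "verify you are human",
)


def is_challenge(text):
    """Return True if the text looks like an anti-bot verification page."""
    lowered = text.lower()
    return any(marker in lowered for marker in MARKERS)


def first_real_content(url, fetchers):
    """Try each (name, fetch) pair in order; return the first non-challenge text."""
    blocked = []
    for name, fetch in fetchers:
        text = fetch(url)
        if not is_challenge(text):
            return name, text
        blocked.append(name)
    raise RuntimeError(
        f"All fetchers were blocked for {url}: {', '.join(blocked)}"
    )


# Stand-in fetchers that replay the outputs shown above, so this runs offline
fetchers = [
    ("ScrapeWebsiteTool",
     lambda url: "Just a moment...Enable JavaScript and cookies to continue"),
    ("SeleniumScrapingTool",
     lambda url: "pitchbook.com Verify you are human by completing the action below."),
]

try:
    first_real_content("https://pitchbook.com/profiles/person/313483-33P", fetchers)
except RuntimeError as err:
    print(err)
```

In a real setup each fetch callable would wrap one of the tools or the requests code above; the guard itself only decides whether the returned text is worth passing on.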

Sep 19 '24 09:09 wahidur028

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Oct 19 '24 12:10 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

Oct 24 '24 12:10 github-actions[bot]