Unable to scrape some sites like https://www.cnn.com/

Open · praveenspdocu3c opened this issue 9 months ago • 1 comment

I've been building a web scraper with ScrapeGraph-AI, using AzureChatOpenAI from langchain_openai as the LLM. I've tested it against a variety of webpage URLs and it performs well. However, for certain sites, such as https://www.cnn.com/ and https://olympics.com/, ScrapeGraph-AI returns an empty value as output. Could anyone help me resolve this? Thanks in advance.

Example code (taken from GitHub):

import os
from dotenv import load_dotenv
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

llm_model_instance = AzureChatOpenAI(
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
)

embedder_model_instance = AzureOpenAIEmbeddings(
    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)

graph_config = {
    "llm": {"model_instance": llm_model_instance},
    "embeddings": {"model_instance": embedder_model_instance}
}

smart_scraper_graph = SmartScraperGraph(
    prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time, event_end_date, event_end_time, location, event_mode, event_category, third_party_redirect, no_of_days, time_in_hours, hosted_or_attending, refreshments_type, registration_available, registration_link""",
    # also accepts a string with the already downloaded HTML code
    source="https://www.hmhco.com/event",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

praveenspdocu3c, May 13 '24 18:05

Hey, try setting the following in the graph_config and see what the output shows:

"verbose":True,
"headless":False

PeriniM, May 13 '24 19:05
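
For anyone trying the suggestion above, here is a minimal sketch of how the two keys might be merged into the graph_config from the original post. The helper name with_debug_flags is hypothetical (not part of ScrapeGraph-AI), and the placeholder strings stand in for the real AzureChatOpenAI and AzureOpenAIEmbeddings instances so the snippet runs on its own. It assumes, as the reply suggests, that "verbose" and "headless" sit at the top level of graph_config.

```python
def with_debug_flags(graph_config, verbose=True, headless=False):
    """Return a shallow copy of graph_config with the suggested debug keys added.

    Hypothetical helper for illustration only; not a ScrapeGraph-AI function.
    """
    updated = dict(graph_config)      # leave the caller's dict untouched
    updated["verbose"] = verbose      # ask for more detailed logging
    updated["headless"] = headless    # False: run the browser with a visible window
    return updated


# Placeholders stand in for the model instances built in the original post.
graph_config = {
    "llm": {"model_instance": "<llm_model_instance>"},
    "embeddings": {"model_instance": "<embedder_model_instance>"},
}

debug_config = with_debug_flags(graph_config)
print(debug_config)
```

The resulting debug_config would then be passed as config=debug_config when constructing SmartScraperGraph, in place of the original graph_config. Copying the dict rather than mutating it keeps the original settings available for a normal run afterwards.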