Scrapegraph-ai
About SearchLinkGraph
Even though I applied the filtering parameter, SearchLinkGraph continues to give links with extensions such as .css and .js. What should I do?
graph_config = {
"llm": {
"model": "ollama/llama3.2",
"temperature": 0,
"format": "json",
"base_url": "http://localhost:11434",
},
"verbose": True,
"headless": False,
"filter_config": {
"diff_domain_filter": True,
"img_exts": [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".webp", ".ico"],
"lang_indicators": ["lang=", "/tr", "/en", "/fr", "/pt", "/es", "/de", "/jp", "/it"],
"irrelevant_keywords": [
"/login",
"/signup",
"/register",
"/contact",
"facebook.net",
"twitter.com",
"linkedin.com",
"instagram.com",
".js",
".css",
],
},
}
Hey there, @salihai! 👋 I'm here to help you with any bugs, questions, or contributions while waiting for a human maintainer.
To ensure that the SearchLinkGraph filters out links with extensions such as .css and .js, you should verify that the filter_links setting is enabled. The filtering for irrelevant keywords, including file extensions, is controlled by the filter_links flag. Make sure it is set to True in your configuration.
Here's how you can adjust your configuration:
graph_config = {
"llm": {
"model": "ollama/llama3.2",
"temperature": 0,
"format": "json",
"base_url": "http://localhost:11434",
},
"verbose": True,
"headless": False,
"filter_links": True, # Ensure this is set to True
"filter_config": {
"diff_domain_filter": True,
"img_exts": [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".webp", ".ico"],
"lang_indicators": ["lang=", "/tr", "/en", "/fr", "/pt", "/es", "/de", "/jp", "/it"],
"irrelevant_keywords": [
"/login",
"/signup",
"/register",
"/contact",
"facebook.net",
"twitter.com",
"linkedin.com",
"instagram.com",
".js",
".css",
],
},
}
By setting filter_links to True, the filtering mechanism will be activated, and links containing .js and .css should be filtered out as specified in the irrelevant_keywords list [1].
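If links with unwanted extensions still come back, one workaround is to filter the returned list yourself. The following is a minimal sketch that does not use Scrapegraph-ai's internals: the function name `drop_unwanted_links` and the sample URLs are hypothetical, and it assumes the graph's result can be reduced to a plain list of URL strings.

```python
from urllib.parse import urlparse

# Hypothetical defaults mirroring the extensions mentioned in the question.
BLOCKED_EXTENSIONS = (".js", ".css")


def drop_unwanted_links(links, blocked_exts=BLOCKED_EXTENSIONS, blocked_keywords=()):
    """Return links whose URL path does not end in a blocked extension
    and which do not contain any blocked keyword."""
    kept = []
    for link in links:
        # Compare against the path only, so query strings such as "?v=3"
        # do not hide the extension.
        path = urlparse(link).path.lower()
        if path.endswith(tuple(blocked_exts)):
            continue
        if any(keyword in link for keyword in blocked_keywords):
            continue
        kept.append(link)
    return kept


sample_links = [
    "https://example.com/about",
    "https://example.com/static/app.js?v=3",
    "https://example.com/theme/main.css",
    "https://example.com/login",
]
filtered = drop_unwanted_links(sample_links, blocked_keywords=("/login",))
print(filtered)  # ['https://example.com/about']
```

Matching on the end of the URL path rather than on a substring keeps pages such as `/docs/node.js-tutorial`, which a plain `".js" in link` check would discard.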
To continue talking to Dosu, mention @dosu.
Adding this didn't work.
from smart_scraper_multi_graph import SmartScraperMultiGraph
prompt = "What is Chioggia famous for?"
source = ["https://en.wikipedia.org/wiki/Chioggia", "https://example.com"]
config = {
"llm": {"model": "gpt-3.5-turbo"}
}
smart_scraper_multi = SmartScraperMultiGraph(prompt, source, config)
result = smart_scraper_multi.run()
print(result)
When I ran this code, I got this error message:
[WinError 10061] No connection could be made because the target machine actively refused it
What should I do to solve this?
The error message "[WinError 10061] No connection could be made because the target machine actively refused it" typically indicates that the server you are trying to connect to is not accepting connections. Here are some steps you can take to resolve this issue:
- Check Server Availability: Ensure that the server you are trying to connect to is running and accessible. If you are running a local server, make sure it is started and listening on the correct port.
- Verify URL and Port: Double-check the URL and port you are using in your code. Ensure that they are correct and that the server is configured to accept connections on that port.
- Network and Firewall Settings: Check your network connection and firewall settings. Ensure that your firewall or any security software is not blocking the connection to the server.
- Localhost Configuration: If you are using a local server (e.g., localhost), ensure that it is configured to accept connections from your application. Sometimes, local servers are configured to only accept connections from certain IP addresses.
- Service Status: If you are using a third-party service or API, check their status page to see if there are any ongoing issues or maintenance that might be affecting connectivity.
By addressing these areas, you should be able to resolve the connection issue.
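The first two checks in the list can be scripted. This is a minimal sketch, assuming the refused connection is to the local Ollama endpoint from the earlier configuration (`localhost:11434`). That is an assumption, since the thread does not say which server refused the connection, and the helper name `port_is_open` is hypothetical.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections (WinError 10061 on Windows), timeouts,
        # and name-resolution failures.
        return False


# Assumed endpoint: the base_url used in the graph configuration.
host, port = "localhost", 11434
state = "accepting connections" if port_is_open(host, port) else "not reachable"
print(f"{host}:{port} is {state}")
```

If the port is reported as not reachable, the server is most likely not running or is listening on a different port, so the fix is on the server side and not in the scraping code.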
Hi @salihai, I will update
Hi @salihai, please update to the new beta.
Hi, @salihai. I'm Dosu, and I'm helping the Scrapegraph-ai team manage their backlog. I'm marking this issue as stale.
Issue Summary:
- You reported that the filter_links parameter in SearchLinkGraph is not filtering out .css and .js links.
- I suggested verifying the configuration, but you encountered a connection error ([WinError 10061]).
- I provided troubleshooting steps for the connection issue.
- @VinciGit00 recommended updating to the new beta version to potentially resolve the problem.
Next Steps:
- Please let us know if this issue is still relevant to the latest version of the Scrapegraph-ai repository by commenting here.
- If there is no further activity, this issue will be automatically closed in 7 days.
Thank you for your understanding and contribution!