
About SearchLinkGraph

Open salihai opened this issue 9 months ago • 7 comments

Even though I applied the filtering parameter, SearchLinkGraph continues to give links with extensions such as .css and .js. What should I do?

graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "temperature": 0,
        "format": "json",
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
    "headless": False,
    "filter_config": {
        "diff_domain_filter": True,
        "img_exts": [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".webp", ".ico"],
        "lang_indicators": ["lang=", "/tr", "/en", "/fr", "/pt", "/es", "/de", "/jp", "/it"],
        "irrelevant_keywords": [
            "/login",
            "/signup",
            "/register",
            "/contact",
            "facebook.net",
            "twitter.com",
            "linkedin.com",
            "instagram.com",
            ".js",
            ".css",
        ],
    },
}

salihai avatar Feb 03 '25 19:02 salihai

Hey there, @salihai! 👋 I'm here to help you with any bugs, questions, or contributions while waiting for a human maintainer.

To ensure that the SearchLinkGraph filters out links with extensions such as .css and .js, you should verify that the filter_links setting is enabled. The filtering for irrelevant keywords, including file extensions, is controlled by the filter_links flag. Make sure it is set to True in your configuration.

Here's how you can adjust your configuration:

graph_config = {
    "llm": {
        "model": "ollama/llama3.2",  
        "temperature": 0,  
        "format": "json",  
        "base_url": "http://localhost:11434",
    },
    "verbose": True, 
    "headless": False, 
    "filter_links": True,  # Ensure this is set to True
    "filter_config": {  
        "diff_domain_filter": True,
        "img_exts": [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".webp", ".ico"],
        "lang_indicators": ["lang=", "/tr", "/en", "/fr", "/pt", "/es", "/de", "/jp", "/it"],
        "irrelevant_keywords": [
            "/login",
            "/signup",
            "/register",
            "/contact",
            "facebook.net",
            "twitter.com",
            "linkedin.com",
            "instagram.com",
            ".js",
            ".css",
        ],
    },
}

By setting filter_links to True, the filtering mechanism will be activated, and links containing .js and .css should be filtered out as specified in the irrelevant_keywords list [1].
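If the built-in filtering still lets asset links through, one fallback is to filter the returned links yourself after the graph has run. The following is a minimal sketch, not Scrapegraph-ai's own implementation: `drop_asset_links` and `BLOCKED_EXTS` are hypothetical names, and it assumes the graph result can be reduced to a plain list of URL strings.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical helper, not part of Scrapegraph-ai.
BLOCKED_EXTS = {".js", ".css"}


def drop_asset_links(links, blocked_exts=BLOCKED_EXTS):
    """Return the links whose URL path does not end in a blocked extension."""
    kept = []
    for link in links:
        # Look only at the path, so query strings such as "?v=3" are ignored.
        path = urlparse(link).path
        ext = posixpath.splitext(path)[1].lower()
        if ext not in blocked_exts:
            kept.append(link)
    return kept


# Example input; in practice this would be the list of links returned by the graph.
links = [
    "https://example.com/about",
    "https://example.com/static/app.js?v=3",
    "https://example.com/theme/main.CSS",
    "https://example.com/data/feed.json",
]
print(drop_asset_links(links))
```

Comparing the parsed path's extension, rather than searching for ".js" anywhere in the URL, keeps links such as `feed.json` that merely contain the substring.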

To continue talking to Dosu, mention @dosu.


dosubot[bot] avatar Feb 03 '25 19:02 dosubot[bot]

Adding this didn't work.

salihai avatar Feb 03 '25 19:02 salihai

from smart_scraper_multi_graph import SmartScraperMultiGraph

prompt = "What is Chioggia famous for?"
source = ["https://en.wikipedia.org/wiki/Chioggia", "https://example.com"]
config = {"llm": {"model": "gpt-3.5-turbo"}}

smart_scraper_multi = SmartScraperMultiGraph(prompt, source, config)
result = smart_scraper_multi.run()
print(result)

When I ran this code, I got this error message:

[WinError 10061] No connection could be made because the target machine actively refused it

What should I do to solve this?

salihai avatar Feb 05 '25 14:02 salihai

The error message "[WinError 10061] No connection could be made because the target machine actively refused it" typically indicates that the server you are trying to connect to is not accepting connections. Here are some steps you can take to resolve this issue:

  1. Check Server Availability: Ensure that the server you are trying to connect to is running and accessible. If you are running a local server, make sure it is started and listening on the correct port.

  2. Verify URL and Port: Double-check the URL and port you are using in your code. Ensure that they are correct and that the server is configured to accept connections on that port.

  3. Network and Firewall Settings: Check your network connection and firewall settings. Ensure that your firewall or any security software is not blocking the connection to the server.

  4. Localhost Configuration: If you are using a local server (e.g., localhost), ensure that it is configured to accept connections from your application. Sometimes, local servers are configured to only accept connections from certain IP addresses.

  5. Service Status: If you are using a third-party service or API, check their status page to see if there are any ongoing issues or maintenance that might be affecting connectivity.

By addressing these areas, you should be able to resolve the connection issue.
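The first two checks can be scripted. Below is a minimal sketch using only the Python standard library. `port_is_open` is a hypothetical helper, and the host and port are assumptions taken from the Ollama `base_url` in the earlier config; the thread does not confirm which endpoint actually refused the connection.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # which is what WinError 10061 maps to) when nothing is listening.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed endpoint: the local Ollama server from the earlier config.
print(port_is_open("localhost", 11434))
```

If this prints False, nothing is accepting connections on that port, which points to the service not running or listening elsewhere rather than to a problem in the scraping code.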


dosubot[bot] avatar Feb 05 '25 14:02 dosubot[bot]

Hi @salihai, I will update

VinciGit00 avatar Feb 13 '25 11:02 VinciGit00

Hi @salihai, please update to the new beta.

VinciGit00 avatar Feb 13 '25 11:02 VinciGit00

Hi, @salihai. I'm Dosu, and I'm helping the Scrapegraph-ai team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • You reported that SearchLinkGraph still returns .css and .js links even though filter_config is applied.
  • I suggested setting filter_links to True, which you reported did not work. You then encountered a connection error ([WinError 10061]) when running SmartScraperMultiGraph.
  • I provided troubleshooting steps for the connection issue.
  • @VinciGit00 recommended updating to the new beta version to potentially resolve the problem.

Next Steps:

  • Please let us know if this issue is still relevant to the latest version of the Scrapegraph-ai repository by commenting here.
  • If there is no further activity, this issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

dosubot[bot] avatar May 15 '25 16:05 dosubot[bot]