Change proxy rotation function

Open VinciGit00 opened this issue 9 months ago • 8 comments

Is your feature request related to a problem? Please describe. Change proxy rotation function

VinciGit00 avatar May 04 '24 14:05 VinciGit00

@VinciGit00, is that where proxy rotation is supposed to operate?

it does not seem to be used in the code base right now; a proxy IP address is passed manually instead.

from scrapegraphai.nodes import FetchNode


if __name__ == "__main__":
    fetcher = FetchNode(
        input="url | local_dir",
        output=["doc"],
        node_config={
            "headless": False,
            "endpoint": "<proxy-IP-address>"
        }
    )

    state = {"url": "https://twitter.com/home"}
    state = fetcher.execute(state)

    print(state)

DiTo97 avatar May 04 '24 23:05 DiTo97

You should import the proxy rotation function from the utils folder in your main script, or you can use your own.
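
For the "use your own" option, a minimal round-robin sketch might look like the following. The name `rotate_proxies` and the proxy addresses are hypothetical, chosen only for illustration; this is not the function that ships in the utils folder.

```python
from itertools import cycle
from typing import Iterable, Iterator


def rotate_proxies(proxies: Iterable[str]) -> Iterator[str]:
    """Return an endless iterator that walks the proxies in round-robin order.

    Hypothetical helper, not part of scrapegraphai.
    """
    pool = list(proxies)
    if not pool:
        raise ValueError("at least one proxy address is required")
    return cycle(pool)


# Placeholder addresses: replace with proxies you actually control.
rotation = rotate_proxies(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])

# Each call to next() hands out the next proxy, wrapping around at the end.
node_config = {"headless": False, "endpoint": next(rotation)}
```

The resulting `node_config` can then be passed to the node the same way as the hard-coded `"endpoint"` in the snippet above.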

VinciGit00 avatar May 05 '24 06:05 VinciGit00

@VinciGit00 I think the proxy is more of a graph attribute than a node attribute. are we heading towards having some graph configs common to all nodes?

DiTo97 avatar May 05 '24 09:05 DiTo97

We prefer a node attribute because it makes custom graphs easier to configure.
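
The graph-level and node-level views can coexist: shared defaults could live on the graph and be overridden per node. A rough sketch of that idea follows, assuming a hypothetical `merge_node_config` helper; nothing here reflects how the library actually resolves its configuration.

```python
from typing import Any, Dict, Optional


def merge_node_config(
    graph_config: Dict[str, Any],
    node_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return graph-wide defaults overridden by node-specific values.

    Hypothetical helper, not part of scrapegraphai.
    """
    merged = dict(graph_config)       # copy so the shared defaults stay untouched
    merged.update(node_config or {})  # node-level keys win over graph-level ones
    return merged


# Shared defaults for every node in a graph (placeholder values).
common = {"headless": True, "endpoint": "<proxy-IP-address>"}

# A single node overrides only what it needs.
fetch_config = merge_node_config(common, {"headless": False})
```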

VinciGit00 avatar May 05 '24 10:05 VinciGit00

@VinciGit00, so what role are you envisioning for the params that are meant to be common in #125?

DiTo97 avatar May 05 '24 14:05 DiTo97

@VinciGit00, something seems off in the proxy rotation branching, https://github.com/VinciGit00/Scrapegraph-ai/blob/main/scrapegraphai/nodes/fetch_node.py#L79.

.
.
.

if self.node_config is not None and self.node_config.get("endpoint") is not None:
    loader = AsyncChromiumLoader(
        [source],
        proxies={"http": self.node_config["endpoint"]},
        headless=self.headless,
    )
else:
    loader = AsyncChromiumLoader(
        [source],
        headless=self.headless,
    )

.
.
.

if the user provides "endpoint", you pass the proxies argument to AsyncChromiumLoader, but looking at the implementation of that class, no proxies parameter seems to be available in its constructor.
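
One way to check a claim like this without reading the source is to inspect the constructor signature at runtime. The sketch below uses a hypothetical `StandInLoader` class in place of the real loader so that it runs on its own; `accepts_keyword` is likewise an illustrative helper, not a library function.

```python
import inspect
from typing import Any, Callable


def accepts_keyword(func: Callable[..., Any], name: str) -> bool:
    """Return True if calling `func` with keyword `name` would not be rejected.

    A **kwargs catch-all also counts, so True only means the call is accepted,
    not that the argument is actually used.
    """
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True
    param = params.get(name)
    return param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )


class StandInLoader:
    """Hypothetical stand-in for a loader whose constructor lacks `proxies`."""

    def __init__(self, urls, headless=True):
        self.urls = urls
        self.headless = headless


# Only forward `proxies` when the constructor can take it.
kwargs = {"headless": False}
if accepts_keyword(StandInLoader, "proxies"):
    kwargs["proxies"] = {"http": "<proxy-IP-address>"}

loader = StandInLoader(["https://example.com"], **kwargs)
```

Swapping `StandInLoader` for the real loader class would show whether the `proxies` keyword is accepted by the installed version.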

have you ever run the fetch node with a proxy address and ensured that was being used by the webdriver?

if somehow the proxy address is being used, we will switch to bluet/proxybroker2, which is a much more mature library.

DiTo97 avatar May 07 '24 21:05 DiTo97

ok we can switch to that one

VinciGit00 avatar May 08 '24 06:05 VinciGit00

> ok we can switch to that one

we can switch in terms of rotation, but the webdriver loader still doesn't accept proxies as argument.

ATM that line raises an exception!

DiTo97 avatar May 08 '24 07:05 DiTo97

Thanks @DiTo97, I have just merged your PR #211 and added it to the documentation

PeriniM avatar May 13 '24 09:05 PeriniM