Scrapegraph-ai
Change proxy rotation function
Is your feature request related to a problem? Please describe. Change proxy rotation function
@VinciGit00, is that where proxy rotation is supposed to operate?
It seems not to be used in the codebase right now; it has been replaced by manually passing a proxy IP address.
from scrapegraphai.nodes import FetchNode

if __name__ == "__main__":
    fetcher = FetchNode(
        input="url | local_dir",
        output=["doc"],
        node_config={
            "headless": False,
            "endpoint": "<proxy-IP-address>"
        }
    )
    state = {"url": "https://twitter.com/home"}
    state = fetcher.execute(state)
    print(state)
You should import the proxy rotation function from the utils folder in your main script, or you can use your own.
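For anyone going the "use your own" route, here is a minimal sketch of a round-robin rotator feeding the "endpoint" key. The proxy addresses are placeholders and `make_rotator` is a hypothetical helper, not the function shipped in the utils folder.

```python
from itertools import cycle

# Hypothetical proxy pool; these addresses are placeholders, not real proxies.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]


def make_rotator(proxies):
    """Return a function that hands out proxies in round-robin order."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxies)
    return lambda: next(pool)


next_proxy = make_rotator(PROXIES)

# Build one node_config per request, each with the next proxy as "endpoint".
configs = [{"headless": False, "endpoint": next_proxy()} for _ in range(4)]
for config in configs:
    print(config["endpoint"])
```

Each resulting dict could then be passed as `node_config` to a node, as in the snippet above; the fourth request wraps around to the first proxy.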
@VinciGit00 I think the proxy is more of a graph attribute than a node attribute. are we heading towards having some graph configs common to all nodes?
We prefer a node attribute because it makes custom graphs easier to configure.
@VinciGit00, so what's the role you are envisioning for params to be common in #125?
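One way to reconcile the two views would be graph-level defaults that each node can override. This is only a sketch of that idea: `resolve_node_config` is a hypothetical helper and does not reflect what #125 actually proposes.

```python
def resolve_node_config(graph_config, node_config=None):
    """Merge graph-wide defaults with node-level overrides (the node wins)."""
    merged = dict(graph_config)  # copy so the graph defaults stay untouched
    merged.update(node_config or {})
    return merged


# Hypothetical graph-wide settings shared by every node.
graph_config = {"headless": True, "endpoint": "http://10.0.0.1:8080"}

# A single node overrides only what it needs.
fetch_config = resolve_node_config(graph_config, {"headless": False})
print(fetch_config)
```

With this approach the proxy can be set once for the whole graph while custom graphs keep per-node control.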
@VinciGit00, something seems off in the proxy rotation branching, https://github.com/VinciGit00/Scrapegraph-ai/blob/main/scrapegraphai/nodes/fetch_node.py#L79.
.
.
.
if self.node_config is not None and self.node_config.get("endpoint") is not None:
    loader = AsyncChromiumLoader(
        [source],
        proxies={"http": self.node_config["endpoint"]},
        headless=self.headless,
    )
else:
    loader = AsyncChromiumLoader(
        [source],
        headless=self.headless,
    )
.
.
.
If the user provides "endpoint", you pass the proxies argument to AsyncChromiumLoader, but looking at the implementation of that class, no proxies parameter seems to be available in the class constructor.
Have you ever run the fetch node with a proxy address and verified that the webdriver was actually using it?
if somehow the proxy address is being used, we will switch to bluet/proxybroker2, which is a much more mature library.
ok we can switch to that one
We can switch in terms of rotation, but the webdriver loader still doesn't accept proxies as an argument.
ATM that line raises an exception!
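The failure mode can be reproduced without a browser. The sketch below uses `StubLoader`, a hypothetical stand-in whose constructor has no `proxies` parameter (mirroring what is described above, not the real AsyncChromiumLoader), and `accepts_keyword`, an illustrative helper that checks the signature before passing the argument.

```python
import inspect


class StubLoader:
    """Hypothetical stand-in for a loader with no `proxies` parameter."""

    def __init__(self, urls, headless=True):
        self.urls = urls
        self.headless = headless


def accepts_keyword(cls, name):
    """Return True if cls's constructor accepts `name` as a keyword argument."""
    params = inspect.signature(cls).parameters
    has_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    return name in params or has_var_kwargs


# Only pass `proxies` when the constructor can actually take it.
kwargs = {"headless": False}
if accepts_keyword(StubLoader, "proxies"):
    kwargs["proxies"] = {"http": "http://10.0.0.1:8080"}
loader = StubLoader(["https://example.com"], **kwargs)

# Passing it unconditionally fails, as in the "endpoint" branch.
try:
    StubLoader(["https://example.com"], proxies={"http": "http://10.0.0.1:8080"})
    raised = False
except TypeError as exc:
    raised = True
    print(f"TypeError: {exc}")
```

A guard like this only avoids the exception; the proxy is still silently ignored, so the loader itself would need proxy support for the "endpoint" option to have any effect.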
Thanks @DiTo97, I have just merged your PR #211 and added it to the documentation