scrapy-rotating-proxies

Proxies Stuck in unchecked state

Open john-parton opened this issue 7 years ago • 11 comments

After running the crawler for over a day, I still have a lot of proxies in the "unchecked" state.

[rotating_proxies.middlewares] INFO: Proxies(good: 147, dead: 3226, unchecked: 524, reanimated: 167, mean backoff time: 4254s)

It looks like those 524 unchecked proxies are just timing out, but they're not getting moved to dead, so a lot of time is wasted sending requests to them.

I set my timeout pretty low with DOWNLOAD_TIMEOUT = 15.

Let me know if you need anything from me: parts of my crawler, settings, etc.

Thanks.

Edit: I have the BanDetectionMiddleware installed.

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

john-parton, Aug 30 '17 13:08

I have this problem too, lots of unchecked proxies, but I have no dead ones.

[rotating_proxies.middlewares] INFO: Proxies(good: 97, dead: 0, unchecked: 97, reanimated: 6, mean backoff time: 0s)

Edit:

I think the 'problem' is that the proxy is loaded randomly between good and unchecked ones.

The main issue here is that I have DOWNLOAD_DELAY set to 1000 seconds, and according to the docs the delay is now applied per proxy. So am I wrong in saying that, in theory, if I have 100 proxies, each of them should start with one request and then have its own 1000-second delay?

If so, picking a new proxy at random from the good and unchecked ones would slow down the spider. In theory, repeated calls to get_random() could keep returning the same 10 of the 100 proxies, so you'd wait 1000 seconds for each of those 10 while the other 90 sit unused.
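To make the concern above concrete, here is a minimal standalone sketch (not the library's code) comparing independent random draws with a "use every proxy once per round" scheme. The proxy URLs and both function names are made up for illustration. With 100 proxies and 100 independent draws, the expected number of distinct proxies is about 63, since 100 * (1 - 0.99^100) ≈ 63.4, so roughly a third of the pool would go unused in that window.

```python
import random


def distinct_with_replacement(proxies, draws, rng):
    """Count distinct proxies seen when each draw is an independent rng.choice()."""
    return len({rng.choice(proxies) for _ in range(draws)})


def distinct_in_rounds(proxies, draws, rng):
    """Count distinct proxies seen when each proxy is used once per shuffled round."""
    seen = set()
    pending = []
    for _ in range(draws):
        if not pending:
            # start a new round: every proxy becomes eligible again
            pending = list(proxies)
            rng.shuffle(pending)
        seen.add(pending.pop())
    return len(seen)


# hypothetical proxy pool, for illustration only
proxies = ["http://proxy%d.example:8080" % i for i in range(100)]
rng = random.Random(0)

independent = distinct_with_replacement(proxies, 100, rng)
rounds = distinct_in_rounds(proxies, 100, rng)
print("independent draws:", independent, "distinct proxies")
print("round-based draws:", rounds, "distinct proxies")
```

The round-based variant always touches all 100 proxies in 100 draws, which is the behaviour a per-proxy delay benefits from.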

Thoughts on this?

peterlupu, Sep 11 '18 10:09

sorry to shamelessly bump, but bump?

peterlupu, Nov 23 '18 23:11

Have the same problem, bump

pioter83, Mar 22 '19 12:03

Got the same problem. I have another question too: does Scrapy wait until the unchecked count drops to 0 before it starts using good proxies? I ask because the max retry count is 5, and it looks like a good proxy has not been used 6 times.

ErangaD, May 02 '19 06:05

shameless bump

wittedhaddock, Oct 22 '19 18:10

I have the same problem

mapb1994, Oct 29 '19 15:10

Same issue. For me, it appears to be de-duplicating proxies based on host and port. So if several proxies share the same host and port, only one is used and the rest remain unchecked.
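A rough illustration of how that could happen, assuming proxies are tracked in a mapping keyed by "host:port" (this is a sketch of the described symptom, not the library's actual code; the `hostport` helper and the URLs are hypothetical). Entries that differ only in credentials collapse to a single key:

```python
from urllib.parse import urlsplit


def hostport(proxy_url):
    """Return 'host:port' for a proxy URL, ignoring scheme and credentials."""
    parts = urlsplit(proxy_url)
    return "%s:%s" % (parts.hostname, parts.port)


# hypothetical proxies behind one gateway, differing only in credentials
proxy_list = [
    "http://user1:pass1@gate.example:8000",
    "http://user2:pass2@gate.example:8000",
    "http://user3:pass3@gate.example:8000",
]

# keying by host:port keeps only one entry per gateway
by_hostport = {hostport(p): p for p in proxy_list}
print(len(proxy_list), "proxies configured,", len(by_hostport), "distinct host:port")
```

If any state update looks a proxy up by host and port, only one of the three entries can ever be matched, which would be consistent with the others staying "unchecked".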

danjdewhurst, Jan 09 '20 11:01

How about something like this?

import random
from rotating_proxies.middlewares import RotatingProxyMiddleware
from rotating_proxies.expire import Proxies

class MyRotatingProxiesMiddleware(RotatingProxyMiddleware):
    def __init__(self, proxy_list, logstats_interval, stop_if_no_proxies, max_proxies_to_try, backoff_base, backoff_cap):
        super().__init__(proxy_list, logstats_interval, stop_if_no_proxies, max_proxies_to_try, backoff_base, backoff_cap)
        self.proxies = MyProxies(self.cleanup_proxy_list(proxy_list), backoff=self.proxies.backoff)

class MyProxies(Proxies):
    def __init__(self, proxy_list, backoff=None):
        super().__init__(proxy_list, backoff)
        self.chosen = []

    def get_random(self):
        available = list(self.unchecked | self.good)

        if not available:
            return None

        # build the list of good/unchecked proxies not yet used in this round
        not_picked_yet = [x for x in available if x not in self.chosen]
        if not not_picked_yet:
            # every available proxy has been used once: start a new round,
            # so all good+unchecked proxies are eligible again
            self.chosen = []
            not_picked_yet = available

        # randomly pick a proxy that has not been used in this round
        chosen_proxy = random.choice(not_picked_yet)
        # mark as chosen
        self.chosen.append(chosen_proxy)
        return chosen_proxy


Then use MyRotatingProxiesMiddleware.
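For reference, "using" it would presumably mean swapping the stock middleware for the subclass in the project settings. A minimal sketch, assuming the two classes above live in a module importable as `myproject.middlewares` (a hypothetical path; adjust to your project):

```python
# settings.py (sketch)
# "myproject.middlewares" is a hypothetical module path, not part of the library.
DOWNLOADER_MIDDLEWARES = {
    # takes the place of 'rotating_proxies.middlewares.RotatingProxyMiddleware'
    "myproject.middlewares.MyRotatingProxiesMiddleware": 610,
    # ban detection stays as in the original configuration
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```

The priorities 610 and 620 are the ones from the original post; the proxy list settings are left unchanged.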

peterlupu, Jan 09 '20 11:01

Did this work for you?

rajatshenoy56, Aug 19 '20 15:08

bump

timpal0l, Apr 27 '22 11:04

Re "Did this work for you?": iirc, yes.

Re "bump": please check my solution above; it may still work.

peterlupu, Apr 27 '22 11:04