no proxy crawled

Open ericxsun opened this issue 5 years ago • 6 comments

Please provide the following information if applicable:

  • Operating system and its version: Mac OS X 10.10.5, Python 3.6.6

  • Version number of Scylla: 1.1.5

No proxies are crawled:

2019-01-15 - 13:05:59 DEBUG: create new db connection
2019-01-15 - 13:05:59 INFO: Scheduler starts...
2019-01-15 - 13:05:59 DEBUG: feed 8 providers...
2019-01-15 - 13:05:59 INFO: Start python scheduler
2019-01-15 - 13:05:59 INFO: worker_process started
2019-01-15 - 13:05:59 INFO: validator_thread started
2019-01-15 - 13:05:59 DEBUG: fetch_ips...
2019-01-15 - 13:05:59 INFO: Start the web server
[2019-01-15 13:05:59 +0800] [98963] [INFO] Goin' Fast @ http://0.0.0.0:8899
2019-01-15 - 13:05:59 DEBUG: Get a provider from the provider queue: A2uProvider
[2019-01-15 13:05:59 +0800] [98963] [INFO] Starting worker [98963]
2019-01-15 - 13:05:59 INFO: Start forward proxy server on port 8081
2019-01-15 - 13:06:59 DEBUG: Feed 0 proxies from the database for a second time validation

ericxsun avatar Jan 15 '19 05:01 ericxsun

Is your server located in mainland China?

imWildCat avatar Jan 15 '19 06:01 imWildCat

Yes, it is in mainland China. Even with a VPN set up on my router, no proxies are crawled.

ericxsun avatar Jan 16 '19 00:01 ericxsun

Could you please use an overseas server?

imWildCat avatar Jan 16 '19 01:01 imWildCat

Thanks, I'll try.

ericxsun avatar Jan 16 '19 01:01 ericxsun

If I want to use it in mainland China, do I just write a number of providers in scylla/providers and rebuild? Is that right?

But when I added one like the following:

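# Imports assumed to match the other modules in scylla/providers
import re

from requests_html import HTML

from scylla.database import ProxyIP
from .base_provider import BaseProvider


class CNProxyComProvider(BaseProvider):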

    def urls(self) -> [str]:
        return [
            'https://cn-proxy.com/',
            'https://cn-proxy.com/archives/218'
        ]

    def parse(self, html: HTML) -> [ProxyIP]:
        ip_list: [ProxyIP] = []

        for ip_row in html.find('table tbody tr'):
            ip_element = ip_row.find('td:nth-child(1)', first=True)
            port_element = ip_row.find('td:nth-child(2)', first=True)

            try:
                if ip_element and port_element:
                    ip = re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', ip_element.text).group(0)
                    port = re.search(r'\d{2,5}', port_element.text).group(0)
                    p = ProxyIP(ip=ip, port=port)

                    ip_list.append(p)
            except AttributeError:
                pass

        return ip_list

    @staticmethod
    def should_render_js() -> bool:
        return False

and added it to __init__.py:

from .cn_proxy_com_provider import CNProxyComProvider
all_providers = [
    CNProxyComProvider,
    A2uProvider,
    ....
]

It does not crawl any proxies, although I can parse some proxies with requests_html on the command line. Log of scylla:

2019-01-18 - 18:46:54 DEBUG: create new db connection
2019-01-18 - 18:46:55 INFO: Scheduler starts...
2019-01-18 - 18:46:55 DEBUG: feed 9 providers...
2019-01-18 - 18:46:55 INFO: Start python scheduler
2019-01-18 - 18:46:55 INFO: worker_process started
2019-01-18 - 18:46:55 INFO: validator_thread started
2019-01-18 - 18:46:55 DEBUG: fetch_ips...
2019-01-18 - 18:46:55 DEBUG: Get a provider from the provider queue: CNProxyComProvider
2019-01-18 - 18:46:55 INFO: Start the web server
[2019-01-18 18:46:55 +0800] [97416] [INFO] Goin' Fast @ http://0.0.0.0:8899
2019-01-18 - 18:46:55 INFO: Start forward proxy server on port 8081
[2019-01-18 18:46:55 +0800] [97416] [INFO] Starting worker [97416]
2019-01-18 - 18:47:55 DEBUG: Feed 0 proxies from the database for a second time validation
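
A minimal sketch for checking the parsing step in isolation, assuming the page keeps the IP in the first table column and the port in the second, as the selectors in the provider above do. It uses only the standard library instead of requests_html, and the sample HTML and the names RowCollector and extract_proxies are hypothetical, not part of scylla. If this step returns rows but the log still shows 0 proxies, the fetch or validation stage is probably the next place to look.

```python
import re
from html.parser import HTMLParser

# Same patterns as in the provider's parse() method.
IP_RE = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
PORT_RE = re.compile(r'\d{2,5}')


class RowCollector(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._in_cell = True
            self._row.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a <td> cell.
        if self._in_cell and self._row:
            self._row[-1] += data


def extract_proxies(html_text):
    """Return (ip, port) tuples from rows whose first two cells match the patterns."""
    collector = RowCollector()
    collector.feed(html_text)
    proxies = []
    for cells in collector.rows:
        if len(cells) < 2:
            continue  # header rows (<th>) and short rows are skipped
        ip_match = IP_RE.search(cells[0])
        port_match = PORT_RE.search(cells[1])
        if ip_match and port_match:
            proxies.append((ip_match.group(0), int(port_match.group(0))))
    return proxies


# Made-up sample data using documentation-reserved address ranges.
SAMPLE_HTML = """
<table><tbody>
<tr><td>203.0.113.7</td><td>8080</td><td>CN</td></tr>
<tr><td>not an address</td><td>80</td></tr>
<tr><td>198.51.100.23</td><td> 3128 </td></tr>
</tbody></table>
"""

if __name__ == '__main__':
    print(extract_proxies(SAMPLE_HTML))
```

Unlike the 'table tbody tr' selector, this sketch looks at every tr element, so it is a little more permissive than the provider code.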

ericxsun avatar Jan 18 '19 10:01 ericxsun

I have the same problem on Windows Server 2016.

ccfleaf avatar Jun 21 '21 03:06 ccfleaf