selenium-wire Cannot start new thread

Open jeromegallego68 opened this issue 2 years ago • 17 comments

Hi,

I'm running selenium-wire in a Docker container and I'm facing this error: "RuntimeError: can't start new thread". I'm trying to scrape 20,000 URLs, and after almost 2,000 URLs it raises this error.

I hope you can give me some clue on how to fix this problem!

If you need anything else to understand the problem, do not hesitate to ask!

2022-05-19T15:55:35.471Z 2022-05-19 15:55:35 [seleniumwire.server] DEBUG: 127.0.0.1:50548: HTTP2 Event from client
2022-05-19T15:55:35.471Z   -> <SettingsAcknowledged changed_settings:{ChangedSetting(setting=SettingCodes.INITIAL_WINDOW_SIZE, original_value=65535, new_value=1048576), ChangedSetting(setting=SettingCodes.MAX_CONCURRENT_STREAMS, original_value=100, new_value=100), ChangedSetting(setting=SettingCodes.MAX_HEADER_LIST_SIZE, original_value=65536, new_value=65536)}>
2022-05-19T15:55:35.472Z 2022-05-19 15:55:35 [seleniumwire.server] DEBUG: 127.0.0.1:50548: HTTP2 Event from client
2022-05-19T15:55:35.472Z   -> <SettingsAcknowledged changed_settings:{}>
2022-05-19T15:55:35.472Z 2022-05-19 15:55:35 [seleniumwire.server] DEBUG: 127.0.0.1:50550: ALPN for client: b'h2'
2022-05-19T15:55:35.482Z ----------------------------------------
2022-05-19T15:55:35.482Z Error in processing of request from ('127.0.0.1', 50582)
2022-05-19T15:55:35.482Z Traceback (most recent call last):
2022-05-19T15:55:35.482Z   File "/usr/local/lib/python3.8/site-packages/seleniumwire/thirdparty/mitmproxy/net/tcp.py", line 639, in serve_forever
2022-05-19T15:55:35.482Z     t.start()
2022-05-19T15:55:35.482Z   File "/usr/local/lib/python3.8/threading.py", line 852, in start
2022-05-19T15:55:35.482Z     _start_new_thread(self._bootstrap, ())
2022-05-19T15:55:35.482Z RuntimeError: can't start new thread
2022-05-19T15:55:35.482Z ----------------------------------------
2022-05-19T15:55:35.482Z 2022-05-19 15:55:35 [seleniumwire.server] DEBUG: 127.0.0.1:50550: Failed to send error response to client: ClientHandshakeException('Cannot establish TLS with client (sni: fr.realadvisor.com): TlsException("SSL handshake error: Error([(\'SSL routines\', \'\', \'sslv3 alert certificate unknown\')])")')
2022-05-19T15:55:35.482Z 2022-05-19 15:55:35 [seleniumwire.server] DEBUG: 127.0.0.1:50550: serverdisconnect
2022-05-19T15:55:35.482Z   -> ('fr.realadvisor.com', 443)
2022-05-19T15:55:35.487Z ----------------------------------------
2022-05-19T15:55:35.487Z Error in processing of request from ('127.0.0.1', 50584)
2022-05-19T15:55:35.487Z Traceback (most recent call last):
2022-05-19T15:55:35.487Z   File "/usr/local/lib/python3.8/site-packages/seleniumwire/thirdparty/mitmproxy/net/tcp.py", line 639, in serve_forever
2022-05-19T15:55:35.487Z     t.start()
2022-05-19T15:55:35.487Z   File "/usr/local/lib/python3.8/threading.py", line 852, in start
2022-05-19T15:55:35.487Z     _start_new_thread(self._bootstrap, ())
2022-05-19T15:55:35.487Z RuntimeError: can't start new thread
2022-05-19T15:55:35.487Z ----------------------------------------

May 23 '22 13:05 jeromegallego68

I would guess that you've maxed out the number of threads that Python can support in a single process. Can you share the code you're using to initialise the webdriver and iterate through the list of URLs?

May 23 '22 21:05 wkeeling

Hi, thank you for your quick answer!

I combine Selenium with Scrapy. Here is my Scrapy middleware, where I initialize the WebDriver to get the content of the pages.

Basically, for every URL I want to scrape, the "process_request" function is called and a new Chrome instance is initialized. I did it this way to renew the proxy for each URL, to avoid being blocked by the website.

import random
import logging

from scrapy.http import HtmlResponse

import seleniumwire.undetected_chromedriver.v2 as uc


class SeleniumMiddleware:
    # Scrapy middleware handling the requests using selenium

    def __init__(self, settings):

        self.iteration = 0

        self.name = 'RandomProxyUAWithSelenium'

        self.proxy_list = settings.get('PROXY_LIST')
        self.ua_list = settings.get('UA_LIST')

        self.proxies = []
        self.uas = []

        fin_proxy = open(self.proxy_list)
        try:
            for line in fin_proxy.readlines():
                self.proxies.append(line)
        finally:
            fin_proxy.close()

        fin_ua = open(self.ua_list)
        try:
            for line in fin_ua.readlines():
                self.uas.append(line.strip())
        finally:
            fin_ua.close()

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls(crawler.settings)
        return middleware

    def process_request(self, request, spider):
        # Process a request using the selenium driver

        options = uc.ChromeOptions()

        options.add_argument("--headless")
        options.add_argument('--disable-gpu')
        options.add_argument('--disable-dev-shm-usage')

        options.add_argument(f'--user-agent={self.get_random_ua()}')
        # disable popups on startup
        options.add_argument('--no-first-run')
        options.add_argument('--no-service-autorun')
        options.add_argument('--no-default-browser-check')
        options.add_argument('--password-store=basic')

        options.add_argument('--no-proxy-server')

        random_proxy = self.get_random_proxy()

        seleniumwire_options = {
            # 'proxy': {
            #     'http': random_proxy,
            #     'https': random_proxy,
            # }
        }

        driver = uc.Chrome(
            options=options)
        # driver = uc.Chrome(
        #     options=options, seleniumwire_options=seleniumwire_options)

        # def request_interceptor(request):
        #     # Block PNG, JPEG, JPG, WEBP and GIF images
        #     # Block JS
        #     if request.path.endswith(('.png', '.jpg', '.jpeg', '.gif', '.webp', '.js')):
        #         request.abort()

        # driver.request_interceptor = request_interceptor

        driver.get(request.url)

        self.iteration += 1
        logging.debug('[WOUAHOME SCRAPING] lodgment n° :  %s' % self.iteration)
        logging.debug('[WOUAHOME SCRAPING] lodgment url :  %s' % request.url)

        for cookie_name, cookie_value in request.cookies.items():
            driver.add_cookie(
                {
                    'name': cookie_name,
                    'value': cookie_value
                }
            )

        body = str.encode(driver.page_source)

        # Expose the driver via the "meta" attribute
        request.meta.update({'driver': driver, })

        return HtmlResponse(
            url=driver.current_url,
            body=body,
            encoding='utf-8',
            request=request
        )

    def get_random_ua(self):
        return random.choice(self.uas)

    def get_random_proxy(self):
        random_index = random.randint(0, len(self.proxies)-1)
        self.current_index = random_index
        return self.proxies[random_index]
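
A note on the middleware above: every call to process_request creates a new uc.Chrome, and nothing in the posted code calls driver.quit(). One possible cause of the thread growth, which is not confirmed anywhere in this thread, is that each driver that is never quit leaves its browser process and proxy threads alive. The following is a minimal sketch of a cleanup pattern under that assumption. The names `managed_driver`, `fetch_body` and `FakeDriver` are hypothetical and only for illustration; `FakeDriver` stands in for a real webdriver so the sketch runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() on exit.

    `managed_driver` and `factory` are hypothetical names for illustration.
    """
    driver = factory()
    try:
        yield driver
    finally:
        # quit() is the standard Selenium WebDriver shutdown call.
        driver.quit()


class FakeDriver:
    """Stand-in for a real webdriver so the sketch runs without a browser."""

    def __init__(self):
        self.closed = False
        self.url = None
        self.page_source = "<html></html>"

    def get(self, url):
        self.url = url

    def quit(self):
        self.closed = True


def fetch_body(factory, url):
    # Read everything needed from the driver before it is shut down.
    with managed_driver(factory) as driver:
        driver.get(url)
        return driver.page_source.encode("utf-8")
```

In the middleware, the factory would be something like `lambda: uc.Chrome(options=options)`. The posted code also stores the driver in request.meta, so quitting inside process_request only works if nothing downstream uses that driver; otherwise the quit would have to happen after the spider callback has finished with it.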

May 24 '22 10:05 jeromegallego68

print("active thread count =", threading.active_count())
print('RAM memory % used:', psutil.virtual_memory()[2])

Check these values after every request and you will see both the used memory and the active thread count keep increasing.

I have the same error: after 90-100 requests I get "can't open new thread".
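
The check described above can be wrapped in a small helper. This is only a sketch using the standard library: `thread_report` and `demo` are hypothetical names, and the psutil memory reading from the comment is left out so the example runs without third-party packages. The demo starts one deliberately blocked thread to show how a thread that never finishes appears in the count.

```python
import threading


def thread_report():
    """Return (active thread count, sorted thread names) for this process."""
    names = sorted(t.name for t in threading.enumerate())
    return threading.active_count(), names


def demo():
    before, _ = thread_report()

    # A thread that blocks until told to stop, standing in for a leaked one.
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, name="leaky-worker")
    worker.start()
    during, names = thread_report()

    # Once the thread is released and joined, the count drops back.
    stop.set()
    worker.join()
    after, _ = thread_report()
    return before, during, after, names
```

Calling `thread_report()` after each scraped URL and logging the result would show whether the count returns to its baseline or keeps climbing, and the thread names may hint at which component is creating the threads.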

Jun 08 '22 14:06 muhendis80

@jeromegallego68 I have the same issue.. did you find a solution?

Oct 29 '22 22:10 adirzoari

No, the same issue continues. The new version of selenium-wire gives another error.

Oct 30 '22 05:10 muhendis80

Has anyone found a solution to the problem?

Jan 13 '23 06:01 deedy5

Sorry, No

Jan 13 '23 07:01 muhendis80

I am working on a possible threading issue also. I am embedding Python in an ASP.NET Framework API and C# using Python.NET and attempting to use the selenium-wire library. ASP.NET is inherently multi-threaded and Python is initialized in the Global.asax.cs Application_Start() which is ostensibly equivalent to the main() method.

I started a thread with my question: https://github.com/wkeeling/selenium-wire/issues/653

Feb 14 '23 16:02 calebTree

I encountered the same problem. Does anyone have any solution or idea? @wkeeling

Feb 28 '23 14:02 donggoing

I did not find a solution

Feb 28 '23 15:02 muhendis80

I have the same error: after 90-100 requests I get "can't open new thread".

Similar observation. After 90-100 requests I can't open a new Chrome instance and get the error "from chrome not reachable".

Feb 28 '23 15:02 donggoing

Me too. Newly opened threads do not terminate, and the memory fills up. Closing and reopening Chrome doesn't help.

Feb 28 '23 15:02 muhendis80

I just need a simple way to use proxy with undetected_chromedriver, if this package doesn't work, I may use seleniumbase.

Feb 28 '23 15:02 donggoing

I just need a simple way to use proxy with undetected_chromedriver, if this package doesn't work, I may use seleniumbase.

Same boat here, what did you end up doing?

Aug 19 '23 05:08 ZacharyHampton

I just need a simple way to use proxy with undetected_chromedriver, if this package doesn't work, I may use seleniumbase.

Same boat here, what did you end up doing?

Use seleniumbase

Aug 19 '23 07:08 donggoing

I just need a simple way to use proxy with undetected_chromedriver, if this package doesn't work, I may use seleniumbase.

Same boat here, what did you end up doing?

Use seleniumbase

Were you still able to use undetected-chromedriver with seleniumbase?

Aug 20 '23 00:08 ZacharyHampton

Never mind, I see it's integrated already. Thanks for the info.

Aug 20 '23 01:08 ZacharyHampton
