selenium-wire
Cannot start new thread
Hi,
I'm running selenium-wire in a Docker container and I'm hitting this error: "RuntimeError: can't start new thread". I'm trying to scrape 20,000 URLs, and after almost 2,000 URLs the error appears.
I'm hoping you can give me a clue about how to fix this problem.
If you need anything else to understand the problem, do not hesitate to ask.
2022-05-19T15:55:35.471Z 2022-05-19 15:55:35 [seleniumwire.server] DEBUG: 127.0.0.1:50548: HTTP2 Event from client
2022-05-19T15:55:35.471Z -> <SettingsAcknowledged changed_settings:{ChangedSetting(setting=SettingCodes.INITIAL_WINDOW_SIZE, original_value=65535, new_value=1048576), ChangedSetting(setting=SettingCodes.MAX_CONCURRENT_STREAMS, original_value=100, new_value=100), ChangedSetting(setting=SettingCodes.MAX_HEADER_LIST_SIZE, original_value=65536, new_value=65536)}>
2022-05-19T15:55:35.472Z 2022-05-19 15:55:35 [seleniumwire.server] DEBUG: 127.0.0.1:50548: HTTP2 Event from client
2022-05-19T15:55:35.472Z -> <SettingsAcknowledged changed_settings:{}>
2022-05-19T15:55:35.472Z 2022-05-19 15:55:35 [seleniumwire.server] DEBUG: 127.0.0.1:50550: ALPN for client: b'h2'
2022-05-19T15:55:35.482Z ----------------------------------------
2022-05-19T15:55:35.482Z Error in processing of request from ('127.0.0.1', 50582)
2022-05-19T15:55:35.482Z Traceback (most recent call last):
2022-05-19T15:55:35.482Z File "/usr/local/lib/python3.8/site-packages/seleniumwire/thirdparty/mitmproxy/net/tcp.py", line 639, in serve_forever
2022-05-19T15:55:35.482Z t.start()
2022-05-19T15:55:35.482Z File "/usr/local/lib/python3.8/threading.py", line 852, in start
2022-05-19T15:55:35.482Z _start_new_thread(self._bootstrap, ())
2022-05-19T15:55:35.482Z RuntimeError: can't start new thread
2022-05-19T15:55:35.482Z ----------------------------------------
2022-05-19T15:55:35.482Z 2022-05-19 15:55:35 [seleniumwire.server] DEBUG: 127.0.0.1:50550: Failed to send error response to client: ClientHandshakeException('Cannot establish TLS with client (sni: fr.realadvisor.com): TlsException("SSL handshake error: Error([(\'SSL routines\', \'\', \'sslv3 alert certificate unknown\')])")')
2022-05-19T15:55:35.482Z 2022-05-19 15:55:35 [seleniumwire.server] DEBUG: 127.0.0.1:50550: serverdisconnect
2022-05-19T15:55:35.482Z -> ('fr.realadvisor.com', 443)
2022-05-19T15:55:35.487Z ----------------------------------------
2022-05-19T15:55:35.487Z Error in processing of request from ('127.0.0.1', 50584)
2022-05-19T15:55:35.487Z Traceback (most recent call last):
2022-05-19T15:55:35.487Z File "/usr/local/lib/python3.8/site-packages/seleniumwire/thirdparty/mitmproxy/net/tcp.py", line 639, in serve_forever
2022-05-19T15:55:35.487Z t.start()
2022-05-19T15:55:35.487Z File "/usr/local/lib/python3.8/threading.py", line 852, in start
2022-05-19T15:55:35.487Z _start_new_thread(self._bootstrap, ())
2022-05-19T15:55:35.487Z RuntimeError: can't start new thread
2022-05-19T15:55:35.487Z ----------------------------------------
I would guess that you've maxed out the number of threads that Python can support in a single process. Can you share the code you're using to initialise the webdriver and iterate through the list of URLs?
Hi, thank you for your quick answer!
I combine Selenium with Scrapy. Here is my Scrapy middleware, where I initialize the WebDriver to get the content of the pages.
Basically, for every URL I want to scrape, the "process_request" function is called and a new Chrome instance is initialized. I did it this way to renew the proxy for each URL and avoid being blocked by the website.
import random
import logging

from scrapy.http import HtmlResponse
import seleniumwire.undetected_chromedriver.v2 as uc


class SeleniumMiddleware:
    # Scrapy middleware handling the requests using selenium

    def __init__(self, settings):
        self.iteration = 0
        self.name = 'RandomProxyUAWithSelenium'
        self.proxy_list = settings.get('PROXY_LIST')
        self.ua_list = settings.get('UA_LIST')
        self.proxies = []
        self.uas = []

        fin_proxy = open(self.proxy_list)
        try:
            for line in fin_proxy.readlines():
                self.proxies.append(line)
        finally:
            fin_proxy.close()

        fin_ua = open(self.ua_list)
        try:
            for line in fin_ua.readlines():
                self.uas.append(line.strip())
        finally:
            fin_ua.close()

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls(crawler.settings)
        return middleware

    def process_request(self, request, spider):
        # Process a request using the selenium driver
        options = uc.ChromeOptions()
        options.add_argument("--headless")
        options.add_argument('--disable-gpu')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument(f'--user-agent={self.get_random_ua()}')
        # disable popups on startup
        options.add_argument('--no-first-run')
        options.add_argument('--no-service-autorun')
        options.add_argument('--no-default-browser-check')
        options.add_argument('--password-store=basic')
        options.add_argument('--no-proxy-server')

        random_proxy = self.get_random_proxy()
        seleniumwire_options = {
            # 'proxy': {
            #     'http': random_proxy,
            #     'https': random_proxy,
            # }
        }

        driver = uc.Chrome(
            options=options)
        # driver = uc.Chrome(
        #     options=options, seleniumwire_options=seleniumwire_options)

        # def request_interceptor(request):
        #     # Block PNG, JPEG, JPG, WEBP and GIF images
        #     # Block JS
        #     if request.path.endswith(('.png', '.jpg', '.jpeg', '.gif', '.webp', '.js')):
        #         request.abort()
        # driver.request_interceptor = request_interceptor

        driver.get(request.url)
        self.iteration += 1
        logging.debug('[WOUAHOME SCRAPING] lodgment n° : %s' % self.iteration)
        logging.debug('[WOUAHOME SCRAPING] lodgment url : %s' % request.url)

        for cookie_name, cookie_value in request.cookies.items():
            driver.add_cookie(
                {
                    'name': cookie_name,
                    'value': cookie_value
                }
            )

        body = str.encode(driver.page_source)

        # Expose the driver via the "meta" attribute
        request.meta.update({'driver': driver, })

        return HtmlResponse(
            url=driver.current_url,
            body=body,
            encoding='utf-8',
            request=request
        )

    def get_random_ua(self):
        return random.choice(self.uas)

    def get_random_proxy(self):
        random_index = random.randint(0, len(self.proxies)-1)
        self.current_index = random_index
        return self.proxies[random_index]
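One thing worth noting about the middleware above: process_request creates a new Chrome instance for every URL but never calls driver.quit(), so each browser may stay alive together with whatever threads were started for it. As far as I know, selenium-wire also shuts down its per-driver proxy backend inside quit(). The sketch below shows one way to guarantee that cleanup with a context manager. It is only an illustration and not a confirmed fix for this issue; managed_driver, make_driver and fetch_page_source are hypothetical names, and the driver factory is passed in so the pattern can be tried without a browser.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def managed_driver(make_driver: Callable[[], object]) -> Iterator[object]:
    """Create a driver and always quit it, even if the page load fails.

    `make_driver` is a hypothetical zero-argument factory, for example
    `lambda: uc.Chrome(options=options)` in the middleware above.
    """
    driver = make_driver()
    try:
        yield driver
    finally:
        # quit() closes the browser so its resources can be released.
        driver.quit()


def fetch_page_source(make_driver: Callable[[], object], url: str) -> bytes:
    """Load `url`, copy the HTML out, then release the driver."""
    with managed_driver(make_driver) as driver:
        driver.get(url)
        # Read everything that is needed before the driver is closed.
        return driver.page_source.encode("utf-8")
```

With this pattern the driver can no longer be exposed through request.meta, because it is already closed when the response is returned; anything a spider needs from the browser would have to be extracted inside the with block.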
import threading
import psutil
print("active thread count =", threading.active_count())
print('RAM memory % used:', psutil.virtual_memory()[2])
Check these values after every request and you will see both the memory used and the active thread count increase.
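To go one step further than a raw count, the following sketch lists which threads appeared during a request and are still alive afterwards. It uses only the standard library; leaked_threads and the baseline argument are hypothetical names chosen for this example, and it assumes the leaking threads live in the same Python process as the scraper.

```python
import threading
from typing import List, Set


def leaked_threads(baseline: Set[threading.Thread]) -> List[threading.Thread]:
    """Return threads that are alive now but were not in `baseline`.

    `baseline` is a snapshot taken earlier with set(threading.enumerate()).
    Thread objects are compared rather than thread ids, because ids can be
    reused once a thread has exited.
    """
    return [t for t in threading.enumerate() if t not in baseline]
```

A possible way to use it: take baseline = set(threading.enumerate()) before handling a request, call leaked_threads(baseline) once the request is finished, and log the names of the returned threads. If the list keeps growing from one request to the next, the thread names should hint at which component is not shutting down.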
I have the same error: after 90-100 requests I get the same "can't start new thread" error.
@jeromegallego68 I have the same issue. Did you find a solution?
No, the issue persists. The new version of selenium-wire gives a different error.
Has anyone found a solution to the problem?
Sorry, no.
I am also working on a possible threading issue. I am embedding Python in an ASP.NET Framework API and C# using Python.NET and attempting to use the selenium-wire library. ASP.NET is inherently multi-threaded, and Python is initialized in the Global.asax.cs Application_Start(), which is ostensibly equivalent to the main() method.
I started a thread with my question: https://github.com/wkeeling/selenium-wire/issues/653
I encountered the same problem. Does anyone have any solution or idea? @wkeeling
I did not find a solution
I have the same error: after 90-100 requests I get the same "can't start new thread" error.
Similar observation. I can't open a new Chrome instance and get the error "chrome not reachable" after 90-100 requests.
Me too. Newly opened threads do not terminate on their own, and the memory fills up. Closing and reopening Chrome doesn't help.
I just need a simple way to use a proxy with undetected_chromedriver. If this package doesn't work, I may use seleniumbase.
Same boat here, what did you end up doing?
Use seleniumbase
Were you still able to use undetected-chromedriver with seleniumbase?
Never mind, I see it's integrated already. Thanks for the info.