selenium-wire
Loading website 8x slower using multi-thread
I am using an AWS EC2 instance (Ubuntu) for scraping. Scraping has been 5x slower since I switched to selenium-wire, and using multiple threads makes it even worse. The time to scrape one webpage went from 0.8 sec (selenium) to 5.7 sec (selenium-wire) to 40 sec (selenium-wire + multi-threading). I am not sure how to identify the issue, and I have already tried the options listed in a previous issue with no luck. Could you help me with this? Here is my code:
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager

sw_options = {
    'connection_keep_alive': True,
    'disable_capture': True,
    'disable_encoding': True
}

# Create the options object before adding arguments to it
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--disable-dev-shm-usage")

browser = webdriver.Chrome(options=chrome_options,
                           seleniumwire_options=sw_options,
                           executable_path=ChromeDriverManager().install())
Versions: Python 3.6.9, Selenium 3.141.0, Selenium Wire 4.2.4, ChromeDriver 90.0.4430.24
Thank you so much!
Selenium Wire works by routing traffic through an internal proxy it spins up in the background. The additional I/O overhead of that means that Selenium Wire will always run more slowly than pure Selenium. That said, 5.7 seconds does seem slow for a single webpage. Are you using an upstream proxy server?
One thing you could do to try and improve the performance is to abort requests you're not interested in, e.g. image requests. That would prevent them from passing through Selenium Wire and would reduce overall I/O.
For that you'd need to switch disable_capture to False and configure a request interceptor to abort the requests, e.g.
sw_options = {
    'connection_keep_alive': True,
    'disable_capture': False,
    'disable_encoding': True
}

# Create the options object before adding arguments to it
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--disable-dev-shm-usage")

browser = webdriver.Chrome(options=chrome_options,
                           seleniumwire_options=sw_options,
                           executable_path=ChromeDriverManager().install())

def interceptor(request):
    # Abort image requests so they never pass through Selenium Wire
    if request.path.endswith(('.png', '.jpg', '.gif')):
        request.abort()

# Attach the interceptor to the driver created above
browser.request_interceptor = interceptor
You could get creative with the interceptor and abort other types of requests too. The above example assumes that image paths end with particular file extensions, but it may be necessary to tweak that for the sites you are scraping.
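As a minimal sketch of what a broader interceptor could look like: the extension list and the helper name should_abort below are illustrative assumptions, not part of Selenium Wire. Only request.path and request.abort() are taken from the example above. The path is run through urlsplit defensively so that a trailing query string does not hide the extension.

```python
from urllib.parse import urlsplit

# Hypothetical list of extensions to block; adjust it for the sites being scraped.
BLOCKED_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.gif', '.svg',
                      '.woff', '.woff2', '.ttf')


def should_abort(path):
    """Return True if the path, ignoring case and any query string,
    ends with one of the blocked extensions."""
    clean_path = urlsplit(path).path.lower()
    return clean_path.endswith(BLOCKED_EXTENSIONS)


def interceptor(request):
    # request.path and request.abort() are the same calls used in the
    # example above; everything else here is an assumption.
    if should_abort(request.path):
        request.abort()

# To use it, assign the function to the driver's request_interceptor
# attribute, as in the example above.
```

Keeping the matching rule in its own function makes it possible to check the rule against a handful of sample paths without starting a browser.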
Sorry, I accidentally hit a key and closed this issue... Thanks for the quick response! I have seen many unrelated requests in the process, so I will definitely try this out. But the main issue I am facing is that using multiple threads makes the process 8x slower than single-threaded selenium-wire. I am not using a proxy yet.
I rearranged my code as follows:
import time
from concurrent.futures import ThreadPoolExecutor
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager

def get_driver():
    sw_options = {
        'connection_keep_alive': True,
        'disable_capture': True,
        'disable_encoding': True
    }
    # Create the options object before adding arguments to it
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--disable-dev-shm-usage")
    browser = webdriver.Chrome(options=chrome_options,
                               seleniumwire_options=sw_options,
                               executable_path=ChromeDriverManager().install())
    return browser

def scrape_a_page(url):
    # Each call creates its own driver; the timer starts after the driver is up
    driver = get_driver()
    start_time = time.time()
    driver.get(url)
    # find_elements and scrape
    print(f"Scraping page {url} time: {time.time() - start_time}")

with ThreadPoolExecutor(max_workers=num_thread) as executor:
    results = executor.map(scrape_a_page, [url1, url2, ...])
I made a quick update to abort image requests but still face the same issue:
num_thread = 1
Scraping page url1 time: 5.0881
num_thread = 4
Scraping page url1 time: 19.62
Scraping page url1 time: 38.56
Scraping page url1 time: 38.98
Scraping page url1 time: 40.23
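For anyone trying to reproduce these numbers, here is a rough sketch of a timing harness that records both the per-page time and the total wall-clock time for each thread count. The helper names timed_call and compare_thread_counts are made up for illustration, and the time.sleep task is only a stand-in for the real scraping function. It does not explain the slowdown; it only makes the single-thread and multi-thread runs easier to compare side by side.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(task, item):
    """Run task(item) and return (item, elapsed_seconds)."""
    start = time.perf_counter()
    task(item)
    return item, time.perf_counter() - start


def compare_thread_counts(task, items, thread_counts=(1, 4)):
    """Return {num_threads: (wall_clock_seconds, [per_item_seconds, ...])}."""
    report = {}
    for n in thread_counts:
        wall_start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as executor:
            timings = list(executor.map(lambda item: timed_call(task, item), items))
        wall = time.perf_counter() - wall_start
        report[n] = (wall, [elapsed for _, elapsed in timings])
    return report


if __name__ == "__main__":
    # Stand-in task (an assumption); replace with the real scraping function.
    def fake_scrape(url):
        time.sleep(0.05)

    report = compare_thread_counts(fake_scrape, ["url1", "url2", "url3", "url4"])
    for n, (wall, per_item) in report.items():
        print(f"num_thread = {n}: wall {wall:.2f}s, slowest page {max(per_item):.2f}s")
```

If the per-page time grows with the thread count while the wall-clock total stays roughly flat, the threads are probably contending for something shared rather than running in parallel.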
Yes, it is strange that it is as slow as that. Are you definitely using version 4.2.4 of Selenium Wire? The reason I ask is that I notice you've specified connection_keep_alive in your options. That option was removed a while back, so I wonder whether you'd previously been using an older Selenium Wire version? The older versions did run more slowly.
Yes, I can confirm I am using version 4.2.4. I probably found that option somewhere in a discussion thread and decided to give it a try.