selenium-wire Loading website 8x slower using multi-thread

I am using an AWS EC2 instance (Ubuntu) for scraping. It has been 5x slower since I switched to selenium-wire, and using multiple threads is even worse. Scraping one webpage went from 0.8 sec (selenium) -> 5.7 sec (selenium-wire) -> 40 sec (selenium-wire + multi-threading). I am not sure how to identify the issue, and I already tried the options listed in a previous issue but still no luck. Could you help me with this? Here is my code:

from seleniumwire import webdriver  # Selenium Wire's replacement for selenium.webdriver
from webdriver_manager.chrome import ChromeDriverManager

sw_options = {
    'connection_keep_alive': True,
    'disable_capture': True,
    'disable_encoding': True
}
# The options object must exist before arguments are added to it
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--disable-dev-shm-usage")
browser = webdriver.Chrome(options=chrome_options,
                           seleniumwire_options=sw_options,
                           executable_path=ChromeDriverManager().install())

Versions: python 3.6.9, selenium 3.141.0, selenium-wire 4.2.4, chrome-driver 90.0.4430.24

Thank you so much!

May 01 '21 12:05 mohsu

Selenium Wire works by routing traffic through an internal proxy it spins up in the background. The additional I/O overhead of that means that Selenium Wire will always run more slowly than pure Selenium. That said, 5.7 seconds does seem slow for a single webpage. Are you using an upstream proxy server?

One thing you could do to try and improve the performance is to abort requests you're not interested in, e.g. image requests. That would prevent them from passing through Selenium Wire and would reduce overall I/O.

For that you'd need to switch disable_capture to False and configure a request interceptor to abort the requests, e.g.

sw_options = {
    'connection_keep_alive': True,
    'disable_capture': False,  # capture must be enabled for the interceptor to run
    'disable_encoding': True
}
# The options object must exist before arguments are added to it
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--disable-dev-shm-usage")
browser = webdriver.Chrome(options=chrome_options,
                           seleniumwire_options=sw_options,
                           executable_path=ChromeDriverManager().install())

def interceptor(request):
    # Abort image requests so they never pass through Selenium Wire
    if request.path.endswith(('.png', '.jpg', '.gif')):
        request.abort()

# Attach the interceptor to the driver created above
browser.request_interceptor = interceptor

You could get creative with the interceptor and abort other types of requests too. The above example assumes that image paths end with particular file extensions, but it may be necessary to tweak that for the sites you are scraping.
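As a rough sketch of what "getting creative" might look like, the variant below matches extensions case-insensitively and ignores any query string on the path. The names BLOCKED_EXTENSIONS and should_abort are made up for illustration, and the extension list is only an assumption about what a scraper can usually do without; it relies on nothing beyond request.path and request.abort() from the example above.

```python
from urllib.parse import urlsplit

# Hypothetical list of extensions to skip; adjust for the sites being scraped.
BLOCKED_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.gif', '.svg', '.ico',
                      '.woff', '.woff2')


def should_abort(url_or_path):
    """Return True when the path component ends with a blocked extension."""
    # urlsplit drops the query string, so '/logo.png?v=3' still matches.
    path = urlsplit(url_or_path).path.lower()
    return path.endswith(BLOCKED_EXTENSIONS)


def interceptor(request):
    # `request` only needs a .path attribute and an .abort() method here.
    if should_abort(request.path):
        request.abort()

# Usage with an existing Selenium Wire driver:
# driver.request_interceptor = interceptor
```

Keeping the decision in a plain function makes it easy to check against a handful of real URLs from the target site before attaching the interceptor.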

May 01 '21 13:05 wkeeling

Sorry, I accidentally hit some key and closed this issue... Thanks for the quick response! I have seen many unrelated requests in the process, so I will definitely try this out. But the main issue I am facing is that using multiple threads makes the process 8x slower than single-threaded selenium-wire. I am not using a proxy yet.

I rearranged my code as follows:

import time
from concurrent.futures import ThreadPoolExecutor
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager

def get_driver():
  sw_options = {
      'connection_keep_alive': True,
      'disable_capture': True,
      'disable_encoding': True
  }
  # The options object must exist before arguments are added to it
  chrome_options = webdriver.ChromeOptions()
  chrome_options.add_argument("--headless")
  chrome_options.add_argument("--no-sandbox")
  chrome_options.add_argument("--disable-gpu")
  chrome_options.add_argument("--disable-dev-shm-usage")
  browser = webdriver.Chrome(options=chrome_options,
                             seleniumwire_options=sw_options,
                             executable_path=ChromeDriverManager().install())
  return browser

def scrape_a_page(url):
  # Each call starts its own browser; the timer below covers only the page load
  driver = get_driver()
  start_time = time.time()
  driver.get(url)
  # find_elements and scrape
  print(f"Scraping page {url} time: {time.time() - start_time}")

with ThreadPoolExecutor(max_workers=num_thread) as executor:
  results = executor.map(scrape_a_page, [url1, url2, ...])

I did a quick update to abort image requests but still face the same issue:

num_thread = 1
Scraping page url1 time: 5.0881

num_thread = 4
Scraping page url1 time: 19.62
Scraping page url1 time: 38.56
Scraping page url1 time: 38.98
Scraping page url1 time: 40.23

May 01 '21 23:05 mohsu

Yes it is strange that it is as slow as that. Are you definitely using version 4.2.4 of Selenium Wire? The reason I ask is that I notice you've specified connection_keep_alive in your options. That option was removed a while back, so I wonder whether you'd previously been using an older Selenium Wire version? The older versions did run more slowly.
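One way to double-check which version is actually installed in the environment running the scraper is sketched below. The helper name installed_version is made up for illustration. It uses importlib.metadata where available (Python 3.8+) and falls back to pkg_resources on older interpreters such as the 3.6.9 mentioned in this thread.

```python
def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Older Pythons (before 3.8): fall back to setuptools' pkg_resources.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints None when the package is not installed in this environment.
print(installed_version("selenium-wire"))
```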

May 03 '21 13:05 wkeeling

Yes, I can confirm I am using version 4.2.4. I probably found that option somewhere in a discussion thread and decided to give it a try.

May 03 '21 14:05 mohsu
