Selenium automation page load extremely slow for Premium posts
Nice work, thanks for this library. I'm hitting a performance issue: pages take forever to load in Selenium / Edge, but they do load eventually. Here are my stats after nearly an hour of running:
4%|███████▉ | 16/387 [43:24<18:02:09, 175.01s/it]
Line 434 (self.driver.get(url)) is the call that takes forever. The automated Edge window just shows a spinning wheel on the tab, and the page apparently does finish loading eventually.
Do you think Substack has implemented anti-bot measures? I've tested my network connection: it's very fast, and the same pages load fine with BeautifulSoup. Apologies if someone's raised this already.
Hey @reidben - are you running a MacBook with Apple silicon? It seems like this might be a slowdown because it's using an Intel/x64 version of Edge.
I've had a bit of success switching it out for the Chrome (arm64) driver in substack_scraper.py. To do this, you need Chrome installed and the corresponding chromedriver binary in /usr/local/bin (download from: https://developer.chrome.com/docs/chromedriver/downloads).
Just in case it helps - I only started using this project yesterday and am not familiar enough to offer this up as a proper solution!
- options = EdgeOptions()
- if headless:
- options.add_argument("--headless")
- if edge_path:
- options.binary_location = edge_path
- if user_agent:
- options.add_argument(f'user-agent={user_agent}') # Pass this if running headless and blocked by captcha
-
- if edge_driver_path:
- service = Service(executable_path=edge_driver_path)
- else:
- service = Service(EdgeChromiumDriverManager().install())
+ # options = EdgeOptions()
+ # if headless:
+ # options.add_argument("--headless")
+ # if edge_path:
+ # options.binary_location = edge_path
+ # if user_agent:
+ # options.add_argument(f'user-agent={user_agent}') # Pass this if running headless and blocked by captcha
+ #
+ # if edge_driver_path:
+ # service = Service(executable_path=edge_driver_path)
+ # else:
+ # service = Service(EdgeChromiumDriverManager().install())
+ #
+ # self.driver = webdriver.Edge(service=service, options=options)
+
+ options = webdriver.ChromeOptions()
+ self.driver = webdriver.Chrome(options=options)
- self.driver = webdriver.Edge(service=service, options=options)
self.login()
Thank you @jontutcher - fantastic suggestion - the speed difference is unbelievable when using the Chrome-based driver.
For Chrome version 134, you don't actually need to download ChromeDriver manually anymore. Since Chrome 115, Google has published matching ChromeDriver builds through Chrome for Testing.
The good news is that Selenium 4.6 and later can automatically download the ChromeDriver version that matches your installed Chrome. This is handled by Selenium Manager, which ships with Selenium itself; webdriver_manager is a separate third-party package and isn't needed for this.
@timf34 - consider using
from selenium.webdriver.chrome.options import Options as ChromeOptions
and updating the tool logic accordingly.
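For what it's worth, here is a rough sketch of how that could look. It is not tested against substack_scraper.py, and the names pick_browser and create_driver are made up for illustration. Defaulting to Chrome on Apple silicon is my assumption based on the slowdown reported above, and the sketch relies on Selenium 4.6+ so that no driver path has to be passed.

```python
import platform


def pick_browser(system=None, machine=None):
    """Choose a browser name (hypothetical helper, not part of the project).

    Assumption: prefer Chrome on Apple silicon, where the Edge setup in this
    thread was very slow, and keep Edge everywhere else.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    # Note: a Python interpreter running under Rosetta reports "x86_64",
    # so this check would miss that case.
    if system == "Darwin" and machine == "arm64":
        return "chrome"
    return "edge"


def create_driver(browser, headless=False, user_agent=""):
    """Create a WebDriver for the chosen browser (hypothetical helper)."""
    # Imported lazily so pick_browser() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options as ChromeOptions
    from selenium.webdriver.edge.options import Options as EdgeOptions

    options = ChromeOptions() if browser == "chrome" else EdgeOptions()
    if headless:
        options.add_argument("--headless")
    if user_agent:
        # Same flag the existing Edge code passes.
        options.add_argument(f"user-agent={user_agent}")

    # No Service or executable path is given, so Selenium 4.6+ resolves a
    # matching driver binary through Selenium Manager.
    if browser == "chrome":
        return webdriver.Chrome(options=options)
    return webdriver.Edge(options=options)
```

Usage would be something like driver = create_driver(pick_browser(), headless=True), with the result assigned to self.driver before self.login() is called.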