Update caching strategy to link session data to proxy connections
Following PACER's confirmation that we should use a static IP address to crawl pages, I've reviewed the code and believe we can link cookies with proxy connections. The log_into_pacer method is the key for this refactor.
We'll need to refactor the code in the following ways:
- Add a new argument to the `ProxyPacerSession` class constructor: The current `ProxyPacerSession` constructor retrieves the proxy information from the settings. This limits the flexibility of the class, because it cannot be used for requests that need a different proxy than the one configured in settings. Adding a new optional proxy argument to the constructor removes that limitation and ensures that all traffic associated with a specific session is directed through the same proxy (we’ll use the same proxy that was used to create the cookie).
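A minimal sketch of what the constructor change could look like. `BaseSession` is a stand-in for the real parent class, and the `proxy` keyword, the `proxies` mapping, and the example address are assumptions made for illustration, not the existing code:

```python
from typing import Optional


class BaseSession:
    """Stand-in for the real parent session class (assumption)."""

    def __init__(self, username=None, password=None):
        self.username = username
        self.password = password
        self.proxies = {}


class ProxyPacerSession(BaseSession):
    def __init__(self, *args, proxy: Optional[str] = None, **kwargs):
        super().__init__(*args, **kwargs)
        # Keep the caller's proxy if one was given. When it is None,
        # selection is deferred to login().
        self.proxy = proxy
        if proxy:
            # Route this session's traffic through the given proxy.
            self.proxies = {"http": proxy}
```

With this shape, a caller that already holds a cookie-proxy pair can pass the stored proxy back in, for example `ProxyPacerSession(username="user", password="pass", proxy="http://proxy-1.example:9090")`.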
- Override the login method in `ProxyPacerSession`: This gives us more flexibility in proxy selection. If no proxy is provided during initialization, the overridden login method will select a suitable proxy before proceeding with the login process. Here's a code snippet demonstrating the override:

    def login(self):
        if not self.proxy:
            # Implement logic to select a suitable proxy
            self.proxy = logic_to_select_a_proxy()
        return super().login()

- Tweak the log_into_pacer method:
  - Instead of just returning the cookie, we should return both the cookie and the selected proxy. This could be a dictionary or a tuple to hold both values.
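One way the new return value could look, sketched with a named tuple so callers can unpack it like a plain tuple or read fields by name. `SessionData`, `StubSession`, the `session_cls` parameter, and the placeholder cookie and proxy values are all hypothetical. The stub only exists so the example runs without network access:

```python
from typing import NamedTuple


class SessionData(NamedTuple):
    """Hypothetical container pairing a login cookie with its proxy."""

    cookies: dict
    proxy: str


class StubSession:
    """Stand-in for ProxyPacerSession; performs no real login."""

    def __init__(self, username, password, proxy=None):
        self.username = username
        self.password = password
        self.proxy = proxy
        self.cookies = {}

    def login(self):
        if not self.proxy:
            # Placeholder for the real proxy selection logic.
            self.proxy = "http://proxy-1.example:9090"
        # Placeholder for the cookie a real login would set.
        self.cookies = {"cookie_name": "fake-token"}


def log_into_pacer(username, password, session_cls=StubSession) -> SessionData:
    session = session_cls(username=username, password=password)
    session.login()
    # Return the cookie together with the proxy that produced it.
    return SessionData(cookies=session.cookies, proxy=session.proxy)
```

A named tuple keeps both options from the bullet above open: `cookies, proxy = log_into_pacer("user", "pass")` works, and `._asdict()` gives the dictionary form.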
- Store the `cookie-proxy` pair and update the caching strategy: Our caching strategy needs an update to accommodate the new cookie-proxy pairing. We currently retrieve the cached cookie by simply checking for a key and unpickling it. However, after the refactor, the cached data could be either:
  1. A cookie-proxy pair. This is the desired outcome: the login cookie together with the associated proxy information.
  2. Just the cookie (the existing approach).
For cached entries containing only the cookie (scenario 2), no proxy was recorded, so we are free to pick one at retrieval time. Here's the approach:
- We'll randomly choose a proxy from the available pool.
- We'll attempt to complete tasks using this newly selected proxy and the existing cookie.
- If a PacerLoginException occurs, it might indicate an issue with the chosen proxy.
- Since we retry those exceptions, the retrieval logic will have a chance to select a different proxy during the next attempt.

This way, we can handle existing cookies in the cache while transitioning to the new cookie-proxy pair approach.
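
The retrieval logic described above could be sketched roughly as follows. The in-memory `cache` dict stands in for the real cache backend, and `PROXY_POOL`, `get_cookie_and_proxy`, and the choice to store the pair as a two-item tuple are assumptions for illustration. The sketch also assumes a bare cached cookie is never itself a two-item tuple:

```python
import pickle
import random

# Hypothetical pool of available proxies.
PROXY_POOL = ["http://proxy-1.example:9090", "http://proxy-2.example:9090"]

# Stand-in for the real cache backend.
cache = {}


def get_cookie_and_proxy(cache_key):
    """Return (cookie, proxy) for a cached session, or None on a cache miss."""
    raw = cache.get(cache_key)
    if raw is None:
        return None
    data = pickle.loads(raw)
    if isinstance(data, tuple) and len(data) == 2:
        # New format: the cookie and the proxy it was created with.
        return data
    # Legacy format: only the cookie was cached, so pick a proxy at random.
    # If the pick fails with a login error, a retry reaches this branch
    # again and may draw a different proxy.
    return data, random.choice(PROXY_POOL)
```

Once all legacy entries have expired from the cache, the fallback branch becomes dead code and can be removed.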