Update caching strategy to link session data to proxy connections

Open ERosendo opened this issue 8 months ago • 1 comments

Following PACER's confirmation that we should use a static IP address to crawl pages, I've reviewed the code and believe we can link cookies with proxy connections. The log_into_pacer method is the key to this refactor.

We'll need to refactor the code in the following ways:

  • Add a new argument to the ProxyPacerSession class constructor:

    The current ProxyPacerSession class constructor retrieves the proxy information from the settings. This limits the flexibility of the class, as it cannot be used with requests that require a different proxy than the one configured in settings.

    We can achieve greater flexibility by adding a new optional proxy argument to the constructor. This enhancement ensures that all traffic associated with a specific session is directed through the same proxy (we’ll use the same proxy used to create the cookie).

  • Override the login method in the ProxyPacerSession:

    This will give us more flexibility in proxy selection. If no proxy is provided during initialization, the overridden login method will select a suitable proxy before proceeding with the login process. Here's a code snippet demonstrating the override:

    def login(self):
        if not self.proxy:
            # Placeholder: implement logic to select a suitable proxy
            self.proxy = logic_to_select_a_proxy()
        return super().login()

  • Tweak the log_into_pacer method:

    • Instead of returning only the cookie, we should return both the cookie and the selected proxy. The return value could be a dictionary or a tuple holding both values.
  • Store the cookie-proxy pair and update the caching strategy:

    Our caching strategy needs an update to accommodate the new cookie-proxy pairing. We currently retrieve the cached cookie by simply checking for a key and unpickling it. However, with the refactoring, the cached data could be either:

    1. A cookie-proxy pair (the desired outcome), containing the login cookie and the associated proxy information.
    2. Just the cookie (existing approach)

    For cached entries containing only the cookie (scenario 2), no proxy is recorded, so we are free to pick one. Here's the approach:

    1. We'll randomly choose a proxy from the available pool.
    2. We'll attempt to complete tasks using this newly selected proxy and the existing cookie.
    3. If a PacerLoginException occurs, it might indicate an issue with the chosen proxy.
    4. Since we retry those exceptions, the retrieval logic will have a chance to select a different proxy during the next attempt.
    5. This way, we can handle existing cookies in the cache while transitioning to the new cookie-proxy pair approach.
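
The refactor described above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual CourtListener or juriscraper code: SketchProxySession stands in for ProxyPacerSession, and PROXY_POOL, get_session_data, the in-memory dict used as a cache, and the tuple layout of the cached value are all assumptions made for the example. The real login would talk to PACER through the proxy; here it only fabricates a cookie so the flow can be followed end to end.

```python
import pickle
import random
from typing import Optional, Tuple

# Hypothetical proxy pool; the real list would presumably come from settings.
PROXY_POOL = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]


class SketchProxySession:
    """Stand-in for ProxyPacerSession; it performs no network I/O."""

    def __init__(self, username: str, password: str, proxy: Optional[str] = None):
        self.username = username
        self.password = password
        self.proxy = proxy  # None means "choose a proxy at login time"
        self.cookies: dict = {}

    def login(self) -> None:
        if not self.proxy:
            # Assumed selection logic: pick any proxy from the pool.
            self.proxy = random.choice(PROXY_POOL)
        # A real session would authenticate against PACER through self.proxy.
        self.cookies = {"session": f"cookie-for-{self.username}"}


def log_into_pacer(
    username: str, password: str, proxy: Optional[str] = None
) -> Tuple[dict, str]:
    """Log in and return the (cookies, proxy) pair instead of only the cookies."""
    session = SketchProxySession(username, password, proxy=proxy)
    session.login()
    return session.cookies, session.proxy


def get_session_data(
    cache: dict, key: str, username: str, password: str
) -> Tuple[dict, str]:
    """Return (cookies, proxy), accepting both cached formats."""
    raw = cache.get(key)
    if raw is None:
        # Nothing cached: log in and store the new cookie-proxy pair.
        pair = log_into_pacer(username, password)
        cache[key] = pickle.dumps(pair)
        return pair
    cached = pickle.loads(raw)
    if isinstance(cached, tuple):
        # Scenario 1: the entry already holds a cookie-proxy pair.
        return cached
    # Scenario 2: legacy entry holding only the cookie. Pick a random proxy;
    # if the caller hits a login error and retries, this branch runs again
    # and may draw a different proxy.
    return cached, random.choice(PROXY_POOL)
```

In this sketch a legacy entry is never rewritten in place, so it keeps drawing a random proxy until it expires or is replaced by a fresh login. Whether to upgrade such entries once a proxy is known to work is a design choice the issue leaves open.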

ERosendo, May 30 '24 15:05