Fixed Timeout in WebPageHelper Could Lead to Incomplete Data Retrieval
Description
In `utils.py`, the `WebPageHelper` class uses a fixed timeout of 4 seconds for all HTTP requests:

```python
res = self.httpx_client.get(url, timeout=4)
```

This fixed timeout can lead to problems with data retrieval, especially under varying network conditions and server response times.
Why this is problematic
- **Incomplete Data Retrieval:** A fixed 4-second timeout may be too short for some servers or under certain network conditions (e.g., satellite or mobile networks), leading to incomplete data retrieval and partial or missing information in the knowledge base. This may also be related to issue #88.
- **Inconsistent Performance:** The timeout doesn't account for variability in server response times. Some requests might fail unnecessarily, while others might take longer than needed.
- **Inefficient Resource Usage:** A fixed timeout doesn't allow resource usage to be optimized for the specific requirements of different requests or the current system load.
- **Poor Adaptability:** The current implementation doesn't adapt to changing network conditions or server responsiveness, which can lead to suboptimal performance in dynamic environments.
- **Potential Data Bias:** If certain types of content consistently take longer to retrieve, a fixed timeout could systematically exclude that content and inadvertently bias the collected data.
How it affects knowledge curation
- **Incomplete Knowledge Base:** Incomplete data retrieval can leave gaps in the knowledge base, affecting the quality and comprehensiveness of the curated information.
- **Unreliable Information Gathering:** Inconsistent retrieval of information can lead to unreliable or inconsistent knowledge curation results.
- **Reduced Efficiency:** Premature timeouts on slower but otherwise valid responses, and the retries they force, can significantly reduce the overall efficiency of the knowledge curation process.
Proposed Solution
Implement a more flexible and adaptive timeout strategy:
- **Dynamic Timeout:** Implement a dynamic timeout that adjusts based on factors such as:
  - The average response time of the server
  - The size of the expected response
  - The current network conditions
  - The importance or priority of the request
- **Retry Mechanism:** Implement a retry mechanism with exponential backoff for failed requests. This can help handle temporary network issues or server hiccups.
- **Timeout Configuration:** Allow the timeout to be configured through environment variables or a configuration file, so it can be adjusted without code changes (a minimal sketch follows this list).
- **Adaptive Timeout:** Implement an adaptive timeout system that learns from past request performance and adjusts accordingly (see the sketch after the example implementation below).
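As a rough illustration of the configuration idea, the timeout could be read from an environment variable; the variable name `STORM_WEBPAGE_TIMEOUT` and the 4-second fallback below are assumptions, not something that exists in the codebase today:

```python
import os

# Hypothetical environment variable; falls back to the current hard-coded 4 seconds.
DEFAULT_TIMEOUT = float(os.environ.get("STORM_WEBPAGE_TIMEOUT", "4"))

# The client would then use it as:
#   res = self.httpx_client.get(url, timeout=DEFAULT_TIMEOUT)
```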
Example Implementation
Here's a basic example of how this could be implemented:
```python
import backoff
import httpx


class WebPageHelper:
    def __init__(self, base_timeout=4, max_timeout=30):
        self.base_timeout = base_timeout
        self.max_timeout = max_timeout
        self.httpx_client = httpx.Client()

    @backoff.on_exception(backoff.expo, httpx.TimeoutException, max_time=300)
    def get_with_retry(self, url):
        # Double the base timeout, but cap it at the configured maximum.
        timeout = min(self.base_timeout * 2, self.max_timeout)
        return self.httpx_client.get(url, timeout=timeout)

    def download_webpage(self, url):
        try:
            res = self.get_with_retry(url)
            res.raise_for_status()  # Raises httpx.HTTPStatusError for 4xx/5xx responses.
            return res.content
        except httpx.HTTPError as exc:
            print(f"Error while requesting {exc.request.url!r} - {exc!r}")
            return None
```
This implementation uses a base timeout that can be doubled (up to a maximum limit) and includes a retry mechanism with exponential backoff.
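Going a step further, the adaptive variant could track how long each host typically takes and size the timeout from that history. The following is only a sketch under assumed parameters (the `AdaptiveTimeout` name, the EMA smoothing factor, and the 3x headroom multiplier are all illustrative):

```python
from urllib.parse import urlparse


class AdaptiveTimeout:
    """Per-host timeout suggestion based on an exponential moving average of response times."""

    def __init__(self, base_timeout=4.0, max_timeout=30.0, alpha=0.3):
        self.base_timeout = base_timeout
        self.max_timeout = max_timeout
        self.alpha = alpha  # EMA smoothing factor
        self.avg_response_time = {}  # host -> smoothed response time in seconds

    def timeout_for(self, url):
        host = urlparse(url).netloc
        avg = self.avg_response_time.get(host)
        if avg is None:
            return self.base_timeout
        # Allow roughly 3x the typical response time, bounded by base and max.
        return min(max(self.base_timeout, avg * 3), self.max_timeout)

    def record(self, url, elapsed_seconds):
        host = urlparse(url).netloc
        previous = self.avg_response_time.get(host, elapsed_seconds)
        self.avg_response_time[host] = (
            self.alpha * elapsed_seconds + (1 - self.alpha) * previous
        )
```

`WebPageHelper` could then call `timeout_for(url)` before each request and `record(url, elapsed)` afterwards, e.g. measuring the elapsed time with `time.monotonic()`.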
Action Items
- [ ] Implement a dynamic timeout mechanism in the `WebPageHelper` class
- [ ] Add a retry mechanism with exponential backoff for failed requests
- [ ] Make the timeout configurable through environment variables or a config file
- [ ] Update the documentation to reflect the new timeout behavior
- [ ] Add logging to track timeout-related issues and adjust the strategy if needed
@rmcc3 Thanks for bringing this up! Since we're retrieving from multiple websites simultaneously, a single failure from one website won't have a huge impact on the final quality. But your solution is quite reasonable as well. We could incorporate a single retry with a relaxed time constraint to mitigate this issue while not impacting the overall waiting time and experience.
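For reference, that single retry with a relaxed time constraint could look roughly like the sketch below; the helper name and the doubled timeout on the second attempt are assumptions, not an agreed design:

```python
import httpx


def get_with_single_retry(client: httpx.Client, url: str, timeout: float = 4.0) -> httpx.Response:
    """Try once with the normal timeout; on timeout, retry once with a relaxed limit."""
    try:
        return client.get(url, timeout=timeout)
    except httpx.TimeoutException:
        # Single retry with a relaxed time constraint (here: double the original timeout).
        return client.get(url, timeout=timeout * 2)
```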