[BUG] SearchTool Provides Irrelevant Results in Production (HK) vs. Development (CN)

Open iziang opened this issue 3 months ago • 2 comments

Description

A critical issue has been identified where the SearchTool (which appears to use Bing search under the hood) behaves inconsistently between our local development environment (Hangzhou, Mainland China) and our production deployment (Hong Kong).

When a query is made, the development environment retrieves highly relevant search results, leading to correct RAG outputs. However, the exact same query in the production environment retrieves completely irrelevant, junk-like results, which severely degrades the quality and accuracy of the final generated answer. This makes the RAG pipeline unreliable in production.

This is a high-priority bug as it fundamentally breaks the retrieval mechanism of ApeRAG in certain common deployment regions.

Environment Discrepancy

| Environment | Location | Observed Behavior | Result Quality |
| --- | --- | --- | --- |
| Development | Hangzhou, Mainland China | The underlying search request to bing.com is redirected to cn.bing.com, which serves correct, localized results. | Excellent |
| Production | Hong Kong | The search request hits bing.com's global endpoint directly, which returns completely irrelevant results (e.g., Japanese financial data for a weather query). | Critical Failure |

Steps to Reproduce

The underlying network behavior can be replicated without running the full ApeRAG stack, using cURL to simulate the HTTP requests from the two locations.

  1. Simulate Production (from a Hong Kong server):

    # Query for "拉斯维加斯天气" (Las Vegas weather)
    curl -vL "https://www.bing.com/search?q=%E6%8B%89%E6%96%AF%E7%BB%B4%E5%8A%A0%E6%96%AF%E5%A4%A9%E6%B0%94"
    

    Result: The HTML returned is for unrelated Japanese financial products.

  2. Simulate Development (from a Mainland China server):

    # Same query
    curl -vL "https://www.bing.com/search?q=%E6%8B%89%E6%96%AF%E7%BB%B4%E5%8A%A0%E6%96%AF%E5%A4%A9%E6%B0%94"
    

    Result: The request is redirected to cn.bing.com, and the HTML contains relevant results from sites like zhihu.com.
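
The cURL steps can also be scripted. The sketch below is a minimal Python equivalent using only the standard library; `build_search_url` and `fetch_final_host` are hypothetical helper names, not part of ApeRAG. Like cURL, `urllib` sends a non-browser User-Agent by default, so it should exercise the same path, though that is an assumption rather than something verified here.

```python
from urllib.parse import quote, urlsplit
from urllib.request import Request, urlopen

BING_SEARCH = "https://www.bing.com/search"


def build_search_url(query: str) -> str:
    """Percent-encode the query (UTF-8) and append it to the Bing search URL."""
    return f"{BING_SEARCH}?q={quote(query)}"


def fetch_final_host(url: str, timeout: float = 10.0) -> str:
    """Follow any redirects and return the host that finally served the page.

    Performs a real network request, so the result depends on where it runs.
    """
    with urlopen(Request(url), timeout=timeout) as resp:
        # geturl() reports the URL after redirects have been followed.
        return urlsplit(resp.geturl()).hostname or ""
```

Calling `fetch_final_host(build_search_url("拉斯维加斯天气"))` from each server should show whether the request ended on `cn.bing.com` or stayed on `www.bing.com`.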

Root Cause Analysis

The issue stems from how Bing's servers treat programmatic, non-browser requests from different geographic locations:

  1. Geographic Routing: Bing correctly routes traffic to different edge nodes based on IP. The Hangzhou IP is routed to a Mainland China-specific infrastructure, while the Hong Kong IP is routed to a global/HK node.
  2. Client-Side Identity: The HTTP client used by ApeRAG's SearchTool (and cURL) is likely being identified as a "bot" or non-standard client by Bing's global endpoint in Hong Kong. This seems to trigger a fallback or anti-scraping mechanism that serves junk data.
  3. Redirection Difference: The Mainland China infrastructure is configured to redirect all traffic to cn.bing.com, a service optimized for all types of clients. The global infrastructure does not have this behavior, exposing the different treatment of non-browser user agents.

Impact on the ApeRAG Project

  • Unreliable Production Deployments: Any ApeRAG application deployed in Hong Kong (or potentially other regions outside Mainland China) will have a non-functional search/retrieval step.
  • "It Works On My Machine" Problem: This creates a severe discrepancy between development and production, making it difficult to debug and trust local testing.
  • Poor RAG Quality: The core promise of RAG is to provide accurate, context-aware answers. With a faulty retrieval step, the generator produces nonsensical or incorrect information ("garbage in, garbage out").

Proposed Solution / Next Steps

The most direct solution is to make the HTTP requests from ApeRAG's SearchTool appear as if they are coming from a standard web browser.

Recommendation: Modify the HTTP client within the SearchTool to include a standard set of browser headers. At a minimum, this should include:

  • User-Agent: e.g., Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36
  • Accept-Language: e.g., en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7
  • Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7

This change should make Bing's global servers treat the request as a legitimate user interaction, returning relevant results and resolving the environment inconsistency.
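
As a rough illustration of the recommendation, the sketch below attaches the headers listed above to a standard-library request. `build_browser_like_request` is a hypothetical name, and how SearchTool actually constructs its HTTP client is not shown in this issue, so treat this as a pattern rather than a patch.

```python
from urllib.parse import quote
from urllib.request import Request

# Header values copied from the recommendation in this issue.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7",
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/avif,image/webp,image/apng,*/*;q=0.8,"
        "application/signed-exchange;v=b3;q=0.7"
    ),
}


def build_browser_like_request(query: str) -> Request:
    """Build (but do not send) a Bing search request with browser-style headers."""
    url = "https://www.bing.com/search?q=" + quote(query)
    return Request(url, headers=BROWSER_HEADERS)
```

The request object can then be passed to `urllib.request.urlopen`; whether Bing's global endpoint responds differently to these headers still has to be confirmed from the Hong Kong server.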

A longer-term solution might involve integrating with an official, paid search API (such as the Bing Search API). Such APIs are designed for programmatic access and are intended to return consistent results.

Supporting Logs

<details>
<summary><b>Full cURL Log from Hong Kong (Production Simulation)</b></summary>

hk.txt log content here...

    < HTTP/2 200
    ...
    < x-msedge-ref: Ref A: 1B95296EBD2143C0BFF4AA1CAA2697DB Ref B: HKBEDGE0908 Ref C: 2025-09-19T02:05:53Z
    ...


</details>

<details>
<summary><b>Full cURL Log from Hangzhou (Development Simulation)</b></summary>

hz.txt log content here...

    < HTTP/2 302
    < location: https://cn.bing.com/search?q=%E6%8B%89%E6%96%AF%E7%BB%B4%E5%8A%A0%E6%96%AF%E5%A4%A9%E6%B0%94
    ...
    < HTTP/2 200
    ...
    < x-msedge-ref: Ref A: 766A2412CFA14F638355C0212634C9D0 Ref B: BJ1EDGE0719 Ref C: 2025-09-19T02:05:32Z
    ...


</details>
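
The two logs differ in the `x-msedge-ref` header, so it can help to pull that header apart when comparing environments. The sketch below splits it into its three `Ref` fields. Reading `Ref B` as an identifier of the serving edge node is an assumption based on the `HKBEDGE0908` and `BJ1EDGE0719` values in the logs, not documented behavior, and `parse_msedge_ref` is a hypothetical helper name.

```python
import re

# Matches the "Ref A: ... Ref B: ... Ref C: ..." layout seen in the logs.
REF_PATTERN = re.compile(
    r"Ref A: (?P<ref_a>\S+) Ref B: (?P<ref_b>\S+) Ref C: (?P<ref_c>\S+)"
)


def parse_msedge_ref(value: str) -> dict:
    """Split an x-msedge-ref header value into its three Ref fields."""
    match = REF_PATTERN.search(value)
    if match is None:
        raise ValueError(f"unrecognized x-msedge-ref value: {value!r}")
    return match.groupdict()


hk = parse_msedge_ref(
    "Ref A: 1B95296EBD2143C0BFF4AA1CAA2697DB Ref B: HKBEDGE0908 "
    "Ref C: 2025-09-19T02:05:53Z"
)
hz = parse_msedge_ref(
    "Ref A: 766A2412CFA14F638355C0212634C9D0 Ref B: BJ1EDGE0719 "
    "Ref C: 2025-09-19T02:05:32Z"
)
print(hk["ref_b"], hz["ref_b"])
```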

iziang, Sep 19 '25 02:09

Root Cause Identified: DuckDuckGo Search Provider Redirect Handling Issue

I've identified the actual root cause, which is different from the initial Bing anti-bot hypothesis.

The Real Issue

  1. Production Environment: Hong Kong deployment lacks JINA API key configuration
  2. Fallback Mechanism: ApeRAG falls back to DuckDuckGo search provider when JINA is unavailable
  3. Geographic Redirect Problem: DuckDuckGo internally forwards to Bing, but handles redirects differently:
    • Mainland China: bing.com → cn.bing.com (302 redirect) ✅ Works
    • Hong Kong: bing.com serves directly (no redirect) ❌ Fails
  4. Library Limitation: The duckduckgo-search library doesn't properly handle these geographic redirect differences

Why Development Works vs Production Fails

| Environment | JINA API Key | Search Provider | Result |
| --- | --- | --- | --- |
| Development (Hangzhou) | ✅ Configured | JINA (primary) | ✅ Success |
| Production (Hong Kong) | ❌ Missing | DuckDuckGo (fallback) | ❌ Failure |
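
The fallback described above can be pictured as a simple selection step. This is a sketch only: the `JINA_API_KEY` variable name, the `select_search_provider` function, and the provider labels are hypothetical and are not taken from ApeRAG's code.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger(__name__)


def select_search_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "jina" when an API key is present, otherwise "duckduckgo"."""
    env = os.environ if env is None else env
    # Hypothetical variable name; ApeRAG's real setting may differ.
    api_key = env.get("JINA_API_KEY", "").strip()
    if api_key:
        return "jina"
    # Log the fallback so a missing key is not silent in production.
    logger.warning("JINA API key not set; falling back to DuckDuckGo search")
    return "duckduckgo"
```

Logging the fallback is a design choice worth considering here: in this incident the missing key changed the search provider without any visible signal.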

Immediate Solution

Configure JINA API keys in your Hong Kong production environment:

    # Production config
    providers:
      jina:
        api_key: "your-jina-api-key"

This will bypass the problematic DuckDuckGo fallback entirely and use JINA's robust search infrastructure.

Alternative Solutions

  1. Enhanced DuckDuckGo Provider: Improve redirect handling for geographic differences
  2. Direct Bing Search API: Implement official Bing Search API integration
  3. Smart Fallback Logic: Add region-aware search provider selection

Verification

After configuring JINA API keys, test the same queries that previously failed. The search should work consistently across all geographic regions.
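
One rough way to quantify "consistent across regions" is to compare the hosts of the result URLs returned in each environment. The sketch below computes a Jaccard overlap between two result lists; `result_hosts` and `host_overlap` are hypothetical helpers, and what counts as an acceptable overlap is left to the tester.

```python
from typing import Iterable, Set
from urllib.parse import urlsplit


def result_hosts(urls: Iterable[str]) -> Set[str]:
    """Collect the hostnames of the given result URLs, skipping malformed ones."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return hosts


def host_overlap(results_a: Iterable[str], results_b: Iterable[str]) -> float:
    """Jaccard similarity of the two host sets (1.0 means identical sets)."""
    hosts_a, hosts_b = result_hosts(results_a), result_hosts(results_b)
    if not hosts_a and not hosts_b:
        return 1.0  # Two empty result sets are trivially identical.
    return len(hosts_a & hosts_b) / len(hosts_a | hosts_b)
```

A value near zero for the same query run from Hangzhou and Hong Kong would suggest the two environments are still being served by different search paths.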

iziang, Sep 19 '25 03:09

This issue has been marked as stale because it has been open for 30 days with no activity.

github-actions[bot], Oct 27 '25 00:10