
Fix 'Event loop is closed' errors during cleanup after crawling operations

Open • Wirasm opened this issue 3 months ago • 0 comments

Problem

The backend logs show numerous 'Event loop is closed' RuntimeError exceptions during crawling operations. While these errors are non-fatal and don't prevent the crawling/processing from completing successfully, they create noise in the logs and indicate improper resource management.

Current Behavior

  • Crawling and code example extraction complete successfully
  • Summaries are generated correctly using OpenAI's gpt-4.1-nano model
  • Data is properly stored in the database
  • After successful completion, cleanup tasks fail with 'Event loop is closed' errors
  • Errors surface as Task exception was never retrieved, with AsyncClient.aclose() raising RuntimeError('Event loop is closed')

Error Pattern

2025-09-17 12:59:43 | asyncio | ERROR | Task exception was never retrieved
future: <Task finished name='Task-5704' coro=<AsyncClient.aclose() done, defined at /venv/lib/python3.12/site-packages/httpx/_client.py:1978> exception=RuntimeError('Event loop is closed')>
Traceback (most recent call last):
  File "/venv/lib/python3.12/site-packages/httpx/_client.py", line 1985, in aclose
    await self._transport.aclose()
  ...
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 545, in _check_closed
    raise RuntimeError('Event loop is closed')

Root Cause Analysis

The Issue

  1. Excessive client creation: A new OpenAI/LLM client is created for EACH summary generation operation

    • Log pattern: Creating LLM client for provider: openai appears for every single summary
    • Each client creates an httpx AsyncClient that needs cleanup
  2. Orphaned cleanup tasks: When the event loop closes (after crawling completes), there are still pending AsyncClient cleanup tasks

    • These are fire-and-forget tasks that weren't properly awaited
    • They attempt to run after their original event loop has already been closed (a minimal reproduction of this cross-loop failure is sketched after this list)
  3. Resource lifecycle mismatch: No connection pooling or client reuse strategy
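
The cross-loop failure in point 2 can be reproduced in isolation. Below is a minimal sketch, not Archon code: a client whose pooled connections belong to one event loop is closed from another, which is effectively what happens when fire-and-forget cleanup tasks outlive the loop the crawl ran on.

import asyncio
import httpx

async def use_client() -> httpx.AsyncClient:
    client = httpx.AsyncClient()
    await client.get("https://example.com")  # opens a pooled connection on this loop
    return client  # returned without being closed

# The client's connections are bound to this loop...
loop = asyncio.new_event_loop()
client = loop.run_until_complete(use_client())
loop.close()  # ...which is now closed.

# Closing later, from a different loop, touches the dead one:
asyncio.run(client.aclose())  # RuntimeError: Event loop is closed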

Where to Look

Primary Investigation Areas

  1. LLM Provider Service (python/src/server/services/llm_provider_service.py)

    • Check how clients are created/destroyed
    • Look for patterns like creating a new client per operation
    • Should implement client reuse/pooling
  2. Code Extraction Service (python/src/server/services/crawling/code_extraction_service.py)

    • This is where summaries are generated during crawling
    • Check how it calls the LLM provider service
    • Look for loops that create multiple clients
  3. httpx AsyncClient usage

    • Search for AsyncClient creation patterns
    • Check whether clients are properly closed, either with async with or an explicit await client.aclose() (see the safe patterns sketched after this list)
    • Files to check:
      • python/src/server/services/ollama/model_discovery_service.py
      • python/src/server/services/mcp_service_client.py
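
For reference, these are the safe patterns to look for during that audit, sketched with the plain httpx API: either scope short-lived clients with async with, or keep one long-lived client and close it explicitly while the loop is still running.

import httpx

# Short-lived: the context manager guarantees aclose() is awaited
# on the same loop the client was created on.
async def fetch(url: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        return response.text

# Long-lived: create once, close explicitly during shutdown.
async def shutdown(client: httpx.AsyncClient) -> None:
    await client.aclose()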

Log Evidence

  • Errors occur between 12:59:30 - 12:59:46 during summary generation
  • Pattern: Create client → Generate summary → Success logged → Cleanup error
  • ~1291 total errors in one session, but all operations completed successfully

Suggested Fixes

Option 1: Client Pooling (Recommended)

  • Implement a singleton or pool pattern for LLM clients
  • Reuse the same OpenAI client across multiple operations
  • Only create new clients when switching providers
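
A minimal sketch of that pooling, assuming a single long-lived server loop; get_llm_client, close_llm_clients, and the timeout are hypothetical names for illustration, not existing Archon helpers:

import httpx

_clients: dict[str, httpx.AsyncClient] = {}

def get_llm_client(provider: str = "openai") -> httpx.AsyncClient:
    """Return one shared client per provider instead of one per call."""
    if provider not in _clients:
        _clients[provider] = httpx.AsyncClient(timeout=30.0)
    return _clients[provider]

async def close_llm_clients() -> None:
    """Close every cached client once, at shutdown, while the loop is alive."""
    for client in _clients.values():
        await client.aclose()
    _clients.clear()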

Option 2: Proper Cleanup Coordination

  • Use async with context managers for all AsyncClient instances
  • Ensure all cleanup tasks are awaited before the event loop closes
  • Consider using asyncio.gather() with return_exceptions=True for cleanup
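
A minimal sketch of that coordinated shutdown, assuming the open clients can be collected into a list; close_all is a hypothetical helper:

import asyncio
import httpx

async def close_all(clients: list[httpx.AsyncClient]) -> None:
    # return_exceptions=True makes gather wait for every aclose() and
    # return failures as values, rather than raising on the first error
    # while the remaining closes keep running unobserved.
    results = await asyncio.gather(
        *(client.aclose() for client in clients),
        return_exceptions=True,
    )
    for result in results:
        if isinstance(result, Exception):
            print(f"cleanup error: {result!r}")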

Option 3: Task Lifecycle Management

  • Track all background tasks
  • Cancel or await them before shutdown
  • Use asyncio.create_task() with proper task management
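
A minimal sketch of that tracking, assuming background work is spawned through one helper; TaskTracker is a hypothetical name:

import asyncio

class TaskTracker:
    """Hold references to background tasks so they can be drained
    before the event loop shuts down."""

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    def spawn(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    async def shutdown(self) -> None:
        # Await everything still pending; exceptions are collected here
        # instead of surfacing later as "Task exception was never retrieved".
        await asyncio.gather(*self._tasks, return_exceptions=True)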

Example Problem Code Pattern

# Current problematic pattern (likely); create_openai_client and
# client.generate stand in for the service's own helpers:
async def generate_summary(text: str) -> str:
    client = create_openai_client()  # new client (and new httpx AsyncClient) per call!
    result = await client.generate(text)
    # client.aclose() may get scheduled as a fire-and-forget task, never awaited
    return result

# Should be:
class SummaryService:
    def __init__(self):
        self.client = create_openai_client()  # one client, reused across calls

    async def generate_summary(self, text: str) -> str:
        return await self.client.generate(text)

    async def cleanup(self):
        await self.client.aclose()  # explicit, awaited cleanup
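
Whichever option is chosen, cleanup() has to run while the event loop is still alive. If the backend is an ASGI app (the python/src/server layout suggests one), a lifespan hook is the natural place; this FastAPI wiring is an assumption for illustration, not confirmed Archon code:

from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    service = SummaryService()           # reused client lives here
    app.state.summary_service = service
    yield                                # app serves requests
    await service.cleanup()              # awaited before the loop closes

app = FastAPI(lifespan=lifespan)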

Testing

To reproduce:

  1. Start a crawl operation on any documentation site
  2. Watch logs for Creating LLM client messages
  3. After crawling completes, observe the Event loop is closed errors

To verify fix:

  1. Errors should not appear in logs after crawling
  2. Creating LLM client messages should drop sharply (ideally one per provider)
  3. All operations should still complete successfully

Impact

  • Severity: Low (operations work, but logs are noisy)
  • Type: Resource Management / Cleanup
  • Components: LLM Provider Service, Code Extraction, AsyncClient handling

Note: These errors do not affect functionality - all crawling, processing, and storage operations complete successfully. This is purely a cleanup/resource management issue.
