
Fix 'Event loop is closed' errors during cleanup after crawling operations

Open • Wirasm opened this issue 3 months ago • 0 comments

Problem

The backend logs show numerous 'Event loop is closed' RuntimeError exceptions during crawling operations. While these errors are non-fatal and don't prevent the crawling/processing from completing successfully, they create noise in the logs and indicate improper resource management.

Current Behavior

  • Crawling and code example extraction complete successfully
  • Summaries are generated correctly using OpenAI's gpt-4.1-nano model
  • Data is properly stored in the database
  • After successful completion, cleanup tasks fail with 'Event loop is closed' errors
  • Errors surface as Task exception was never retrieved, with AsyncClient.aclose() raising RuntimeError('Event loop is closed')

Error Pattern

2025-09-17 12:59:43 | asyncio | ERROR | Task exception was never retrieved
future: <Task finished name='Task-5704' coro=<AsyncClient.aclose() done, defined at /venv/lib/python3.12/site-packages/httpx/_client.py:1978> exception=RuntimeError('Event loop is closed')>
Traceback (most recent call last):
  File "/venv/lib/python3.12/site-packages/httpx/_client.py", line 1985, in aclose
    await self._transport.aclose()
  ...
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 545, in _check_closed
    raise RuntimeError('Event loop is closed')

Root Cause Analysis

The Issue

  1. Excessive client creation: A new OpenAI/LLM client is created for EACH summary generation operation

    • Log pattern: Creating LLM client for provider: openai appears for every single summary
    • Each client creates an httpx AsyncClient that needs cleanup
  2. Orphaned cleanup tasks: When the event loop closes (after crawling completes), there are still pending AsyncClient cleanup tasks

    • These are fire-and-forget tasks that weren't properly awaited
    • They attempt to run after their original event loop has already been closed (a minimal reproduction of this cross-loop failure is sketched after this list)
  3. Resource lifecycle mismatch: No connection pooling or client reuse strategy
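
The cross-loop failure in point 2 can be reproduced in isolation. Below is a minimal sketch, not Archon code: a client whose pooled connections belong to one event loop is closed from another, which is effectively what happens when fire-and-forget cleanup tasks outlive the loop the crawl ran on.

import asyncio
import httpx

async def use_client() -> httpx.AsyncClient:
    client = httpx.AsyncClient()
    await client.get("https://example.com")  # opens a pooled connection on this loop
    return client  # returned without being closed

# The client's connections are bound to this loop...
loop = asyncio.new_event_loop()
client = loop.run_until_complete(use_client())
loop.close()  # ...which is now closed.

# Closing later, from a different loop, touches the dead one:
asyncio.run(client.aclose())  # RuntimeError: Event loop is closed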

Where to Look

Primary Investigation Areas

  1. LLM Provider Service (python/src/server/services/llm_provider_service.py)

    • Check how clients are created/destroyed
    • Look for patterns like creating a new client per operation
    • Should implement client reuse/pooling
  2. Code Extraction Service (python/src/server/services/crawling/code_extraction_service.py)

    • This is where summaries are generated during crawling
    • Check how it calls the LLM provider service
    • Look for loops that create multiple clients
  3. httpx AsyncClient usage

    • Search for AsyncClient creation patterns
    • Check whether clients are properly closed, either with async with or an explicit await client.aclose() (see the safe patterns sketched after this list)
    • Files to check:
      • python/src/server/services/ollama/model_discovery_service.py
      • python/src/server/services/mcp_service_client.py
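
For reference, these are the safe patterns to look for during that audit, sketched with the plain httpx API: either scope short-lived clients with async with, or keep one long-lived client and close it explicitly while the loop is still running.

import httpx

# Short-lived: the context manager guarantees aclose() is awaited
# on the same loop the client was created on.
async def fetch(url: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        return response.text

# Long-lived: create once, close explicitly during shutdown.
async def shutdown(client: httpx.AsyncClient) -> None:
    await client.aclose()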

Log Evidence

  • Errors occur between 12:59:30 - 12:59:46 during summary generation
  • Pattern: Create client → Generate summary → Success logged → Cleanup error
  • ~1291 total errors in one session, but all operations completed successfully

Suggested Fixes

Option 1: Client Pooling (Recommended)

  • Implement a singleton or pool pattern for LLM clients
  • Reuse the same OpenAI client across multiple operations
  • Only create new clients when switching providers
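
A minimal sketch of that pooling, assuming a single long-lived server loop; get_llm_client, close_llm_clients, and the timeout are hypothetical names for illustration, not existing Archon helpers:

import httpx

_clients: dict[str, httpx.AsyncClient] = {}

def get_llm_client(provider: str = "openai") -> httpx.AsyncClient:
    """Return one shared client per provider instead of one per call."""
    if provider not in _clients:
        _clients[provider] = httpx.AsyncClient(timeout=30.0)
    return _clients[provider]

async def close_llm_clients() -> None:
    """Close every cached client once, at shutdown, while the loop is alive."""
    for client in _clients.values():
        await client.aclose()
    _clients.clear()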

Option 2: Proper Cleanup Coordination

  • Use async with context managers for all AsyncClient instances
  • Ensure all cleanup tasks are awaited before the event loop closes
  • Consider using asyncio.gather() with return_exceptions=True for cleanup
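
A minimal sketch of that coordinated shutdown, assuming the open clients can be collected into a list; close_all is a hypothetical helper:

import asyncio
import httpx

async def close_all(clients: list[httpx.AsyncClient]) -> None:
    # return_exceptions=True makes gather wait for every aclose() and
    # return failures as values, rather than raising on the first error
    # while the remaining closes keep running unobserved.
    results = await asyncio.gather(
        *(client.aclose() for client in clients),
        return_exceptions=True,
    )
    for result in results:
        if isinstance(result, Exception):
            print(f"cleanup error: {result!r}")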

Option 3: Task Lifecycle Management

  • Track all background tasks
  • Cancel or await them before shutdown
  • Use asyncio.create_task() with proper task management
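
A minimal sketch of that tracking, assuming background work is spawned through one helper; TaskTracker is a hypothetical name:

import asyncio

class TaskTracker:
    """Hold references to background tasks so they can be drained
    before the event loop shuts down."""

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    def spawn(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    async def shutdown(self) -> None:
        # Await everything still pending; exceptions are collected here
        # instead of surfacing later as "Task exception was never retrieved".
        await asyncio.gather(*self._tasks, return_exceptions=True)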

Example Problem Code Pattern

# Current problematic pattern (likely); create_openai_client and
# client.generate stand in for the service's own helpers:
async def generate_summary(text: str) -> str:
    client = create_openai_client()  # new client (and new httpx AsyncClient) per call!
    result = await client.generate(text)
    # client.aclose() may get scheduled as a fire-and-forget task, never awaited
    return result

# Should be:
class SummaryService:
    def __init__(self):
        self.client = create_openai_client()  # one client, reused across calls

    async def generate_summary(self, text: str) -> str:
        return await self.client.generate(text)

    async def cleanup(self):
        await self.client.aclose()  # explicit, awaited cleanup
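
Whichever option is chosen, cleanup() has to run while the event loop is still alive. If the backend is an ASGI app (the python/src/server layout suggests one), a lifespan hook is the natural place; this FastAPI wiring is an assumption for illustration, not confirmed Archon code:

from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    service = SummaryService()           # reused client lives here
    app.state.summary_service = service
    yield                                # app serves requests
    await service.cleanup()              # awaited before the loop closes

app = FastAPI(lifespan=lifespan)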

Testing

To reproduce:

  1. Start a crawl operation on any documentation site
  2. Watch logs for Creating LLM client messages
  3. After crawling completes, observe the Event loop is closed errors

To verify fix:

  1. Errors should not appear in logs after crawling
  2. Creating LLM client messages should drop sharply (ideally one per provider)
  3. All operations should still complete successfully

Impact

  • Severity: Low (operations work, but logs are noisy)
  • Type: Resource Management / Cleanup
  • Components: LLM Provider Service, Code Extraction, AsyncClient handling

Note: These errors do not affect functionality - all crawling, processing, and storage operations complete successfully. This is purely a cleanup/resource management issue.
