
🐛 [Bug]: Knowledge ingestion fails silently - UUID validation and timeout errors not surfaced to users

Open · thiagomaf opened this issue 1 month ago · 0 comments

Archon Version

v0.1.0

Bug Severity

🔴 Critical - App unusable

Bug Description

When attempting to ingest documentation from certain URLs (e.g., https://docs.qnap.com/operating-system/qts/5.2.x/en-us/overview-736AF80D.html), the ingestion process fails silently. The UI shows the crawl starting, then the operation disappears without any error message, leaving users unaware that the ingestion failed.

Investigation revealed TWO related error handling gaps:

Issue 1: Invalid UUID Errors Not Surfaced to Users

The backend receives invalid task IDs (integers like "12", "322", "61" instead of valid UUIDs) when trying to update tasks during knowledge ingestion. These PostgreSQL validation errors are logged to the backend but never surfaced to users:

```text
ERROR | Error updating task: {'message': 'invalid input syntax for type uuid: "12"', 'code': '22P02', ...}
ERROR | Error updating task: {'message': 'invalid input syntax for type uuid: "322"', 'code': '22P02', ...}
ERROR | Error updating task: {'message': 'invalid input syntax for type uuid: "61"', 'code': '22P02', ...}
```

The errors repeat continuously, ingestion fails, and the operation disappears from the UI.

Issue 2: Crawl Timeout Errors Not Surfaced to Users

Even if Issue 1 were resolved, there's a second error handling gap: when URLs take longer than 30 seconds to load, the crawl times out and this error is also not surfaced to the user. The operation disappears from the UI without explanation.

Both issues share the same root problem: errors are logged in the backend but never reach the user interface.

Steps to Reproduce

Reproducing Issue 1 (UUID Errors):

  1. Go to Knowledge Base page
  2. Click "Add Knowledge"
  3. Enter URL: https://docs.qnap.com/operating-system/qts/5.2.x/en-us/overview-736AF80D.html
  4. Click "Add Source"
  5. Observe ingestion starts but then disappears from UI
  6. Check Docker logs to see UUID validation errors

Reproducing Issue 2 (Timeout Errors):

  1. Use any slow-loading documentation URL (one that takes >30 seconds to load; e.g. https://docs.qnap.com/operating-system/qts/5.2.x/en-us/overview-736AF80D.html)
  2. Follow steps 1-4 from Issue 1, entering the slow-loading URL in step 3
  3. Wait ~2.5 minutes
  4. Crawl operation disappears from UI without error
  5. Check Docker logs to see timeout error

Expected Behavior

When ANY error occurs during knowledge ingestion (UUID validation, timeouts, network errors, etc.):

  1. ✅ Error should be caught at the appropriate layer
  2. ✅ Error should be propagated to the progress tracker
  3. ✅ Error should be displayed in the UI with a clear, actionable message
  4. ✅ Operation should remain visible in an error state (not disappear)
  5. ✅ User can understand what went wrong and take corrective action

Examples of Good Error Messages:

  • Invalid UUID: "Task update failed: Invalid task ID format. Please report this issue with logs."
  • Timeout: "Crawl failed: Page navigation timeout (30s exceeded). This site may be slow to load or experiencing issues."
  • Network: "Crawl failed: Unable to reach URL. Check your connection and try again."

Additional Improvements Needed:

  • Consider making the timeout configurable for slow-loading documentation sites (see the sketch after this list)
  • Add retry mechanism for transient failures
  • Provide troubleshooting suggestions in error messages
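
For the configurable-timeout item above, here is a minimal sketch of reading a crawl timeout from the environment and passing it through to the crawler. It assumes the installed crawl4ai version accepts `CrawlerRunConfig(page_timeout=...)` in milliseconds; the `CRAWL_PAGE_TIMEOUT_MS` variable name is hypothetical, not an existing Archon setting:

```python
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Hypothetical env var; the default mirrors today's 30s behaviour.
PAGE_TIMEOUT_MS = int(os.getenv("CRAWL_PAGE_TIMEOUT_MS", "30000"))

async def crawl_page(url: str):
    # page_timeout bounds page navigation; when exceeded, crawl4ai raises
    # the RuntimeError shown in the error details below instead of hanging.
    run_config = CrawlerRunConfig(page_timeout=PAGE_TIMEOUT_MS)
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun(url=url, config=run_config)
```

Raising the default (or exposing it in settings) would let heavy documentation sites finish loading, as long as the timeout error is still surfaced to the user when it does fire.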

Actual Behavior

Issue 1 (UUID Errors):

  1. ❌ Invalid UUIDs (integers) passed to task update endpoints
  2. ❌ PostgreSQL UUID validation fails at database layer
  3. ❌ Errors logged but not caught or handled properly
  4. ❌ No validation at API or service boundaries
  5. ❌ Operation disappears from UI without error message
  6. ❌ User has no indication of what went wrong

Issue 2 (Timeout Errors):

  1. ❌ Page times out after 30 seconds during navigation
  2. ❌ Crawl4AI raises RuntimeError with timeout details
  3. ❌ Error caught and logged: ValueError: No content was crawled from the provided URL
  4. ❌ Error is not propagated to progress tracker
  5. ❌ Operation disappears from UI without error message
  6. ❌ User has no indication of what went wrong

Common Pattern: Errors are logged in backend but never reach the user interface.

Error Details (if any)

#### Error 1: UUID Validation Errors

```text
2025-10-28 19:09:07 | src.server.services.projects.task_service | ERROR | Error updating task: {'message': 'invalid input syntax for type uuid: "12"', 'code': '22P02', 'hint': None, 'details': None}
2025-10-28 19:09:10 | src.server.services.projects.task_service | ERROR | Error updating task: {'message': 'invalid input syntax for type uuid: "322"', 'code': '22P02', 'hint': None, 'details': None}
2025-10-28 19:09:10 | src.server.services.projects.task_service | ERROR | Error updating task: {'message': 'invalid input syntax for type uuid: "61"', 'code': '22P02', 'hint': None, 'details': None}
```

These errors repeat continuously throughout the ingestion attempt. The PostgreSQL error code `22P02` indicates "invalid input syntax": the database is receiving integers where it expects UUIDs.

#### Error 2: Crawl Timeout Errors

```text
[ERROR]... × https://docs.qnap.com/o.../overview-736AF80D.html  | Error:
Unexpected error in _crawl_web at line 696 in _crawl_web
(../venv/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 30000ms exceeded.
Call log:
  - navigating to "https://docs.qnap.com/operating-system/qts/5.2.x/en-us/overview-736AF80D.html", waiting until "domcontentloaded"

Code context:
 691                               tag="GOTO",
 692                               params={"url": url},
 693                           )
 694                           response = None
 695                       else:
 696 →                         raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
 697

2025-10-28 19:37:37 | src.server.services.crawling.strategies.recursive | WARNING | Failed to crawl https://docs.qnap.com/operating-system/qts/5.2.x/en-us/overview-736AF80D.html: Unexpected error in _crawl_web...

2025-10-28 19:37:37 | src.server.services.crawling.crawling_service | ERROR | Async crawl orchestration failed
Traceback (most recent call last):
  File "/app/src/server/services/crawling/crawling_service.py", line 504, in _async_orchestrate_crawl
    raise ValueError("No content was crawled from the provided URL")
ValueError: No content was crawled from the provided URL
```

**Note**: After this error is logged, the UI shows no error notification and the crawl operation vanishes.

Affected Component

🔍 Knowledge Base / RAG

Browser & OS

Edge on Windows 10

Additional Context

Related Issues

  • #825: Similar issue with ingestion; affecting sitemaps with relative URLs
  • #763: Similar UUID validation issue where MCP server passed "1" instead of valid UUID for project_id
  • #607 (closed): Similar "No content was crawled" error for different root cause (sitemap parsing issue)

These related issues also involved errors being logged in the backend but never properly surfaced to users.

Error Flow Analysis

Issue 1 - UUID Errors:

Unknown source passes integers ("12", "322", "61")
    ↓
Task API endpoints receive invalid UUID in path parameter
    ↓
No validation at API boundary
    ↓
Passed to service layer without validation
    ↓
PostgreSQL UUID validation fails
    ↓
Error logged but NOT propagated to progress tracker
    ↓
UI never notified → operation disappears

Issue 2 - Timeout Errors:

Page navigation timeout (30s)
    ↓
crawl4ai → RuntimeError
    ↓
RecursiveCrawlStrategy → logs warning
    ↓
CrawlingService._async_orchestrate_crawl() → raises ValueError
    ↓
Exception caught but NOT sent to progress_tracker.error()
    ↓
UI never notified → operation disappears

Technical Root Causes

Issue 1 - UUID Validation:

  1. No UUID validation at API boundaries: Endpoints accept any string value in path parameters
  2. No validation in service layer: Service methods don't validate UUID format before database operations
  3. PostgreSQL errors not caught early: Validation happens at database level, too late for good error handling
  4. No error propagation: Exception caught but not sent to progress tracker
  5. Unknown source of invalid UUIDs: Still unclear what code is passing these integer values

Issue 2 - Timeout Errors:

  1. Missing error propagation: Exception at line 504 in crawling_service.py is logged but not sent to progress_tracker.error()
  2. Timeout configuration: 30-second timeout may be insufficient for:
    • Heavy documentation sites with lots of JavaScript
    • Sites with slow initial load times
    • Sites behind CDNs with cold cache
    • Sites with region-specific routing delays
  3. UI error handling gap: When progress tracker never receives error notification, UI polling sees operation as "disappeared" rather than "failed with error"

Suggested Fixes

For Issue 1 (UUID Validation):

Location: python/src/server/services/projects/task_service.py

  • Add UUID format validation in update_task() method before database operations (see the sketch after this list)
  • Add comprehensive logging to identify source of invalid UUIDs
  • Return clear error messages for invalid UUID format
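
A minimal sketch of that service-layer guard follows; the helper name and the return shape are illustrative only and would need to match Archon's actual task_service conventions:

```python
import logging
import uuid

logger = logging.getLogger(__name__)

def is_valid_uuid(value: str) -> bool:
    """Return True only if value parses as a UUID (rejects raw integers like "12")."""
    try:
        uuid.UUID(str(value))
        return True
    except (ValueError, TypeError):
        return False

# Hypothetical guard at the top of TaskService.update_task(), before any DB call:
#
# if not is_valid_uuid(task_id):
#     logger.error("Rejected task update with non-UUID task_id=%r (caller bug?)", task_id)
#     return False, {"error": f"Invalid task ID format: {task_id!r} is not a UUID"}
```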

Location: python/src/server/api_routes/projects_api.py

  • Add UUID validation at the API boundary for all task endpoints (see the sketch after this list):
    • GET /api/tasks/{task_id}
    • PUT /api/tasks/{task_id}
    • DELETE /api/tasks/{task_id}
    • PUT /api/mcp/tasks/{task_id}/status
  • Return HTTP 400 (Bad Request) with descriptive errors
  • Prevent invalid requests from reaching database layer
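
Assuming these routes are FastAPI endpoints, one way to reject bad IDs with a descriptive HTTP 400 before they reach the database is a small shared path-parameter dependency; the helper name and message wording are illustrative only:

```python
from uuid import UUID

from fastapi import Depends, HTTPException

def require_uuid(task_id: str) -> str:
    """Validate the task_id path parameter; fail with HTTP 400 before the DB is touched."""
    try:
        UUID(task_id)
    except ValueError:
        raise HTTPException(
            status_code=400,
            detail=(
                f"Invalid task ID {task_id!r}: expected a UUID. "
                "Please report this issue with logs."
            ),
        )
    return task_id

# Hypothetical usage on one of the affected routes:
#
# @router.put("/api/tasks/{task_id}")
# async def update_task(task_id: str = Depends(require_uuid), ...):
#     ...
```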

For Issue 2 (Timeout Errors):

Location: python/src/server/services/crawling/crawling_service.py

  • Function: _async_orchestrate_crawl()
  • Lines: ~500-505 (exception handling block)
  • Change: Add error propagation to progress tracker before raising

Example Fix:

```python
except Exception as e:
    error_message = f"Crawl failed: {str(e)}"
    safe_logfire_error(f"Async crawl orchestration failed | error={error_message}")

    # CRITICAL: Notify progress tracker before raising
    if self.progress_tracker:
        await self.progress_tracker.error(
            error_message=error_message,
            error_details={"exception_type": type(e).__name__}
        )
    raise  # Re-raise after notifying tracker
```
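
Notifying the tracker before re-raising keeps the existing logging and failure semantics intact while giving the UI a terminal error state to render, so the operation shows up as failed instead of silently vanishing.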

Reproducibility

  • Consistent: Yes - happens every time with the QNAP docs URL
  • Timing:
    • Issue #1: Errors appear within seconds, repeat continuously
    • Issue #2: Takes ~2 minutes 15 seconds before timeout occurs
  • Other URLs: Issue #1 may be specific to certain sites; Issue #2 affects any slow-loading site

Testing Strategy for Fixes

For Issue 1 (UUID Validation):

  1. Test with the QNAP URL to see if validation catches invalid UUIDs
  2. Verify HTTP 400 errors are returned with clear messages (a test sketch follows this list)
  3. Confirm logging helps identify source of invalid UUIDs
  4. Test with valid task IDs to ensure no regression
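
A hypothetical pytest sketch of that boundary check is below; the app import path, route, and payload are assumptions that would need adjusting to the real project layout:

```python
import pytest
from fastapi.testclient import TestClient

# Hypothetical import path; point this at wherever the FastAPI app is created.
from src.server.main import app

client = TestClient(app)

@pytest.mark.parametrize("bad_id", ["12", "322", "61", "not-a-uuid"])
def test_task_update_rejects_non_uuid_ids(bad_id: str) -> None:
    # Invalid IDs should be rejected at the API boundary with HTTP 400,
    # never reaching PostgreSQL's UUID parser (error code 22P02).
    response = client.put(f"/api/tasks/{bad_id}", json={"status": "doing"})
    assert response.status_code == 400
    assert "uuid" in response.json()["detail"].lower()
```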

For Issue 2 (Timeout Errors):

  1. Test with the QNAP URL to verify timeout error message appears in UI
  2. Test with fast-loading URLs to verify normal crawls still work
  3. Test with other slow URLs to verify consistent error handling
  4. Verify user can retry or cancel failed operations
  5. Consider testing with configurable timeout values

Service Status (check all that are working)

  • [x] 🖥️ Frontend UI (http://localhost:3737)
  • [x] ⚙️ Main Server (http://localhost:8181)
  • [x] 🔗 MCP Service (localhost:8051)
  • [ ] 🤖 Agents Service (http://localhost:8052)
  • [x] 💾 Supabase Database (connected)
