🐛 [Bug]: Knowledge ingestion fails silently - UUID validation and timeout errors not surfaced to users
Archon Version
v0.1.0
Bug Severity
🔴 Critical - App unusable
Bug Description
When attempting to ingest documentation from certain URLs (e.g., https://docs.qnap.com/operating-system/qts/5.2.x/en-us/overview-736AF80D.html), the ingestion process fails silently. The UI shows the crawl starting, then the operation disappears without any error message, leaving users unaware that the ingestion failed.
Investigation revealed TWO related error handling gaps:
Issue 1: Invalid UUID Errors Not Surfaced to Users
The backend receives invalid task IDs (integers like "12", "322", "61" instead of valid UUIDs) when trying to update tasks during knowledge ingestion. These PostgreSQL validation errors are logged to the backend but never surfaced to users:
ERROR | Error updating task: {'message': 'invalid input syntax for type uuid: "12"', 'code': '22P02', ...}
ERROR | Error updating task: {'message': 'invalid input syntax for type uuid: "322"', 'code': '22P02', ...}
ERROR | Error updating task: {'message': 'invalid input syntax for type uuid: "61"', 'code': '22P02', ...}
The errors repeat continuously, ingestion fails, and the operation disappears from the UI.
Issue 2: Crawl Timeout Errors Not Surfaced to Users
Even if Issue 1 were resolved, there's a second error handling gap: when URLs take longer than 30 seconds to load, the crawl times out and this error is also not surfaced to the user. The operation disappears from the UI without explanation.
Both issues share the same root problem: errors are logged in the backend but never reach the user interface.
Steps to Reproduce
Reproducing Issue 1 (UUID Errors):
- Go to Knowledge Base page
- Click "Add Knowledge"
- Enter URL: https://docs.qnap.com/operating-system/qts/5.2.x/en-us/overview-736AF80D.html
- Click "Add Source"
- Observe ingestion starts but then disappears from UI
- Check Docker logs to see UUID validation errors
Reproducing Issue 2 (Timeout Errors):
- Use any slow-loading documentation URL (one that takes >30 seconds to load; e.g. https://docs.qnap.com/operating-system/qts/5.2.x/en-us/overview-736AF80D.html)
- Follow steps 1-4 above
- Wait ~2.5 minutes
- Crawl operation disappears from UI without error
- Check Docker logs to see timeout error
Expected Behavior
When ANY error occurs during knowledge ingestion (UUID validation, timeouts, network errors, etc.):
- ✅ Error should be caught at the appropriate layer
- ✅ Error should be propagated to the progress tracker
- ✅ Error should be displayed in the UI with a clear, actionable message
- ✅ Operation should remain visible in an error state (not disappear)
- ✅ User can understand what went wrong and take corrective action
Examples of Good Error Messages:
- Invalid UUID: "Task update failed: Invalid task ID format. Please report this issue with logs."
- Timeout: "Crawl failed: Page navigation timeout (30s exceeded). This site may be slow to load or experiencing issues."
- Network: "Crawl failed: Unable to reach URL. Check your connection and try again."
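One way to produce messages like these is a small translation step between the caught exception and the progress tracker. The sketch below is only an illustration: `describe_ingestion_error` is a hypothetical helper, not an existing Archon function, and the substring checks are assumptions based on the log excerpts in this report.

```python
def describe_ingestion_error(exc: BaseException) -> str:
    """Map a backend exception to a user-facing message (hypothetical helper)."""
    text = str(exc)
    lowered = text.lower()
    # PostgreSQL rejects malformed UUIDs with "invalid input syntax for type uuid".
    if "invalid input syntax for type uuid" in lowered:
        return (
            "Task update failed: Invalid task ID format. "
            "Please report this issue with logs."
        )
    # Navigation timeouts in the logs contain "Timeout 30000ms exceeded".
    if isinstance(exc, TimeoutError) or ("timeout" in lowered and "exceeded" in lowered):
        return (
            "Crawl failed: Page navigation timeout (30s exceeded). "
            "This site may be slow to load or experiencing issues."
        )
    if isinstance(exc, ConnectionError):
        return "Crawl failed: Unable to reach URL. Check your connection and try again."
    # Fall back to the raw text so a failure is never shown as a blank error.
    return f"Ingestion failed: {text or type(exc).__name__}"
```

Matching on message text is brittle; if the underlying libraries expose typed exceptions, matching on those would be the sturdier choice.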
Additional Improvements Needed:
- Consider making timeout configurable for slow-loading documentation sites
- Add retry mechanism for transient failures
- Provide troubleshooting suggestions in error messages
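The first two improvements could be combined in a wrapper like the one sketched below. This is a minimal illustration using only `asyncio`: `run_with_retries` and its parameters are hypothetical names, and it wraps an arbitrary coroutine rather than showing how the Crawl4AI page timeout itself is configured.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_with_retries(
    operation: Callable[[], Awaitable[T]],
    *,
    timeout_s: float = 30.0,
    attempts: int = 3,
    base_delay_s: float = 1.0,
) -> T:
    """Run `operation` with a per-attempt timeout, retrying transient failures."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return await asyncio.wait_for(operation(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == attempts:
                raise  # out of retries: let the caller surface the error
            # Exponential backoff: base, 2 * base, 4 * base, ...
            await asyncio.sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Only timeouts and connection errors are retried here; other exceptions propagate immediately, so the final failure still has to be reported to the user by the caller.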
Actual Behavior
Issue 1 (UUID Errors):
- ❌ Invalid UUIDs (integers) passed to task update endpoints
- ❌ PostgreSQL UUID validation fails at database layer
- ❌ Errors logged but not caught or handled properly
- ❌ No validation at API or service boundaries
- ❌ Operation disappears from UI without error message
- ❌ User has no indication of what went wrong
Issue 2 (Timeout Errors):
- ❌ Page times out after 30 seconds during navigation
- ❌ Crawl4AI raises `RuntimeError` with timeout details
- ❌ Error caught and logged: `ValueError: No content was crawled from the provided URL`
- ❌ Error is not propagated to progress tracker
- ❌ Operation disappears from UI without error message
- ❌ User has no indication of what went wrong
Common Pattern: Errors are logged in backend but never reach the user interface.
Error Details (if any)
#### Error 1: UUID Validation Errors
2025-10-28 19:09:07 | src.server.services.projects.task_service | ERROR | Error updating task: {'message': 'invalid input syntax for type uuid: "12"', 'code': '22P02', 'hint': None, 'details': None}
2025-10-28 19:09:10 | src.server.services.projects.task_service | ERROR | Error updating task: {'message': 'invalid input syntax for type uuid: "322"', 'code': '22P02', 'hint': None, 'details': None}
2025-10-28 19:09:10 | src.server.services.projects.task_service | ERROR | Error updating task: {'message': 'invalid input syntax for type uuid: "61"', 'code': '22P02', 'hint': None, 'details': None}
These errors repeat continuously throughout the ingestion attempt. The PostgreSQL error code `22P02` (`invalid_text_representation`) is reported with the message "invalid input syntax": the database is receiving integer strings where it expects UUID format.
#### Error 2: Crawl Timeout Errors
[ERROR]... × https://docs.qnap.com/o.../overview-736AF80D.html | Error:
Unexpected error in _crawl_web at line 696 in _crawl_web
(../venv/lib/python3.12/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 30000ms exceeded.
Call log:
- navigating to "https://docs.qnap.com/operating-system/qts/5.2.x/en-us/overview-736AF80D.html",
waiting until "domcontentloaded"
Code context:
691 tag="GOTO",
692 params={"url": url},
693 )
694 response = None
695 else:
696 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
697
2025-10-28 19:37:37 | src.server.services.crawling.strategies.recursive | WARNING | Failed to crawl https://docs.qnap.com/operating-system/qts/5.2.x/en-us/overview-736AF80D.html: Unexpected error in _crawl_web...
2025-10-28 19:37:37 | src.server.services.crawling.crawling_service | ERROR | Async crawl orchestration failed
Traceback (most recent call last):
File "/app/src/server/services/crawling/crawling_service.py", line 504, in _async_orchestrate_crawl
raise ValueError("No content was crawled from the provided URL")
ValueError: No content was crawled from the provided URL
**Note**: After this error is logged, the UI shows no error notification and the crawl operation vanishes.
Affected Component
🔍 Knowledge Base / RAG
Browser & OS
Edge on Windows 10
Additional Context
Related Issues
- #825: Similar issue with ingestion; affecting sitemaps with relative URLs
- #763: Similar UUID validation issue where MCP server passed "1" instead of valid UUID for project_id
- #607 (closed): Similar "No content was crawled" error for different root cause (sitemap parsing issue)
These issues also involved errors being logged but not properly surfaced to users.
Error Flow Analysis
Issue 1 - UUID Errors:
Unknown source passes integers ("12", "322", "61")
↓
Task API endpoints receive invalid UUID in path parameter
↓
No validation at API boundary
↓
Passed to service layer without validation
↓
PostgreSQL UUID validation fails
↓
Error logged but NOT propagated to progress tracker
↓
UI never notified → operation disappears
Issue 2 - Timeout Errors:
Page navigation timeout (30s)
↓
crawl4ai → RuntimeError
↓
RecursiveCrawlStrategy → logs warning
↓
CrawlingService._async_orchestrate_crawl() → raises ValueError
↓
Exception caught but NOT sent to progress_tracker.error()
↓
UI never notified → operation disappears
Technical Root Causes
Issue 1 - UUID Validation:
- No UUID validation at API boundaries: Endpoints accept any string value in path parameters
- No validation in service layer: Service methods don't validate UUID format before database operations
- PostgreSQL errors not caught early: Validation happens at database level, too late for good error handling
- No error propagation: Exception caught but not sent to progress tracker
- Unknown source of invalid UUIDs: Still unclear what code is passing these integer values
Issue 2 - Timeout Errors:
- Missing error propagation: Exception at line 504 in `crawling_service.py` is logged but not sent to `progress_tracker.error()`
- Timeout configuration: 30-second timeout may be insufficient for:
  - Heavy documentation sites with lots of JavaScript
  - Sites with slow initial load times
  - Sites behind CDNs with cold cache
  - Sites with region-specific routing delays
- UI error handling gap: When progress tracker never receives error notification, UI polling sees operation as "disappeared" rather than "failed with error"
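To illustrate the last point, the sketch below shows a polling store that keeps a failed operation visible instead of dropping it. It is a simplified stand-in: `ProgressStore`, `OperationState`, and the status strings are hypothetical and do not reflect Archon's actual progress tracker.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class OperationState:
    status: str = "running"  # "running", "completed", or "error"
    error_message: Optional[str] = None


class ProgressStore:
    """Hypothetical in-memory store that the UI polls by progress ID."""

    def __init__(self) -> None:
        self._operations: Dict[str, OperationState] = {}

    def start(self, progress_id: str) -> None:
        self._operations[progress_id] = OperationState()

    def fail(self, progress_id: str, message: str) -> None:
        # Keep the record and mark it failed instead of deleting it,
        # so the next poll can show the error to the user.
        self._operations[progress_id] = OperationState("error", message)

    def poll(self, progress_id: str) -> dict:
        state = self._operations.get(progress_id)
        if state is None:
            return {"status": "not_found"}
        return {"status": state.status, "error": state.error_message}
```

With this shape, the UI can distinguish an unknown operation (`not_found`) from one that failed (`error`), which is the distinction currently lost.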
Suggested Fixes
For Issue 1 (UUID Validation):
Location: python/src/server/services/projects/task_service.py
- Add UUID format validation in `update_task()` method before database operations
- Add comprehensive logging to identify source of invalid UUIDs
- Return clear error messages for invalid UUID format
Location: python/src/server/api_routes/projects_api.py
- Add UUID validation at API boundary for all task endpoints:
  - `GET /api/tasks/{task_id}`
  - `PUT /api/tasks/{task_id}`
  - `DELETE /api/tasks/{task_id}`
  - `PUT /api/mcp/tasks/{task_id}/status`
- Return HTTP 400 (Bad Request) with descriptive errors
- Prevent invalid requests from reaching database layer
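The validation suggested for both locations might look roughly like the sketch below. It uses only the standard library and is framework-agnostic; `InvalidTaskIdError`, `parse_task_id`, and `handle_task_request` are hypothetical names, and the real endpoints would return the 400 through whatever response mechanism the API layer already uses.

```python
import logging
import uuid
from typing import Any, Callable, Dict, Tuple

logger = logging.getLogger(__name__)


class InvalidTaskIdError(ValueError):
    """Raised when a task ID is not a well-formed UUID (hypothetical exception)."""


def parse_task_id(task_id: object, *, caller: str = "unknown") -> uuid.UUID:
    """Return task_id as a UUID, or raise before any database call is made."""
    try:
        if not isinstance(task_id, str):
            raise TypeError(f"expected str, got {type(task_id).__name__}")
        return uuid.UUID(task_id)
    except (TypeError, ValueError) as exc:
        # Log the caller and raw value to help trace where integer IDs originate.
        logger.error("Invalid task ID %r from %s: %s", task_id, caller, exc)
        raise InvalidTaskIdError(
            f"Invalid task ID format: {task_id!r} is not a valid UUID"
        ) from exc


def handle_task_request(
    task_id: object,
    handler: Callable[[uuid.UUID], Dict[str, Any]],
) -> Tuple[int, Dict[str, Any]]:
    """Validate at the API boundary and return (HTTP status, response body)."""
    try:
        parsed = parse_task_id(task_id, caller="api")
    except InvalidTaskIdError as exc:
        return 400, {"error": str(exc)}
    return 200, handler(parsed)
```

For example, `handle_task_request("12", ...)` returns a 400 with a descriptive message, so the request never reaches PostgreSQL.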
For Issue 2 (Timeout Errors):
Location: python/src/server/services/crawling/crawling_service.py
- Function: `_async_orchestrate_crawl()`
- Lines: ~500-505 (exception handling block)
- Change: Add error propagation to progress tracker before raising
Example Fix:
except Exception as e:
    error_message = f"Crawl failed: {str(e)}"
    safe_logfire_error(f"Async crawl orchestration failed | error={error_message}")
    # CRITICAL: Notify progress tracker before raising
    if self.progress_tracker:
        await self.progress_tracker.error(
            error_message=error_message,
            error_details={"exception_type": type(e).__name__}
        )
    raise  # Re-raise after notifying tracker
Reproducibility
- Consistent: Yes - happens every time with the QNAP docs URL
- Timing:
  - Issue 1: Errors appear within seconds, repeat continuously
  - Issue 2: Takes ~2 minutes 15 seconds before timeout occurs
- Other URLs: Issue 1 may be specific to certain sites; Issue 2 affects any slow-loading site
Testing Strategy for Fixes
For Issue 1 (UUID Validation):
- Test with the QNAP URL to see if validation catches invalid UUIDs
- Verify HTTP 400 errors are returned with clear messages
- Confirm logging helps identify source of invalid UUIDs
- Test with valid task IDs to ensure no regression
For Issue 2 (Timeout Errors):
- Test with the QNAP URL to verify timeout error message appears in UI
- Test with fast-loading URLs to verify normal crawls still work
- Test with other slow URLs to verify consistent error handling
- Verify user can retry or cancel failed operations
- Consider testing with configurable timeout values
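The backend half of the Issue 2 checks could be automated without a real slow site. The sketch below is an assumption-heavy illustration: `StubProgressTracker`, `crawl`, and `orchestrate` are simplified stand-ins written for this report, not the actual `CrawlingService` code. It checks that a failed crawl both notifies the tracker and re-raises.

```python
import asyncio
from typing import Any, Dict, List, Optional


class StubProgressTracker:
    """Stand-in tracker that records what the UI would receive."""

    def __init__(self) -> None:
        self.errors: List[Dict[str, Any]] = []

    async def error(
        self, error_message: str, error_details: Optional[dict] = None
    ) -> None:
        self.errors.append({"message": error_message, "details": error_details or {}})


async def crawl(url: str) -> None:
    # Simulate the failure seen in the crawling_service traceback.
    raise ValueError("No content was crawled from the provided URL")


async def orchestrate(url: str, tracker: Optional[StubProgressTracker]) -> None:
    try:
        await crawl(url)
    except Exception as e:
        error_message = f"Crawl failed: {e}"
        if tracker:
            # Notify the tracker first so the UI can show an error state.
            await tracker.error(
                error_message=error_message,
                error_details={"exception_type": type(e).__name__},
            )
        raise  # still propagate to the caller


def check_error_reaches_tracker() -> None:
    tracker = StubProgressTracker()
    try:
        asyncio.run(orchestrate("https://example.com", tracker))
    except ValueError:
        pass
    else:
        raise AssertionError("expected the crawl error to be re-raised")
    assert tracker.errors[0]["details"] == {"exception_type": "ValueError"}
    assert tracker.errors[0]["message"].startswith("Crawl failed:")
```

Calling `check_error_reaches_tracker()` passes silently when the error reaches the tracker; the UI-side checks (error state visible, retry or cancel available) would still need manual or end-to-end testing.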
Service Status (check all that are working)
- [x] 🖥️ Frontend UI (http://localhost:3737)
- [x] ⚙️ Main Server (http://localhost:8181)
- [x] 🔗 MCP Service (localhost:8051)
- [ ] 🤖 Agents Service (http://localhost:8052)
- [x] 💾 Supabase Database (connected)