Clarify source_id naming and generation strategy differences between URLs and files
Problem
The codebase uses source_id for both URL-based crawls and file uploads, but with fundamentally different generation strategies and purposes:
Current Implementation
URL source_id:
- Location:
python/src/server/services/crawling/helpers/url_handler.py:186-240 - Method: Deterministic SHA256 hash of canonical URL (16 chars)
- Purpose: Deduplication - same URL always generates same ID
- Example:
https://example.com/page→a3f5b8c9d2e1f0a7
File source_id:
- Location:
python/src/server/api_routes/knowledge_api.py:599 - Method: Random UUID suffix (8 chars) - previously used timestamp
- Purpose: Versioning - allow multiple uploads of same file
- Example:
document.pdf→file_document_pdf_6a1d3948
This naming overlap creates conceptual confusion since they serve opposite purposes (deduplication vs versioning).
Impact
- Developer confusion: Same field name suggests similar behavior
- Maintenance risk: Future developers might incorrectly assume unified behavior
- API inconsistency: Different ID patterns for same field
Solution Options
Option 1: Documentation Only (Pragmatic)
Effort: Low | Risk: None | Breaking: No
Add clear comments explaining the different strategies:
# URL sources: Deterministic hash for deduplication
# Same URL → Same ID (prevents duplicate crawls)
source_id = UrlHandler.generate_unique_source_id(url)
# File sources: Random UUID for versioning
# Same file → Different IDs (allows re-uploads)
source_id = f"file_{filename}_{uuid.uuid4().hex[:8]}"
Pros:
- Zero breaking changes
- Immediate clarity improvement
- No migration needed
Cons:
- Doesn't fix underlying naming confusion
- Relies on developers reading comments
Option 2: Add Metadata Fields (Recommended)
Effort: Low-Medium | Risk: Low | Breaking: No
Leverage existing type field and add clarifying metadata:
# In sources table, clarify with metadata
{
"source_id": "...",
"type": "crawled", # or "uploaded"
"id_strategy": "hash", # or "uuid"
"allows_duplicates": false # or true
}
Pros:
- Makes behavior explicit in data
- No schema changes needed
- Backward compatible
Cons:
- Still uses same field name
- Requires updating insert logic
Option 3: Separate ID Fields (Clean Architecture)
Effort: High | Risk: Medium | Breaking: Yes
Rename fields to match their purpose:
-- Sources table
content_id VARCHAR(255) PRIMARY KEY, -- Unique identifier
content_hash VARCHAR(64), -- For deduplication (nullable)
source_url TEXT, -- Original URL (for crawled)
source_path TEXT, -- Original path (for uploaded)
Pros:
- Clear semantic separation
- Self-documenting schema
- Enables hybrid strategies
Cons:
- Database migration required
- API contract changes
- Frontend updates needed
Option 4: Unified Strategy with Flags (Future-Proof)
Effort: Medium | Risk: Low | Breaking: Potentially
Use consistent ID generation with explicit control:
def generate_source_id(content: str, strategy: Literal["deterministic", "unique"]):
if strategy == "deterministic":
return hashlib.sha256(content.encode()).hexdigest()[:16]
else:
return f"{content_prefix}_{uuid.uuid4().hex[:8]}"
Pros:
- Consistent interface
- Explicit strategy selection
- Easier to test and reason about
Cons:
- Requires refactoring existing code
- May need migration for consistency
Recommendation
Short term (now): Option 1 - Add documentation Medium term (next sprint): Option 2 - Add metadata fields Long term (v2): Option 3 - Clean architecture with proper field names
This staged approach provides immediate clarity while planning for proper architectural improvements.
Related Code Locations
- URL ID generation:
python/src/server/services/crawling/helpers/url_handler.py:186-240 - File ID generation:
python/src/server/api_routes/knowledge_api.py:599 - Sources table:
sourcesin Supabase - Frontend usage:
archon-ui-main/src/services/knowledgeBaseService.ts
Additional Context
This issue was discovered while fixing a collision bug in file uploads (commit 1ed2d88) where timestamp-based IDs were replaced with UUIDs. The fix highlighted the conceptual difference between the two ID strategies.