Clarify source_id naming and generation strategy differences between URLs and files

Open Wirasm opened this issue 3 months ago • 0 comments

Problem

The codebase uses source_id for both URL-based crawls and file uploads, but the two cases generate it with fundamentally different strategies and for different purposes.

Current Implementation

URL source_id:

Location: python/src/server/services/crawling/helpers/url_handler.py:186-240
Method: Deterministic SHA-256 hash of the canonical URL, truncated to 16 hex characters
Purpose: Deduplication - same URL always generates same ID
Example: https://example.com/page → a3f5b8c9d2e1f0a7

File source_id:

Location: python/src/server/api_routes/knowledge_api.py:599
Method: Random UUID suffix (8 chars) - previously used timestamp
Purpose: Versioning - allow multiple uploads of same file
Example: document.pdf → file_document_pdf_6a1d3948

Sharing one field name creates conceptual confusion, because the two strategies serve opposite purposes: URL IDs exist to deduplicate, while file IDs exist to version.
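The contrast between the two strategies can be sketched in a few lines. This is an illustration only, not the code in url_handler.py or knowledge_api.py: the function names, the canonicalization rules, and the filename sanitization are all assumptions made for the example.

```python
import hashlib
import re
import uuid
from urllib.parse import urlsplit, urlunsplit


def url_source_id(url: str) -> str:
    """Deterministic: the same canonicalized URL always yields the same ID."""
    parts = urlsplit(url)
    # Hypothetical canonicalization: lowercase scheme and host,
    # drop the fragment and any trailing slash.
    canonical = urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/"),
        parts.query,
        "",
    ))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def file_source_id(filename: str) -> str:
    """Random: each upload gets a fresh ID, even for the same filename."""
    # Hypothetical sanitization so "document.pdf" becomes "document_pdf".
    safe_name = re.sub(r"[^A-Za-z0-9]+", "_", filename).strip("_")
    return f"file_{safe_name}_{uuid.uuid4().hex[:8]}"


# Same URL, same ID (deduplication).
assert url_source_id("https://example.com/page") == url_source_id("https://EXAMPLE.com/page/")
# Same file, different IDs (versioning).
assert file_source_id("document.pdf") != file_source_id("document.pdf")
```

Both functions return a string that ends up in the same source_id field, which is exactly why a reader cannot tell from the field alone whether a repeated input will collide or not.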

Impact

Developer confusion: Same field name suggests similar behavior
Maintenance risk: Future developers might incorrectly assume unified behavior
API inconsistency: Different ID patterns for same field

Solution Options

Option 1: Documentation Only (Pragmatic)

Effort: Low | Risk: None | Breaking: No

Add clear comments explaining the different strategies:

# URL sources: Deterministic hash for deduplication
# Same URL → Same ID (prevents duplicate crawls)
source_id = UrlHandler.generate_unique_source_id(url)

# File sources: Random UUID for versioning  
# Same file → Different IDs (allows re-uploads)
source_id = f"file_{filename}_{uuid.uuid4().hex[:8]}"

Pros:

Zero breaking changes
Immediate clarity improvement
No migration needed

Cons:

Doesn't fix underlying naming confusion
Relies on developers reading comments

Option 2: Add Metadata Fields (Recommended)

Effort: Low-Medium | Risk: Low | Breaking: No

Leverage existing type field and add clarifying metadata:

# In the sources table, make the ID strategy explicit in metadata
{
    "source_id": "...",
    "type": "crawled",           # or "uploaded"
    "id_strategy": "hash",       # or "uuid"
    "allows_duplicates": False,  # or True
}

Pros:

Makes behavior explicit in data
No schema changes needed
Backward compatible

Cons:

Still uses same field name
Requires updating insert logic
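One way the insert logic could populate these fields is to derive them from the source type in a single helper, so the two strategies cannot drift apart. This is a minimal sketch under assumptions: SourceMetadata and build_source_metadata are hypothetical names, and the field names simply mirror the example above.

```python
from typing import Literal, TypedDict

SourceType = Literal["crawled", "uploaded"]


class SourceMetadata(TypedDict):
    source_id: str
    type: SourceType
    id_strategy: Literal["hash", "uuid"]
    allows_duplicates: bool


def build_source_metadata(source_id: str, source_type: SourceType) -> SourceMetadata:
    """Derive the ID-strategy fields from the source type in one place."""
    is_upload = source_type == "uploaded"
    return {
        "source_id": source_id,
        "type": source_type,
        # Uploads use random suffixes; crawls use a deterministic hash.
        "id_strategy": "uuid" if is_upload else "hash",
        "allows_duplicates": is_upload,
    }
```

Calling build_source_metadata("file_document_pdf_6a1d3948", "uploaded") would yield id_strategy "uuid" and allows_duplicates True, while a crawled source gets "hash" and False.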

Option 3: Separate ID Fields (Clean Architecture)

Effort: High | Risk: Medium | Breaking: Yes

Rename fields to match their purpose:

-- Sources table (only the ID-related columns are shown)
CREATE TABLE sources (
    content_id VARCHAR(255) PRIMARY KEY,  -- Unique identifier
    content_hash VARCHAR(64),             -- For deduplication (nullable)
    source_url TEXT,                      -- Original URL (for crawled)
    source_path TEXT                      -- Original path (for uploaded)
);

Pros:

Clear semantic separation
Self-documenting schema
Enables hybrid strategies

Cons:

Database migration required
API contract changes
Frontend updates needed
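To illustrate what a hybrid strategy might look like under this schema, the sketch below gives every upload a unique content_id while also storing a content_hash, so identical content can still be detected. SourceRecord and record_for_upload are hypothetical names, and hashing the raw file bytes is an assumption rather than anything the current code does.

```python
import hashlib
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRecord:
    content_id: str                    # always unique
    content_hash: Optional[str]        # set when deduplication applies
    source_url: Optional[str] = None   # original URL (for crawled)
    source_path: Optional[str] = None  # original path (for uploaded)


def record_for_upload(path: str, data: bytes) -> SourceRecord:
    """Hybrid: a unique ID per upload, plus a hash to detect identical content."""
    return SourceRecord(
        content_id=uuid.uuid4().hex,
        # A full SHA-256 hex digest is 64 characters, matching VARCHAR(64).
        content_hash=hashlib.sha256(data).hexdigest(),
        source_path=path,
    )
```

Two uploads of the same bytes would then differ in content_id but share content_hash, which keeps versioning and deduplication as separate, queryable concerns.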

Option 4: Unified Strategy with Flags (Future-Proof)

Effort: Medium | Risk: Low | Breaking: Potentially

Use consistent ID generation with explicit control:

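import hashlib
import uuid
from typing import Literal

def generate_source_id(content: str, strategy: Literal["deterministic", "unique"], prefix: str = "file") -> str:
    # prefix is a hypothetical parameter that labels unique IDs, e.g. "file"
    if strategy == "deterministic":
        return hashlib.sha256(content.encode()).hexdigest()[:16]
    return f"{prefix}_{uuid.uuid4().hex[:8]}"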

Pros:

Consistent interface
Explicit strategy selection
Easier to test and reason about

Cons:

Requires refactoring existing code
May need migration for consistency

Recommendation

Short term (now): Option 1 - Add documentation
Medium term (next sprint): Option 2 - Add metadata fields
Long term (v2): Option 3 - Clean architecture with proper field names

This staged approach provides immediate clarity while planning for proper architectural improvements.

Related Code Locations

URL ID generation: python/src/server/services/crawling/helpers/url_handler.py:186-240
File ID generation: python/src/server/api_routes/knowledge_api.py:599
Sources table: sources in Supabase
Frontend usage: archon-ui-main/src/services/knowledgeBaseService.ts

Additional Context

This issue was discovered while fixing a collision bug in file uploads (commit 1ed2d88) where timestamp-based IDs were replaced with UUIDs. The fix highlighted the conceptual difference between the two ID strategies.

Aug 29 '25 17:08 Wirasm