Clarify source_id naming and generation strategy differences between URLs and files

Open Wirasm opened this issue 3 months ago • 0 comments

Problem

The codebase uses source_id for both URL-based crawls and file uploads, but with fundamentally different generation strategies and purposes:

Current Implementation

URL source_id:

  • Location: python/src/server/services/crawling/helpers/url_handler.py:186-240
  • Method: Deterministic SHA256 hash of canonical URL (16 chars)
  • Purpose: Deduplication - same URL always generates same ID
  • Example: https://example.com/page → a3f5b8c9d2e1f0a7
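
The deterministic strategy can be sketched roughly as follows. This is an illustration only, not the code in url_handler.py: the `canonicalize` and `url_source_id` helpers are hypothetical names, and the normalization rules (lowercased host, dropped fragment, trimmed trailing slash) are assumptions made for the example.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Hypothetical normalization; the real canonicalization rules may differ."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, keep the query string, drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def url_source_id(url: str) -> str:
    """Return the first 16 hex chars of the SHA256 of the canonical URL."""
    digest = hashlib.sha256(canonicalize(url).encode("utf-8")).hexdigest()
    return digest[:16]
```

Under these assumed rules, `https://example.com/page` and `HTTPS://Example.com/page/#top` map to the same ID, which is the property that prevents duplicate crawls.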

File source_id:

  • Location: python/src/server/api_routes/knowledge_api.py:599
  • Method: Random UUID suffix (8 chars) - previously used timestamp
  • Purpose: Versioning - allow multiple uploads of same file
  • Example: document.pdf → file_document_pdf_6a1d3948
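
For contrast, the file strategy might look like the sketch below. The helper name `file_source_id` is hypothetical, and the filename sanitization is an assumption inferred from the example ID above (`document.pdf` becoming `document_pdf`); the actual logic in knowledge_api.py may differ.

```python
import re
import uuid


def file_source_id(filename: str) -> str:
    """Build a file source_id with a random 8-char suffix (hypothetical helper)."""
    # Assumed sanitization: collapse runs of non-alphanumeric characters to "_".
    safe_name = re.sub(r"[^A-Za-z0-9]+", "_", filename).strip("_")
    return f"file_{safe_name}_{uuid.uuid4().hex[:8]}"


first = file_source_id("document.pdf")
second = file_source_id("document.pdf")
print(first, second)  # same prefix, different random suffixes
```

Uploading the same file twice yields two distinct IDs, so both versions can coexist.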

Sharing one field name is confusing because the two strategies serve opposite purposes: URL IDs exist to deduplicate, while file IDs exist to keep versions apart.

Impact

  • Developer confusion: Same field name suggests similar behavior
  • Maintenance risk: Future developers might incorrectly assume unified behavior
  • API inconsistency: Different ID patterns for same field

Solution Options

Option 1: Documentation Only (Pragmatic)

Effort: Low | Risk: None | Breaking: No

Add clear comments explaining the different strategies:

# URL sources: Deterministic hash for deduplication
# Same URL → Same ID (prevents duplicate crawls)
source_id = UrlHandler.generate_unique_source_id(url)

# File sources: Random UUID for versioning  
# Same file → Different IDs (allows re-uploads)
source_id = f"file_{filename}_{uuid.uuid4().hex[:8]}"

Pros:

  • Zero breaking changes
  • Immediate clarity improvement
  • No migration needed

Cons:

  • Doesn't fix underlying naming confusion
  • Relies on developers reading comments

Option 2: Add Metadata Fields (Recommended)

Effort: Low-Medium | Risk: Low | Breaking: No

Leverage existing type field and add clarifying metadata:

# In sources table, clarify with metadata
{
  "source_id": "...",
  "type": "crawled",  # or "uploaded"
  "id_strategy": "hash",  # or "uuid"
  "allows_duplicates": false  # or true
}
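
One way the insert logic could populate these fields is sketched below. The `SourceIdMetadata` type and `describe_source_id` function are hypothetical names introduced for illustration; only the field names and values come from the snippet above.

```python
from typing import Literal, TypedDict

SourceType = Literal["crawled", "uploaded"]


class SourceIdMetadata(TypedDict):
    type: SourceType
    id_strategy: Literal["hash", "uuid"]
    allows_duplicates: bool


def describe_source_id(source_type: SourceType) -> SourceIdMetadata:
    """Return the clarifying metadata for a source of the given type."""
    if source_type == "crawled":
        # URL crawls: deterministic hash, duplicates collapse to one ID.
        return {"type": "crawled", "id_strategy": "hash", "allows_duplicates": False}
    if source_type == "uploaded":
        # File uploads: random UUID suffix, re-uploads get new IDs.
        return {"type": "uploaded", "id_strategy": "uuid", "allows_duplicates": True}
    raise ValueError(f"unknown source type: {source_type!r}")
```

Deriving all three fields from the source type in one place keeps the crawl and upload paths from drifting apart.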

Pros:

  • Makes behavior explicit in data
  • No schema changes needed
  • Backward compatible

Cons:

  • Still uses same field name
  • Requires updating insert logic

Option 3: Separate ID Fields (Clean Architecture)

Effort: High | Risk: Medium | Breaking: Yes

Rename fields to match their purpose:

-- Sources table
CREATE TABLE sources (
    content_id VARCHAR(255) PRIMARY KEY,  -- Unique identifier
    content_hash VARCHAR(64),  -- For deduplication (nullable)
    source_url TEXT,  -- Original URL (for crawled)
    source_path TEXT  -- Original path (for uploaded)
);
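
To illustrate the kind of hybrid strategy this split would allow, here is a minimal sketch in which an upload gets a unique ID and also records a content hash. The `ContentRecord` class and `new_upload` function are hypothetical; only the four field names are taken from the schema above.

```python
import hashlib
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContentRecord:
    content_id: str                     # always unique
    content_hash: Optional[str] = None  # set when deduplication applies
    source_url: Optional[str] = None    # original URL (for crawled)
    source_path: Optional[str] = None   # original path (for uploaded)


def new_upload(path: str, data: bytes) -> ContentRecord:
    """Give each upload its own ID, plus a hash so identical files can be detected."""
    return ContentRecord(
        content_id=uuid.uuid4().hex,
        content_hash=hashlib.sha256(data).hexdigest(),  # 64 hex chars
        source_path=path,
    )
```

Two uploads of the same bytes would then have different `content_id` values but the same `content_hash`, so versioning and duplicate detection no longer compete for a single field.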

Pros:

  • Clear semantic separation
  • Self-documenting schema
  • Enables hybrid strategies

Cons:

  • Database migration required
  • API contract changes
  • Frontend updates needed

Option 4: Unified Strategy with Flags (Future-Proof)

Effort: Medium | Risk: Low | Breaking: Potentially

Use consistent ID generation with explicit control:

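import hashlib
import uuid
from typing import Literal

def generate_source_id(content: str, strategy: Literal["deterministic", "unique"], content_prefix: str = "file") -> str:
    if strategy == "deterministic":
        # Same content always maps to the same 16-char ID
        return hashlib.sha256(content.encode()).hexdigest()[:16]
    # Random 8-char suffix: each call yields a new ID
    return f"{content_prefix}_{uuid.uuid4().hex[:8]}"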

Pros:

  • Consistent interface
  • Explicit strategy selection
  • Easier to test and reason about

Cons:

  • Requires refactoring existing code
  • May need migration for consistency

Recommendation

  • Short term (now): Option 1 - Add documentation
  • Medium term (next sprint): Option 2 - Add metadata fields
  • Long term (v2): Option 3 - Clean architecture with proper field names

This staged approach provides immediate clarity while planning for proper architectural improvements.

Related Code Locations

  • URL ID generation: python/src/server/services/crawling/helpers/url_handler.py:186-240
  • File ID generation: python/src/server/api_routes/knowledge_api.py:599
  • Sources table: sources in Supabase
  • Frontend usage: archon-ui-main/src/services/knowledgeBaseService.ts

Additional Context

This issue was discovered while fixing a collision bug in file uploads (commit 1ed2d88) where timestamp-based IDs were replaced with UUIDs. The fix highlighted the conceptual difference between the two ID strategies.

Wirasm, Aug 29 '25 17:08