Feature/v2/filehistory

Open ConnorYoh opened this issue 3 months ago • 1 comments

Stirling PDF File History Specification

Overview

Stirling PDF tracks file history by embedding metadata directly in each PDF document's Keywords field. The metadata records the tool operations applied, the version number, and the file's lineage through the processing pipeline.

PDF Metadata Format

Storage Mechanism

File history is stored as a single entry in the PDF Keywords field: a JSON string prefixed with stirling-history:. Any user-supplied keywords are kept alongside it.

Metadata Structure

interface PDFHistoryMetadata {
  stirlingHistory: {
    originalFileId: string;        // UUID of the root file in the version chain
    parentFileId?: string;         // UUID of the immediate parent file  
    versionNumber: number;         // Version number (1, 2, 3, etc.)
    toolChain: ToolOperation[];    // Array of applied tool operations
    formatVersion: '1.0';          // Metadata format version
  };
}

interface ToolOperation {
  toolName: string;                // Tool identifier (e.g., 'compress', 'sanitize')
  timestamp: number;               // When the tool was applied (Unix epoch milliseconds, as in the example below)
  parameters?: Record<string, any>; // Tool-specific parameters (optional)
}
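
The storage format above can be sketched as a pair of helpers that write and read the history entry. This is a minimal sketch, not the project's implementation: encodeHistoryKeyword and decodeHistoryKeyword are hypothetical names, and returning undefined for a missing or malformed entry is an assumption about how errors might be handled.

```typescript
interface ToolOperation {
  toolName: string;
  timestamp: number;
  parameters?: Record<string, unknown>;
}

interface PDFHistoryMetadata {
  stirlingHistory: {
    originalFileId: string;
    parentFileId?: string;
    versionNumber: number;
    toolChain: ToolOperation[];
    formatVersion: '1.0';
  };
}

// Prefix taken from the specification.
const HISTORY_PREFIX = 'stirling-history:';

// Hypothetical helper: serialize the metadata into one keyword entry.
function encodeHistoryKeyword(meta: PDFHistoryMetadata): string {
  return HISTORY_PREFIX + JSON.stringify(meta);
}

// Hypothetical helper: find the history entry among the keywords and parse it.
// Returns undefined when no entry exists or the entry is malformed (assumed behavior).
function decodeHistoryKeyword(keywords: string[]): PDFHistoryMetadata | undefined {
  const entry = keywords.find((k) => k.startsWith(HISTORY_PREFIX));
  if (entry === undefined) return undefined;
  try {
    const parsed = JSON.parse(entry.slice(HISTORY_PREFIX.length));
    const h = parsed?.stirlingHistory;
    const valid =
      typeof h?.originalFileId === 'string' &&
      typeof h?.versionNumber === 'number' &&
      Array.isArray(h?.toolChain);
    return valid ? (parsed as PDFHistoryMetadata) : undefined;
  } catch {
    return undefined; // JSON.parse threw: treat as "no history"
  }
}
```

Validating only a few fields keeps the sketch short; a fuller reader would probably also check formatVersion before trusting the rest of the structure.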

Standard PDF Metadata Fields Used

The system uses the standard PDF document information fields:

  • Creator: Set to "Stirling-PDF" (identifies the application)
  • Producer: Set to "Stirling-PDF" (identifies the PDF library/processor)
  • Title, Author, Subject, CreationDate: Automatically preserved by pdf-lib during processing
  • Keywords: Extended with the Stirling history entry; existing user keywords are preserved
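
Preserving user keywords while updating the history entry might look like the sketch below. mergeHistoryKeyword is a hypothetical name, and dropping any earlier stirling-history: entry so that only one remains is an assumption; the specification only shows a single entry in its example.

```typescript
// Prefix taken from the specification.
const HISTORY_PREFIX = 'stirling-history:';

// Hypothetical helper: keep every user keyword, drop any previous history
// entry (assumed behavior), and append the new one at the end.
function mergeHistoryKeyword(existing: string[], historyJson: string): string[] {
  const userKeywords = existing.filter((k) => !k.startsWith(HISTORY_PREFIX));
  return [...userKeywords, HISTORY_PREFIX + historyJson];
}
```

The function works on plain string arrays, so it stays independent of whichever PDF library reads and writes the Keywords field.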

Date Handling Strategy:

  • PDF CreationDate: Preserved automatically (document creation date)
  • File.lastModified: Source of truth for "when file was last changed" (original upload time or tool processing time)
  • No duplication: The UI displays a single timestamp, always taken from File.lastModified

Example PDF Document Information

PDF Document Info:
  Title: "User Document Title" (preserved from original)
  Author: "Document Author" (preserved from original)
  Creator: "Stirling-PDF"
  Producer: "Stirling-PDF"  
  CreationDate: "2025-01-01T10:30:00Z" (preserved from original)
  Keywords: ["user-keyword", "stirling-history:{\"stirlingHistory\":{\"originalFileId\":\"abc123\",\"versionNumber\":2,\"toolChain\":[{\"toolName\":\"compress\",\"timestamp\":1756825614618},{\"toolName\":\"sanitize\",\"timestamp\":1756825631545}],\"formatVersion\":\"1.0\"}}"]

File System:
  lastModified: 1756825631545 (tool processing time - source of truth for "when file was last changed")

Version Numbering System

Version Progression

  • v0: Original uploaded file (no Stirling PDF processing)
  • v1: First tool applied to original file
  • v2: Second tool applied (inherits from v1)
  • v3: Third tool applied (inherits from v2)
  • etc.

Version Relationships

document.pdf (v0) 
    ↓ compress
document.pdf (v1: compress)
    ↓ sanitize  
document.pdf (v2: compress → sanitize)
    ↓ ocr
document.pdf (v3: compress → sanitize → ocr)
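
The progression above can be sketched as a function that derives a child's history from its parent's. appendOperation and the StirlingHistory type name are hypothetical, and using the v0 file's ID as originalFileId when the first tool runs is an assumption based on the "root file in the version chain" description.

```typescript
interface ToolOperation {
  toolName: string;
  timestamp: number;
  parameters?: Record<string, unknown>;
}

interface StirlingHistory {
  originalFileId: string;
  parentFileId?: string;
  versionNumber: number;
  toolChain: ToolOperation[];
  formatVersion: '1.0';
}

// Hypothetical helper: build the history for the file produced by applying
// `op` to the file identified by `parentFileId`. `parent` is undefined when
// the input is an unprocessed v0 upload that carries no history yet.
function appendOperation(
  parent: StirlingHistory | undefined,
  parentFileId: string,
  op: ToolOperation,
): StirlingHistory {
  return {
    // Assumption: a v0 parent is itself the root of the chain.
    originalFileId: parent?.originalFileId ?? parentFileId,
    parentFileId,
    versionNumber: (parent?.versionNumber ?? 0) + 1,
    // Copy rather than mutate so the parent's history stays intact.
    toolChain: [...(parent?.toolChain ?? []), op],
    formatVersion: '1.0',
  };
}
```

Applying it three times (compress, sanitize, ocr) starting from an undefined parent yields versionNumber 3 with all three tools in toolChain, matching the diagram.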

File Lineage Tracking

Original File ID

The originalFileId remains constant throughout the entire version chain, enabling grouping of all versions of the same logical document.

Parent-Child Relationships

Each processed file references its immediate parent via parentFileId, so the chain can be walked back one version at a time to the original file.

Tool Chain

The toolChain array maintains the complete sequence of tool operations applied to reach the current version.
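
The lineage fields lend themselves to two lookups: grouping all versions of a document and walking a file's ancestry. The following is a rough sketch under stated assumptions: FileRecord, its fileId field, groupByOriginal, and lineage are all hypothetical, and it assumes the application keeps a record for the v0 file whose originalFileId is its own ID.

```typescript
// Hypothetical in-memory record; `fileId` is not part of the PDF metadata.
interface FileRecord {
  fileId: string;
  originalFileId: string;
  parentFileId?: string;
  versionNumber: number;
}

// Group every version of the same logical document, ordered by version.
function groupByOriginal(files: FileRecord[]): Map<string, FileRecord[]> {
  const groups = new Map<string, FileRecord[]>();
  for (const f of files) {
    const group = groups.get(f.originalFileId) ?? [];
    group.push(f);
    groups.set(f.originalFileId, group);
  }
  groups.forEach((group) => group.sort((a, b) => a.versionNumber - b.versionNumber));
  return groups;
}

// Walk parentFileId links back to the root; result is ordered oldest first.
// The `seen` set guards against a corrupted chain that loops on itself.
function lineage(file: FileRecord, byId: Map<string, FileRecord>): FileRecord[] {
  const chain: FileRecord[] = [];
  const seen = new Set<string>();
  let current: FileRecord | undefined = file;
  while (current !== undefined && !seen.has(current.fileId)) {
    seen.add(current.fileId);
    chain.unshift(current);
    current = current.parentFileId !== undefined ? byId.get(current.parentFileId) : undefined;
  }
  return chain;
}
```

The walk stops early if a parent record is missing from byId, so a partial chain is returned rather than an error.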

ConnorYoh • Sep 03 '25 16:09