dwata icon indicating copy to clipboard operation
dwata copied to clipboard

Fake Data Generator for Email Client Database

Open brainless opened this issue 3 months ago • 0 comments

Fake Data Generator for Email Client Database

Overview

Create a fake data generator to populate a SQLite3 database with realistic email data simulating 4-5 email accounts with thousands of emails and related data points.


Database Location

The SQLite database will be stored in the user's OS configuration directory:

  • Linux/macOS: ~/.config/dwata/db.sqlite3
  • Windows: %APPDATA%\dwata\db.sqlite3

Example path: /home/username/.config/dwata/db.sqlite3

Directory Creation

  • The generator must create the dwata directory if it doesn't exist
  • Use Rust's std::fs::create_dir_all() or dirs crate to get platform-specific config directory
  • Ensure proper error handling if directory creation fails (permissions, disk space, etc.)

Requirements

Data Volume

  • Accounts: 4-5 email accounts (e.g., [email protected], [email protected], etc.)
  • Emails: ~2,000 emails total across all accounts
  • Folders: Standard folders (Inbox, Sent, Drafts, Trash, Archive) + custom folders per account
  • Labels: ~20-30 labels across accounts
  • Contacts: ~200-500 contacts with realistic data
  • Attachments: Metadata only (filename, size, mime-type, hash) - no actual file content

Database Schema

Tables Structure

accounts

CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE,
    display_name TEXT,
    provider TEXT,
    created_at INTEGER
);
  • provider: gmail, yahoo, outlook, etc.
  • created_at: unix timestamp

folders

CREATE TABLE folders (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    folder_type TEXT,
    parent_folder_id INTEGER,
    FOREIGN KEY (account_id) REFERENCES accounts(id),
    FOREIGN KEY (parent_folder_id) REFERENCES folders(id),
    UNIQUE(account_id, name)
);
  • folder_type: inbox, sent, drafts, trash, archive, custom
  • parent_folder_id: nullable, for nested folders

labels

CREATE TABLE labels (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    color TEXT,
    FOREIGN KEY (account_id) REFERENCES accounts(id),
    UNIQUE(account_id, name)
);
  • color: hex color code (e.g., #FF5733)

contacts

CREATE TABLE contacts (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL,
    display_name TEXT,
    first_name TEXT,
    last_name TEXT,
    account_id INTEGER,
    created_at INTEGER,
    updated_at INTEGER,
    FOREIGN KEY (account_id) REFERENCES accounts(id),
    UNIQUE(account_id, email)
);
  • account_id: which account owns this contact

emails

CREATE TABLE emails (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    folder_id INTEGER NOT NULL,
    message_id TEXT UNIQUE,
    subject TEXT NOT NULL,
    from_email TEXT NOT NULL,
    from_name TEXT,
    to_emails TEXT,
    cc_emails TEXT,
    bcc_emails TEXT,
    body_plain TEXT,
    body_html TEXT,
    is_read INTEGER,
    is_starred INTEGER,
    is_draft INTEGER,
    received_at INTEGER,
    sent_at INTEGER,
    created_at INTEGER,
    FOREIGN KEY (account_id) REFERENCES accounts(id),
    FOREIGN KEY (folder_id) REFERENCES folders(id)
);

CREATE INDEX idx_emails_account_folder_received ON emails(account_id, folder_id, received_at);
CREATE INDEX idx_emails_subject ON emails(subject);
  • message_id: RFC 5322 Message-ID format
  • to_emails, cc_emails, bcc_emails: JSON arrays
  • is_read, is_starred, is_draft: boolean (0/1)
  • received_at, sent_at, created_at: unix timestamps

email_labels

CREATE TABLE email_labels (
    email_id INTEGER NOT NULL,
    label_id INTEGER NOT NULL,
    FOREIGN KEY (email_id) REFERENCES emails(id),
    FOREIGN KEY (label_id) REFERENCES labels(id),
    PRIMARY KEY (email_id, label_id)
);
  • Junction table for many-to-many relationship

attachments

CREATE TABLE attachments (
    id INTEGER PRIMARY KEY,
    email_id INTEGER NOT NULL,
    filename TEXT NOT NULL,
    mime_type TEXT,
    size_bytes INTEGER,
    content_hash TEXT,
    created_at INTEGER,
    FOREIGN KEY (email_id) REFERENCES emails(id)
);

CREATE INDEX idx_attachments_email ON attachments(email_id);
  • content_hash: SHA256 hash (simulated)
  • mime_type: e.g., application/pdf, image/jpeg

file_index

CREATE TABLE file_index (
    id INTEGER PRIMARY KEY,
    attachment_id INTEGER UNIQUE,
    indexed_at INTEGER,
    search_tokens TEXT,
    FOREIGN KEY (attachment_id) REFERENCES attachments(id)
);
  • search_tokens: tokenized filename for search functionality
  • Separate indexing for all attachments

Data Generation Strategy

Realistic Data Patterns

  • Temporal Distribution: Emails spread over last 6-12 months

    • More recent emails (last 30 days: ~30% of emails)
    • Gradual decrease in older periods
  • Email Distribution per Account:

    • Accounts should have varying activity levels
    • Some accounts busier than others (realistic usage)
  • Folder Distribution:

    • Inbox: 40-50% of emails
    • Sent: 20-30%
    • Other folders: remaining
  • Conversation Threads:

    • Group related emails using subject patterns (Re: Fwd:)
    • Simulate email chains with reply timestamps
  • Attachments:

    • ~30% of emails have attachments
    • 1-3 attachments per email (when present)
    • Realistic file types: .pdf, .docx, .xlsx, .jpg, .png, .zip
    • File sizes: 10KB - 10MB range
  • Labels:

    • 2-5 labels per email (for ~40% of emails)
    • Common labels: Work, Personal, Important, Follow-up, etc.

Data Sources

  • Use Rust faker libraries (e.g., fake crate) for:

    • Names (first, last, full)
    • Email addresses
    • Email subjects (sentences, phrases)
    • Email bodies (paragraphs)
    • Companies/organizations
    • File names
  • Generate realistic patterns:

    • Business hours for sent emails (9 AM - 6 PM weighted)
    • Weekend vs weekday patterns
    • Common email subject patterns

Implementation Requirements

Code Structure

src/
├── db/
│   ├── mod.rs
│   ├── schema.rs      # SQL schema definitions and table creation
│   ├── models.rs      # Rust structs for each entity
│   └── fake_data.rs   # Data generation logic
└── main.rs            # CLI to run generator

Database Path Helper

use std::path::PathBuf;

fn get_db_path() -> Result<PathBuf, Box<dyn std::error::Error>> {
    // Use dirs crate or similar for cross-platform config dir
    let config_dir = dirs::config_dir()
        .ok_or("Could not determine config directory")?;
    
    let dwata_dir = config_dir.join("dwata");
    
    // Create directory if it doesn't exist
    std::fs::create_dir_all(&dwata_dir)?;
    
    Ok(dwata_dir.join("db.sqlite3"))
}

Rust Structs (models.rs)

pub struct Account {
    pub id: Option<i64>,
    pub email: String,
    pub display_name: String,
    pub provider: String,
    pub created_at: i64,
}

pub struct Email {
    pub id: Option<i64>,
    pub account_id: i64,
    pub folder_id: i64,
    pub message_id: String,
    pub subject: String,
    pub from_email: String,
    pub from_name: Option<String>,
    pub to_emails: Vec<String>,
    pub cc_emails: Vec<String>,
    pub bcc_emails: Vec<String>,
    pub body_plain: String,
    pub body_html: Option<String>,
    pub is_read: bool,
    pub is_starred: bool,
    pub is_draft: bool,
    pub received_at: i64,
    pub sent_at: Option<i64>,
    pub created_at: i64,
}

// Similar structs for Folder, Label, Contact, Attachment, etc.

Generator Features

  • CLI interface to run data generation

    cargo run --bin fake-data-generator -- --accounts 5 --emails 2000
    
  • Configurable parameters:

    • Number of accounts
    • Number of emails per account
    • Date range for emails
    • Custom database path (optional, defaults to config dir)
  • Idempotent: Can safely re-run (drops and recreates tables)

  • Progress reporting:

    • Show progress during generation
    • Display database path being used
    • Report statistics when complete (total emails, attachments, etc.)

Database Setup

  • Use rusqlite crate for SQLite operations
  • Create all tables with proper schema
  • Enable foreign key constraints
  • Create indexes after data insertion (for performance)

Deliverables

  • [ ] Complete SQL schema implementation
  • [ ] Rust data models (structs) for all entities
  • [ ] Platform-specific config directory path handling
  • [ ] Directory creation logic with error handling
  • [ ] Fake data generator with realistic patterns
  • [ ] CLI tool to run generator with configurable options
  • [ ] Generated SQLite database file at ~/.config/dwata/db.sqlite3
  • [ ] README documentation:
    • How to run the generator
    • Where database is stored
    • Database schema documentation
    • Sample queries to verify data

Testing & Verification

Sample Queries

-- Verify account count
SELECT COUNT(*) FROM accounts;

-- Verify email distribution by account
SELECT a.email, COUNT(e.id) as email_count 
FROM accounts a 
LEFT JOIN emails e ON a.id = e.account_id 
GROUP BY a.id;

-- Verify attachments
SELECT COUNT(*) FROM attachments;

-- Check email date range
SELECT 
    datetime(MIN(received_at), 'unixepoch') as earliest,
    datetime(MAX(received_at), 'unixepoch') as latest 
FROM emails;

-- Verify labels
SELECT l.name, COUNT(el.email_id) as email_count 
FROM labels l 
LEFT JOIN email_labels el ON l.id = el.label_id 
GROUP BY l.id 
ORDER BY email_count DESC 
LIMIT 10;

Success Criteria

  • [ ] Config directory created at correct platform-specific location
  • [ ] Database file created at ~/.config/dwata/db.sqlite3 (Linux/macOS)
  • [ ] Database created with all tables and constraints
  • [ ] 4-5 accounts with realistic email addresses
  • [ ] ~2,000 emails with proper temporal distribution
  • [ ] 200-500 contacts
  • [ ] ~600 attachments (30% of emails)
  • [ ] All foreign key relationships valid
  • [ ] Indexes created for performance
  • [ ] Generator completes in <30 seconds
  • [ ] Database file size reasonable (<200MB)
  • [ ] Proper error handling for directory/file creation failures

Technical Stack

  • Language: Rust (latest stable)
  • Database: SQLite3
  • Crates:
    • rusqlite - SQLite interface
    • fake or faker_rand - Fake data generation
    • chrono - Date/time handling
    • serde_json - JSON serialization for email arrays
    • rand - Random number generation
    • clap - CLI argument parsing
    • dirs - Platform-specific directory paths

brainless avatar Oct 11 '25 03:10 brainless