dwata
dwata copied to clipboard
Fake Data Generator for Email Client Database
Fake Data Generator for Email Client Database
Overview
Create a fake data generator to populate a SQLite3 database with realistic email data simulating 4-5 email accounts with thousands of emails and related data points.
Database Location
The SQLite database will be stored in the user's OS configuration directory:
- Linux/macOS:
~/.config/dwata/db.sqlite3 - Windows:
%APPDATA%\dwata\db.sqlite3
Example path: /home/username/.config/dwata/db.sqlite3
Directory Creation
- The generator must create the
dwatadirectory if it doesn't exist - Use Rust's
std::fs::create_dir_all()ordirscrate to get platform-specific config directory - Ensure proper error handling if directory creation fails (permissions, disk space, etc.)
Requirements
Data Volume
- Accounts: 4-5 email accounts (e.g.,
[email protected],[email protected], etc.) - Emails: ~2,000 emails total across all accounts
- Folders: Standard folders (Inbox, Sent, Drafts, Trash, Archive) + custom folders per account
- Labels: ~20-30 labels across accounts
- Contacts: ~200-500 contacts with realistic data
- Attachments: Metadata only (filename, size, mime-type, hash) - no actual file content
Database Schema
Tables Structure
accounts
CREATE TABLE accounts (
id INTEGER PRIMARY KEY,
email TEXT NOT NULL UNIQUE,
display_name TEXT,
provider TEXT,
created_at INTEGER
);
provider: gmail, yahoo, outlook, etc.created_at: unix timestamp
folders
CREATE TABLE folders (
id INTEGER PRIMARY KEY,
account_id INTEGER NOT NULL,
name TEXT NOT NULL,
folder_type TEXT,
parent_folder_id INTEGER,
FOREIGN KEY (account_id) REFERENCES accounts(id),
FOREIGN KEY (parent_folder_id) REFERENCES folders(id),
UNIQUE(account_id, name)
);
folder_type: inbox, sent, drafts, trash, archive, customparent_folder_id: nullable, for nested folders
labels
CREATE TABLE labels (
id INTEGER PRIMARY KEY,
account_id INTEGER NOT NULL,
name TEXT NOT NULL,
color TEXT,
FOREIGN KEY (account_id) REFERENCES accounts(id),
UNIQUE(account_id, name)
);
color: hex color code (e.g., #FF5733)
contacts
CREATE TABLE contacts (
id INTEGER PRIMARY KEY,
email TEXT NOT NULL,
display_name TEXT,
first_name TEXT,
last_name TEXT,
account_id INTEGER,
created_at INTEGER,
updated_at INTEGER,
FOREIGN KEY (account_id) REFERENCES accounts(id),
UNIQUE(account_id, email)
);
account_id: which account owns this contact
emails
CREATE TABLE emails (
id INTEGER PRIMARY KEY,
account_id INTEGER NOT NULL,
folder_id INTEGER NOT NULL,
message_id TEXT UNIQUE,
subject TEXT NOT NULL,
from_email TEXT NOT NULL,
from_name TEXT,
to_emails TEXT,
cc_emails TEXT,
bcc_emails TEXT,
body_plain TEXT,
body_html TEXT,
is_read INTEGER,
is_starred INTEGER,
is_draft INTEGER,
received_at INTEGER,
sent_at INTEGER,
created_at INTEGER,
FOREIGN KEY (account_id) REFERENCES accounts(id),
FOREIGN KEY (folder_id) REFERENCES folders(id)
);
CREATE INDEX idx_emails_account_folder_received ON emails(account_id, folder_id, received_at);
CREATE INDEX idx_emails_subject ON emails(subject);
message_id: RFC 5322 Message-ID formatto_emails,cc_emails,bcc_emails: JSON arraysis_read,is_starred,is_draft: boolean (0/1)received_at,sent_at,created_at: unix timestamps
email_labels
CREATE TABLE email_labels (
email_id INTEGER NOT NULL,
label_id INTEGER NOT NULL,
FOREIGN KEY (email_id) REFERENCES emails(id),
FOREIGN KEY (label_id) REFERENCES labels(id),
PRIMARY KEY (email_id, label_id)
);
- Junction table for many-to-many relationship
attachments
CREATE TABLE attachments (
id INTEGER PRIMARY KEY,
email_id INTEGER NOT NULL,
filename TEXT NOT NULL,
mime_type TEXT,
size_bytes INTEGER,
content_hash TEXT,
created_at INTEGER,
FOREIGN KEY (email_id) REFERENCES emails(id)
);
CREATE INDEX idx_attachments_email ON attachments(email_id);
content_hash: SHA256 hash (simulated)mime_type: e.g., application/pdf, image/jpeg
file_index
CREATE TABLE file_index (
id INTEGER PRIMARY KEY,
attachment_id INTEGER UNIQUE,
indexed_at INTEGER,
search_tokens TEXT,
FOREIGN KEY (attachment_id) REFERENCES attachments(id)
);
search_tokens: tokenized filename for search functionality- Separate indexing for all attachments
Data Generation Strategy
Realistic Data Patterns
-
Temporal Distribution: Emails spread over last 6-12 months
- More recent emails (last 30 days: ~30% of emails)
- Gradual decrease in older periods
-
Email Distribution per Account:
- Accounts should have varying activity levels
- Some accounts busier than others (realistic usage)
-
Folder Distribution:
- Inbox: 40-50% of emails
- Sent: 20-30%
- Other folders: remaining
-
Conversation Threads:
- Group related emails using subject patterns (Re: Fwd:)
- Simulate email chains with reply timestamps
-
Attachments:
- ~30% of emails have attachments
- 1-3 attachments per email (when present)
- Realistic file types: .pdf, .docx, .xlsx, .jpg, .png, .zip
- File sizes: 10KB - 10MB range
-
Labels:
- 2-5 labels per email (for ~40% of emails)
- Common labels: Work, Personal, Important, Follow-up, etc.
Data Sources
-
Use Rust faker libraries (e.g.,
fakecrate) for:- Names (first, last, full)
- Email addresses
- Email subjects (sentences, phrases)
- Email bodies (paragraphs)
- Companies/organizations
- File names
-
Generate realistic patterns:
- Business hours for sent emails (9 AM - 6 PM weighted)
- Weekend vs weekday patterns
- Common email subject patterns
Implementation Requirements
Code Structure
src/
├── db/
│ ├── mod.rs
│ ├── schema.rs # SQL schema definitions and table creation
│ ├── models.rs # Rust structs for each entity
│ └── fake_data.rs # Data generation logic
└── main.rs # CLI to run generator
Database Path Helper
use std::path::PathBuf;
fn get_db_path() -> Result<PathBuf, Box<dyn std::error::Error>> {
// Use dirs crate or similar for cross-platform config dir
let config_dir = dirs::config_dir()
.ok_or("Could not determine config directory")?;
let dwata_dir = config_dir.join("dwata");
// Create directory if it doesn't exist
std::fs::create_dir_all(&dwata_dir)?;
Ok(dwata_dir.join("db.sqlite3"))
}
Rust Structs (models.rs)
pub struct Account {
pub id: Option<i64>,
pub email: String,
pub display_name: String,
pub provider: String,
pub created_at: i64,
}
pub struct Email {
pub id: Option<i64>,
pub account_id: i64,
pub folder_id: i64,
pub message_id: String,
pub subject: String,
pub from_email: String,
pub from_name: Option<String>,
pub to_emails: Vec<String>,
pub cc_emails: Vec<String>,
pub bcc_emails: Vec<String>,
pub body_plain: String,
pub body_html: Option<String>,
pub is_read: bool,
pub is_starred: bool,
pub is_draft: bool,
pub received_at: i64,
pub sent_at: Option<i64>,
pub created_at: i64,
}
// Similar structs for Folder, Label, Contact, Attachment, etc.
Generator Features
-
CLI interface to run data generation
cargo run --bin fake-data-generator -- --accounts 5 --emails 2000 -
Configurable parameters:
- Number of accounts
- Number of emails per account
- Date range for emails
- Custom database path (optional, defaults to config dir)
-
Idempotent: Can safely re-run (drops and recreates tables)
-
Progress reporting:
- Show progress during generation
- Display database path being used
- Report statistics when complete (total emails, attachments, etc.)
Database Setup
- Use
rusqlitecrate for SQLite operations - Create all tables with proper schema
- Enable foreign key constraints
- Create indexes after data insertion (for performance)
Deliverables
- [ ] Complete SQL schema implementation
- [ ] Rust data models (structs) for all entities
- [ ] Platform-specific config directory path handling
- [ ] Directory creation logic with error handling
- [ ] Fake data generator with realistic patterns
- [ ] CLI tool to run generator with configurable options
- [ ] Generated SQLite database file at
~/.config/dwata/db.sqlite3 - [ ] README documentation:
- How to run the generator
- Where database is stored
- Database schema documentation
- Sample queries to verify data
Testing & Verification
Sample Queries
-- Verify account count
SELECT COUNT(*) FROM accounts;
-- Verify email distribution by account
SELECT a.email, COUNT(e.id) as email_count
FROM accounts a
LEFT JOIN emails e ON a.id = e.account_id
GROUP BY a.id;
-- Verify attachments
SELECT COUNT(*) FROM attachments;
-- Check email date range
SELECT
datetime(MIN(received_at), 'unixepoch') as earliest,
datetime(MAX(received_at), 'unixepoch') as latest
FROM emails;
-- Verify labels
SELECT l.name, COUNT(el.email_id) as email_count
FROM labels l
LEFT JOIN email_labels el ON l.id = el.label_id
GROUP BY l.id
ORDER BY email_count DESC
LIMIT 10;
Success Criteria
- [ ] Config directory created at correct platform-specific location
- [ ] Database file created at
~/.config/dwata/db.sqlite3(Linux/macOS) - [ ] Database created with all tables and constraints
- [ ] 4-5 accounts with realistic email addresses
- [ ] ~2,000 emails with proper temporal distribution
- [ ] 200-500 contacts
- [ ] ~600 attachments (30% of emails)
- [ ] All foreign key relationships valid
- [ ] Indexes created for performance
- [ ] Generator completes in <30 seconds
- [ ] Database file size reasonable (<200MB)
- [ ] Proper error handling for directory/file creation failures
Technical Stack
- Language: Rust (latest stable)
- Database: SQLite3
- Crates:
rusqlite- SQLite interfacefakeorfaker_rand- Fake data generationchrono- Date/time handlingserde_json- JSON serialization for email arraysrand- Random number generationclap- CLI argument parsingdirs- Platform-specific directory paths