TorBot
Add database feature
Issue #315
Changes Proposed
- New database format for the --save flag: adds database as a new choice for the --save argument (main.py, lines 138-139; a sketch of the change follows this list)
- Core database module: creates src/torbot/modules/database.py, which implements the SearchResultsDatabase class for SQLite management; no external database server is required (uses the built-in sqlite3 module)
- Integration with LinkTree: adds a saveDatabase() method in src/torbot/modules/linktree.py (lines 159-195) that extracts all discovered links and metadata for persistent storage
- Query utilities: creates src/torbot/modules/db_query.py for result retrieval and scripts/query_database.py, a CLI for database operations
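The --save change amounts to one extra argparse choice. A minimal sketch, assuming the flag is declared with argparse (the pre-existing choices shown here are assumptions, not the actual main.py code):

```python
import argparse

parser = argparse.ArgumentParser(prog="torbot")
parser.add_argument(
    "--save",
    type=str,
    choices=["tree", "json", "database"],  # "database" is the option added by this PR;
                                           # the other choices are illustrative
    help="save results in the chosen format",
)
```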
Explanation of Changes
Database Engine & Architecture
- Engine: SQLite (file-based, no server required)
- Location: <project_root>/torbot_search_results.db
- Auto-initialized on first use (see the sketch below)
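"Auto-initialized" here just means sqlite3 creates the file on first connect. An illustrative sketch, where the constant and helper names are assumptions rather than the actual identifiers in database.py:

```python
import sqlite3
from pathlib import Path

# Hypothetical path constant; in the module it would resolve to the project root.
DB_PATH = Path("torbot_search_results.db")

def connect(db_path: Path = DB_PATH) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)           # creates the file on first use
    conn.execute("PRAGMA foreign_keys = ON")  # required for the CASCADE delete below
    return conn
```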
Database Schema
searches Table (Search Metadata)
- id (INTEGER PRIMARY KEY): Auto-incrementing search ID
- root_url (TEXT): The root URL that was crawled
- search_timestamp (DATETIME): ISO 8601 formatted timestamp of search
- depth (INTEGER): Crawl depth setting used
- total_links (INTEGER): Count of total links discovered
- links_data (TEXT): JSON array of all link metadata
- created_at (DATETIME): Record creation timestamp
links Table (Individual Link Records)
- id (INTEGER PRIMARY KEY): Auto-incrementing link ID
- search_id (INTEGER): Foreign key referencing searches table
- url (TEXT): Full URL of discovered link
- title (TEXT): Page title or hostname
- status_code (INTEGER): HTTP response code (200, 404, etc.)
- classification (TEXT): Content classification from NLP module
- accuracy (REAL): Classification confidence score (0.0-1.0)
- emails (TEXT): JSON array of emails found on page
- phone_numbers (TEXT): JSON array of phone numbers found
Relationship: one search has many links (1:N, enforced with ON DELETE CASCADE)
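The DDL implied by the field lists above could look like the following (a reconstruction; the exact constraints and column order in database.py may differ):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS searches (
    id               INTEGER PRIMARY KEY AUTOINCREMENT,
    root_url         TEXT NOT NULL,
    search_timestamp DATETIME,
    depth            INTEGER,
    total_links      INTEGER,
    links_data       TEXT,      -- JSON array of link metadata
    created_at       DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS links (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    search_id      INTEGER NOT NULL,
    url            TEXT NOT NULL,
    title          TEXT,
    status_code    INTEGER,
    classification TEXT,
    accuracy       REAL,        -- confidence score, 0.0-1.0
    emails         TEXT,        -- JSON array
    phone_numbers  TEXT,        -- JSON array
    FOREIGN KEY (search_id) REFERENCES searches(id) ON DELETE CASCADE
);
"""

def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(SCHEMA)
```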
Metadata Captured Per Search
Root-Level Metadata:
✅ Root URL being crawled
✅ Exact timestamp of the search (ISO 8601)
✅ Crawl depth configuration
✅ Total link count
Per-Link Metadata:
✅ Full URL
✅ Page title
✅ HTTP status code (connectivity indicator)
✅ Content classification (marketplace, forum, etc.)
✅ Classification accuracy/confidence
✅ Email addresses extracted
✅ Phone numbers extracted
Core Features (interface sketched after this list):
- Save results -> SearchResultsDatabase.save_search_results() -> stores the search plus its links
- Retrieve history -> get_search_history() -> query past searches with an optional URL filter
- Get details -> get_search_by_id() -> full search details with all links
- Close connection -> close() -> proper resource cleanup
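A minimal sketch of that interface, assuming link metadata arrives as dictionaries whose keys mirror the links table columns (method bodies and type hints are illustrative, not the actual database.py code):

```python
import json
import sqlite3
from datetime import datetime, timezone

class SearchResultsDatabase:
    """Sketch of the interface listed above; bodies are illustrative."""

    def __init__(self, db_path: str = "torbot_search_results.db") -> None:
        self._conn = sqlite3.connect(db_path)
        self._conn.execute("PRAGMA foreign_keys = ON")

    def save_search_results(self, root_url: str, depth: int, links: list[dict]) -> int:
        """Store one search row plus one row per discovered link."""
        ts = datetime.now(timezone.utc).isoformat()
        cur = self._conn.execute(
            "INSERT INTO searches (root_url, search_timestamp, depth,"
            " total_links, links_data) VALUES (?, ?, ?, ?, ?)",
            (root_url, ts, depth, len(links), json.dumps(links)),
        )
        search_id = cur.lastrowid
        self._conn.executemany(
            "INSERT INTO links (search_id, url, title, status_code,"
            " classification, accuracy, emails, phone_numbers)"
            " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            [
                (
                    search_id,
                    link["url"],                        # key names are assumptions
                    link.get("title"),
                    link.get("status_code"),
                    link.get("classification"),
                    link.get("accuracy"),
                    json.dumps(link.get("emails", [])),
                    json.dumps(link.get("phone_numbers", [])),
                )
                for link in links
            ],
        )
        self._conn.commit()
        return search_id

    def get_search_history(self, url_filter: str | None = None) -> list[tuple]:
        """List past searches, optionally filtered by a root-URL substring."""
        query = "SELECT id, root_url, search_timestamp, total_links FROM searches"
        params: tuple = ()
        if url_filter:
            query += " WHERE root_url LIKE ?"
            params = (f"%{url_filter}%",)
        return self._conn.execute(query, params).fetchall()

    def get_search_by_id(self, search_id: int) -> list[tuple]:
        """Return all link rows recorded for one search."""
        return self._conn.execute(
            "SELECT url, title, status_code, classification, accuracy"
            " FROM links WHERE search_id = ?",
            (search_id,),
        ).fetchall()

    def close(self) -> None:
        self._conn.close()
```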
Usage
Basic save:
```
python main.py -u http://example.onion --depth 2 --save database
```
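Retrieving results afterwards, shown against the interface sketched above (the import path is assumed from the src/ layout; the real helpers in db_query.py and scripts/query_database.py may expose different names and flags):

```python
from torbot.modules.database import SearchResultsDatabase  # import path assumed

db = SearchResultsDatabase()
for search_id, root_url, timestamp, total in db.get_search_history(url_filter=".onion"):
    print(f"[{search_id}] {timestamp}  {root_url}  ({total} links)")
db.close()
```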
Benefits:
- Persistence: Search results survive program restarts
- Auditability: Full timestamp history of all crawls
- Queryability: Filter and search previous results
- Scalability: SQLite handles thousands of records efficiently
- No Dependencies: Uses Python's built-in sqlite3 module
- Relationship Integrity: Foreign keys prevent orphaned records
- Export Ready: JSON data format enables easy integration with other tools (see the export sketch below)
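Because link metadata is stored as JSON text, feeding it to other tools needs only the standard library (a sketch; column names match the schema above):

```python
import json
import sqlite3

conn = sqlite3.connect("torbot_search_results.db")
for root_url, links_data in conn.execute("SELECT root_url, links_data FROM searches"):
    links = json.loads(links_data)  # JSON text back to native Python objects
    print(root_url, "->", len(links), "links")
conn.close()
```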