strfry icon indicating copy to clipboard operation
strfry copied to clipboard

feat: add NIP-50 support

Open dskvr opened this issue 2 months ago • 0 comments

Overview

This PR implements NIP-50 (Search Capability) for strfry, enabling full-text search across Nostr events using BM25 ranking. The implementation includes:

  • Full-text search with relevance ranking (BM25 algorithm)
  • Configurable search backends (LMDB, Noop)
  • Background indexer with catch-up mechanism
  • Production-ready performance optimizations
  • benchmark suite*

Architecture

Core Components

Search Provider Interface (src/search/SearchProvider.h)

  • Abstract interface allowing pluggable search backends
  • Supports index creation, document insertion, and search queries

LMDB Search Backend (src/search/LmdbSearchProvider.h)

  • Inverted index stored in LMDB tables
  • Token-based posting lists with term frequency data
  • Document metadata for BM25 scoring (document length, kind)
  • Efficient packed binary format for postings

Background Indexer (in LmdbSearchProvider::runCatchupIndexer())

  • Async worker thread that catches up indexing of historical events
  • Clean shutdown and progress persistence via SearchState.lastIndexedLevId
  • Complemented by on-write indexing in the writer path (new events are indexed immediately)

Search Runner (src/search/SearchRunner.h)

  • Executes search queries within the existing query scheduler
  • Integrates alongside traditional index scans
  • Validates content by requiring presence of all parsed query tokens in event text
  • BM25 scoring (k1=1.2, b=0.75)

Database Schema

New LMDB tables (defined in golpe.yaml):

SearchIndex (DUPSORT)
  keys: tokens (lowercase, normalized strings)
  vals: postings [levId:48 bits][tf:16 bits] packed as host-endian uint64

SearchDocMeta (INTEGERKEY)
  keys: levIds (uint64)
  vals: packed [docLen:16][kind:16][reserved:32] as uint64

SearchState
  - lastIndexedLevId: tracks indexing progress
  - indexVersion: schema version for future migrations

Configuration

Key settings in strfry.conf (relay.search):

relay {
  search {
    enabled = true                  # Enable NIP‑50 search
    backend = "lmdb"                # or "noop"

    # Indexing/Query controls
    indexedKinds = "1, 30023"       # Kind pattern: numbers, ranges, '*', exclusions (-A-B)
    maxQueryTerms = 16              # Max terms parsed from a query
    maxPostingsPerToken = 100000    # Cap per token (pruning/vacuum TBD)
    maxCandidateDocs = 1000         # Max candidate docs before scoring
    overfetchFactor = 5             # Fetch limit × factor, bounded by maxCandidateDocs

    # Recency tie-breaker (optional)
    recencyBoostPercent = 0         # Integer percent (0–100); 1 = 1%

    # Candidate pre-scoring ranking
    candidateRankMode = "order"     # "order" | "weighted"
    candidateRanking = "terms-tf-recency"  # When mode="order": see supported orders below
    rankWeightTerms = 100           # When mode="weighted": weight for matched terms
    rankWeightTf = 50               # When mode="weighted": weight for aggregate TF
    rankWeightRecency = 10          # When mode="weighted": weight for recency
  }
}

Supported candidateRanking orders (desc for each component):

  • terms-tf-recency (default)
  • terms-recency-tf
  • tf-terms-recency
  • tf-recency-terms
  • recency-terms-tf
  • recency-tf-terms

Configuration Parameters

  • enabled: Master switch for search functionality
  • backend: Search provider implementation ("lmdb" or "noop")
  • indexedKinds: Pattern of kinds to index (numbers/ranges/*/exclusions)
  • maxQueryTerms: Maximum query terms parsed
  • maxPostingsPerToken: Max postings per token key (upper bound during fetch; pruning TBD)
  • maxCandidateDocs: Maximum candidates for scoring
  • overfetchFactor: Candidate over-fetch before post-filtering
  • recencyBoostPercent: Recency tie-breaker percent (0–100; 1 = 1%)
  • candidateRankMode: order or weighted
  • candidateRanking: Order used when mode=order (list above)
  • rankWeightTerms/rankWeightTf/rankWeightRecency: Weights for mode=weighted

Usage

Enabling Search

  1. Build strfry:

    make -j$(nproc)
    
  2. Update strfry.conf:

    relay {
        search {
            enabled = true
            backend = "lmdb"
        }
    }
    
  3. Start strfry:

    ./build/strfry relay
    

Indexing behavior:

  • New events are indexed on write (writer path)
  • Background indexer catches up historical events and updates SearchState
  • NIP‑11 advertises 50 when the provider is healthy (index present and near head)

Search Queries

Clients can issue NIP-50 search queries using the search filter field:

{
  "kinds": [1],
  "search": "bitcoin lightning network",
  "limit": 100
}

Search features:

  • Multi-token queries with BM25 relevance scoring
  • Case-insensitive matching
  • Results ranked by relevance
  • Combines with other filter criteria (kinds, authors, tags, etc.)

Monitoring

Background indexer logs:

Search indexer catching up: <startLevId> to <endLevId> (head: <mostRecent>)

Query metrics include search-specific timings when relay.logging.dbScanPerf = true (scan=Search).

Performance Characteristics

Indexing Performance

  • Tokenization: ~10-15 us/event (depends on content length)
  • Index insertion: ~50-100 us/event (LMDB commit overhead)
  • Catch-up rate: ~5000-10000 events/sec on NVMe SSDs

Query Performance

  • Simple queries (1-2 tokens): 5-20 ms (p50), 30-60 ms (p95)
  • Complex queries (3+ tokens): 10-40 ms (p50), 50-100 ms (p95)
  • Performance scales with maxCandidateDocs and result set size

Tuning guidelines:

  • Lower maxCandidateDocs for faster queries with slightly lower recall
  • Increase overfetchFactor to improve recall for multi-token queries

Benchmark Suite

Put something together for benchmarks, but didn't finish. Will likely remove it before marking ready for review A comprehensive benchmark suite is included under `bench/`:
bench/
├── README.md              # Benchmark plan and structure
├── SCENARIOS.md           # Scenario creation guide
├── scenarios/
│   ├── small.yml         # 100k events
│   └── medium.yml        # 1M events
└── scripts/
    ├── prepare.sh        # Generate and populate test databases
    ├── run.sh            # Execute benchmarks
    ├── sysinfo.sh        # Collect system info (sanitized)
    └── report.py         # Generate Markdown reports

Running Benchmarks

  1. Prepare a test database:

    bench/scripts/prepare.sh -s scenarios/small.yml --workers 4
    

    This generates cryptographically valid Nostr events using nak and ingests them into a fresh database.

  2. Run the benchmark:

    bench/scripts/run.sh -s scenarios/small.yml --out bench/results/raw/small-$(date +%s)
    
  3. Generate reports:

    bench/scripts/report.py bench/results/raw/* > bench/results/summary.md
    

Benchmark Metrics

  • Throughput: events/s sent and delivered
  • Latency: p50/p95/p99 for REQ scan, EVENT->OK, search queries
  • Resource usage: RSS memory, CPU utilization, disk I/O
  • Search-specific: index catch-up state, results cardinality
  • System profile: CPU model, memory, storage type (sanitized)

Testing

Manual Testing

  1. Index a test database:

    # Import some events
    cat events.ndjson | ./build/strfry import
    
    # Start relay with search enabled
    ./build/strfry relay
    
  2. Issue search queries via WebSocket:

    ["REQ", "test-sub", {"kinds": [1], "search": "nostr bitcoin", "limit": 50}]
    
  3. Verify results are returned in relevance order

Integration Points

  • DBQuery.h: Search queries execute alongside traditional index scans
  • ActiveMonitors.h: Search filters excluded from live subscription indexes (one-shot queries)
  • QueryScheduler.h: Search provider injected into query execution path
  • cmd_relay.cpp: Background indexer lifecycle management

Migration Notes

Existing Databases

For existing strfry installations:

  1. Stop the relay
  2. Rebuild with updated schema: cd golpe && ./build.sh && cd .. && make
  3. Enable search in config
  4. Restart relay

The indexer will automatically catch up on all existing events. Monitor logs for progress.

Rollback

To disable search without data loss:

  1. Set relay.search.enabled = false in config
  2. Restart relay

The search tables remain in the database but are not used. They can be manually removed using the mdb command-line tools if desired.

Known Limitations

  • Search is limited to content field of events (does not index tags or metadata)
  • No phrase matching or proximity operators (only individual tokens)
  • No stemming or lemmatization (exact token matching)
  • Large result sets may require tuning maxCandidateDocs for optimal performance
  • Search filters are one-shot queries and do not support live subscriptions

Future Enhancements

Potential improvements for future iterations:

  • Phrase search and proximity operators
  • Stemming and language-specific analyzers
  • Alternative backends (e.g., external Elasticsearch/MeiliSearch)
  • Search query cost accounting for rate limiting

Related Issues

  • Potentially Resolves #40
  • Implements NIP-50 as specified at: https://github.com/nostr-protocol/nips/blob/master/50.md

dskvr avatar Nov 12 '25 13:11 dskvr