feat: add NIP-50 support

Open dskvr opened this issue 2 months ago • 0 comments

Overview

This PR implements NIP-50 (Search Capability) for strfry, enabling full-text search across Nostr events using BM25 ranking. The implementation includes:

Full-text search with relevance ranking (BM25 algorithm)
Configurable search backends (LMDB, Noop)
Background indexer with catch-up mechanism
Production-ready performance optimizations
benchmark suite*

Architecture

Core Components

Search Provider Interface (src/search/SearchProvider.h)

Abstract interface allowing pluggable search backends
Supports index creation, document insertion, and search queries

LMDB Search Backend (src/search/LmdbSearchProvider.h)

Inverted index stored in LMDB tables
Token-based posting lists with term frequency data
Document metadata for BM25 scoring (document length, kind)
Efficient packed binary format for postings

Background Indexer (in LmdbSearchProvider::runCatchupIndexer())

Async worker thread that catches up indexing of historical events
Clean shutdown and progress persistence via SearchState.lastIndexedLevId
Complemented by on-write indexing in the writer path (new events are indexed immediately)

Search Runner (src/search/SearchRunner.h)

Executes search queries within the existing query scheduler
Integrates alongside traditional index scans
Validates content by requiring presence of all parsed query tokens in event text
BM25 scoring (k1=1.2, b=0.75)

Database Schema

New LMDB tables (defined in golpe.yaml):

SearchIndex (DUPSORT)
  keys: tokens (lowercase, normalized strings)
  vals: postings [levId:48 bits][tf:16 bits] packed as host-endian uint64

SearchDocMeta (INTEGERKEY)
  keys: levIds (uint64)
  vals: packed [docLen:16][kind:16][reserved:32] as uint64

SearchState
  - lastIndexedLevId: tracks indexing progress
  - indexVersion: schema version for future migrations

Configuration

Key settings in strfry.conf (relay.search):

relay {
  search {
    enabled = true                  # Enable NIP‑50 search
    backend = "lmdb"                # or "noop"

    # Indexing/Query controls
    indexedKinds = "1, 30023"       # Kind pattern: numbers, ranges, '*', exclusions (-A-B)
    maxQueryTerms = 16              # Max terms parsed from a query
    maxPostingsPerToken = 100000    # Cap per token (pruning/vacuum TBD)
    maxCandidateDocs = 1000         # Max candidate docs before scoring
    overfetchFactor = 5             # Fetch limit × factor, bounded by maxCandidateDocs

    # Recency tie-breaker (optional)
    recencyBoostPercent = 0         # Integer percent (0–100); 1 = 1%

    # Candidate pre-scoring ranking
    candidateRankMode = "order"     # "order" | "weighted"
    candidateRanking = "terms-tf-recency"  # When mode="order": see supported orders below
    rankWeightTerms = 100           # When mode="weighted": weight for matched terms
    rankWeightTf = 50               # When mode="weighted": weight for aggregate TF
    rankWeightRecency = 10          # When mode="weighted": weight for recency
  }
}

Supported candidateRanking orders (desc for each component):

terms-tf-recency (default)
terms-recency-tf
tf-terms-recency
tf-recency-terms
recency-terms-tf
recency-tf-terms

Configuration Parameters

enabled: Master switch for search functionality
backend: Search provider implementation ("lmdb" or "noop")
indexedKinds: Pattern of kinds to index (numbers/ranges/*/exclusions)
maxQueryTerms: Maximum query terms parsed
maxPostingsPerToken: Max postings per token key (upper bound during fetch; pruning TBD)
maxCandidateDocs: Maximum candidates for scoring
overfetchFactor: Candidate over-fetch before post-filtering
recencyBoostPercent: Recency tie-breaker percent (0–100; 1 = 1%)
candidateRankMode: order or weighted
candidateRanking: Order used when mode=order (list above)
rankWeightTerms/rankWeightTf/rankWeightRecency: Weights for mode=weighted

Usage

Enabling Search

Build strfry:
```
make -j$(nproc)
```

Update strfry.conf:

relay {
    search {
        enabled = true
        backend = "lmdb"
    }
}

Start strfry:
```
./build/strfry relay
```

Indexing behavior:

New events are indexed on write (writer path)
Background indexer catches up historical events and updates SearchState
NIP‑11 advertises 50 when the provider is healthy (index present and near head)

Search Queries

Clients can issue NIP-50 search queries using the search filter field:

{
  "kinds": [1],
  "search": "bitcoin lightning network",
  "limit": 100
}

Search features:

Multi-token queries with BM25 relevance scoring
Case-insensitive matching
Results ranked by relevance
Combines with other filter criteria (kinds, authors, tags, etc.)

Monitoring

Background indexer logs:

Search indexer catching up: <startLevId> to <endLevId> (head: <mostRecent>)

Query metrics include search-specific timings when relay.logging.dbScanPerf = true (scan=Search).

Performance Characteristics

Indexing Performance

Tokenization: ~10-15 us/event (depends on content length)
Index insertion: ~50-100 us/event (LMDB commit overhead)
Catch-up rate: ~5000-10000 events/sec on NVMe SSDs

Query Performance

Simple queries (1-2 tokens): 5-20 ms (p50), 30-60 ms (p95)
Complex queries (3+ tokens): 10-40 ms (p50), 50-100 ms (p95)
Performance scales with maxCandidateDocs and result set size

Tuning guidelines:

Lower maxCandidateDocs for faster queries with slightly lower recall
Increase overfetchFactor to improve recall for multi-token queries

Benchmark Suite

Put something together for benchmarks, but didn't finish. Will likely remove it before marking ready for review

A comprehensive benchmark suite is included under `bench/`:

bench/
├── README.md              # Benchmark plan and structure
├── SCENARIOS.md           # Scenario creation guide
├── scenarios/
│   ├── small.yml         # 100k events
│   └── medium.yml        # 1M events
└── scripts/
    ├── prepare.sh        # Generate and populate test databases
    ├── run.sh            # Execute benchmarks
    ├── sysinfo.sh        # Collect system info (sanitized)
    └── report.py         # Generate Markdown reports

Running Benchmarks

Prepare a test database:
```
bench/scripts/prepare.sh -s scenarios/small.yml --workers 4
```
This generates cryptographically valid Nostr events using nak and ingests them into a fresh database.

Run the benchmark:

bench/scripts/run.sh -s scenarios/small.yml --out bench/results/raw/small-$(date +%s)

Generate reports:

bench/scripts/report.py bench/results/raw/* > bench/results/summary.md

Benchmark Metrics

Throughput: events/s sent and delivered
Latency: p50/p95/p99 for REQ scan, EVENT->OK, search queries
Resource usage: RSS memory, CPU utilization, disk I/O
Search-specific: index catch-up state, results cardinality
System profile: CPU model, memory, storage type (sanitized)

Testing

Manual Testing

Index a test database:

# Import some events
cat events.ndjson | ./build/strfry import

# Start relay with search enabled
./build/strfry relay

Issue search queries via WebSocket:

["REQ", "test-sub", {"kinds": [1], "search": "nostr bitcoin", "limit": 50}]

Verify results are returned in relevance order

Integration Points

DBQuery.h: Search queries execute alongside traditional index scans
ActiveMonitors.h: Search filters excluded from live subscription indexes (one-shot queries)
QueryScheduler.h: Search provider injected into query execution path
cmd_relay.cpp: Background indexer lifecycle management

Migration Notes

Existing Databases

For existing strfry installations:

Stop the relay
Rebuild with updated schema: cd golpe && ./build.sh && cd .. && make
Enable search in config
Restart relay

The indexer will automatically catch up on all existing events. Monitor logs for progress.

Rollback

To disable search without data loss:

Set relay.search.enabled = false in config
Restart relay

The search tables remain in the database but are not used. They can be manually removed using the mdb command-line tools if desired.

Known Limitations

Search is limited to content field of events (does not index tags or metadata)
No phrase matching or proximity operators (only individual tokens)
No stemming or lemmatization (exact token matching)
Large result sets may require tuning maxCandidateDocs for optimal performance
Search filters are one-shot queries and do not support live subscriptions

Future Enhancements

Potential improvements for future iterations:

Phrase search and proximity operators
Stemming and language-specific analyzers
Alternative backends (e.g., external Elasticsearch/MeiliSearch)
Search query cost accounting for rate limiting

Related Issues

Potentially Resolves #40
Implements NIP-50 as specified at: https://github.com/nostr-protocol/nips/blob/master/50.md

Nov 12 '25 13:11 dskvr