Add structured dataset loading capability to valkey-benchmark
## Background
Currently, valkey-benchmark only supports synthetic data generation through placeholders like `__rand_int__` and `__data__`. This limits realistic performance testing, since synthetic data doesn't reflect the real-world usage patterns, data distributions, or content characteristics that applications actually work with. We need this capability for our full-text search work and believe it would benefit other use cases such as JSON operations, VSS, and general data modeling.
This PR adds structured dataset loading:
- Support XML/CSV/TSV file formats.
- Use `__field:fieldname__` placeholders to substitute the corresponding field from the dataset file.
- Preserve the natural, variable-length content of each field.
- Allow mixed placeholder usage, combining dataset fields with the existing random generators.
- Discover fields automatically from CSV/TSV headers and XML tags.
- Use `--maxdocs` to limit how many records are loaded.
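For example, a hypothetical `products.csv` might look like the following; the header row is what automatic field discovery reads, so `__field:name__` and `__field:category__` resolve against each record:

```csv
id,name,category
1,Wireless Mouse,electronics
2,Espresso Machine,kitchen
3,Trail Running Shoes,sportswear
```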
Rather than modifying the existing placeholder system, we detect field placeholders and switch to a separate code path that builds commands from scratch using `valkeyFormatCommandArgv()` (sketched after the list below). This ensures:
- Zero impact on existing functionality
- Full support for variable-size content
- Thread-safe atomic record iteration
- Compatible with pipelining and threading modes
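As an illustration of this code path, here is a minimal, self-contained sketch. The identifiers (`dataset`, `resolve_arg`, `build_next_command`) are hypothetical, not the ones in src/valkey-benchmark.c, and it simplifies by only substituting arguments that consist of a single placeholder, whereas the patch also supports placeholders embedded inside larger arguments:

```c
/* Hypothetical sketch of the dataset-driven code path: pick the next
 * record with an atomic counter, substitute __field:NAME__ arguments,
 * and let valkeyFormatCommandArgv() serialize the final argv. */
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>
#include <valkey/valkey.h> /* valkeyFormatCommandArgv(); header path may differ */

typedef struct {
    size_t nfields;
    char **names;  /* discovered from CSV/TSV headers or XML tags */
    char ***rows;  /* rows[i][f] = value of field f in record i */
    size_t nrows;  /* bounded by --maxdocs at load time */
} dataset;

static atomic_size_t next_record; /* shared by all benchmark threads */

/* If arg is exactly "__field:NAME__", return the record's value for
 * NAME; otherwise return the argument unchanged. */
static const char *resolve_arg(const dataset *ds, char *const *row,
                               const char *arg) {
    size_t len = strlen(arg);
    if (len > 10 && strncmp(arg, "__field:", 8) == 0 &&
        strcmp(arg + len - 2, "__") == 0) {
        for (size_t f = 0; f < ds->nfields; f++) {
            if (strlen(ds->names[f]) == len - 10 &&
                strncmp(ds->names[f], arg + 8, len - 10) == 0)
                return row[f];
        }
    }
    return arg;
}

/* Serialize the command for the next record into *target; returns the
 * buffer length, as valkeyFormatCommandArgv() does. */
long long build_next_command(const dataset *ds, char **target,
                             int argc, const char **tmpl) {
    /* Atomic fetch-add hands each thread a distinct record, wrapping
     * around the dataset; no locking needed. */
    size_t i = atomic_fetch_add(&next_record, 1) % ds->nrows;
    const char **argv = malloc(sizeof(*argv) * (size_t)argc);
    size_t *argvlen = malloc(sizeof(*argvlen) * (size_t)argc);
    for (int a = 0; a < argc; a++) {
        argv[a] = resolve_arg(ds, ds->rows[i], tmpl[a]);
        argvlen[a] = strlen(argv[a]); /* natural, variable field sizes */
    }
    long long len = valkeyFormatCommandArgv(target, argc, argv, argvlen);
    free(argv);
    free(argvlen);
    return len;
}
```

Because the argv is built fresh for every record, nothing assumes the fixed-size buffers that the synthetic placeholders use, which is what makes variable-size content work.

## Usage Examples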
# Strings - Simple key-value with dataset fields
./valkey-benchmark --dataset products.csv -n 10000 SET product:__rand_int__ "__field:name__"
# Sets - Unique collections from dataset
./valkey-benchmark --dataset categories.csv -n 10000 SADD tags:__rand_int__ "__field:category__"
# XML dataset with document limit (record extraction sketched after these examples)
./valkey-benchmark --dataset wiki.xml --xml-root-element doc --maxdocs 100000 -n 50000 HSET doc:__rand_int__ title "__field:title__" body "__field:abstract__"
# Mixed placeholders (dataset + random)
./valkey-benchmark --dataset terms.csv -r 5000000 -n 50000 HSET search:__rand_int__ term "__field:term__" score __rand_1st__
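In the XML case above, `--xml-root-element doc` tells the loader that each `<doc>…</doc>` element is one record, with child elements (`<title>`, `<abstract>`, …) becoming fields, and `--maxdocs` capping how many records are read. A rough sketch of that per-record field extraction, assuming a helper named `xml_field` that is not part of the actual patch:

```c
/* Hypothetical helper, not the loader in src/valkey-benchmark.c:
 * return a copy of the text inside <tag>...</tag> within one record
 * buffer. Real XML handling (attributes, nesting, entities) is more
 * involved; this only illustrates the record/field model. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *xml_field(const char *record, const char *tag) {
    char open[64], close[64];
    snprintf(open, sizeof(open), "<%s>", tag);
    snprintf(close, sizeof(close), "</%s>", tag);
    const char *start = strstr(record, open);
    if (start == NULL) return NULL;
    start += strlen(open);
    const char *end = strstr(start, close);
    if (end == NULL) return NULL;
    char *value = malloc((size_t)(end - start) + 1);
    if (value == NULL) return NULL;
    memcpy(value, start, (size_t)(end - start));
    value[end - start] = '\0';
    return value; /* caller frees */
}
```

With the wiki.xml example, `xml_field(record, "title")` and `xml_field(record, "abstract")` would supply the values behind `__field:title__` and `__field:abstract__`.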
## Full-Text Search Benchmarking
# Search hit scenarios (existing terms)
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__"
# Search miss scenarios (non-existent terms)
./valkey-benchmark --dataset miss_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__"
# Query variations
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "@title:__field:term__"
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__*"
Test environment: AWS c7i.16xlarge instance, 64 vCPU
Test dataset: 5M+ Wikipedia XML documents, 5.8GB in memory
| Configuration | Throughput (RPS) | CPU Usage | Wall Time | Memory Peak |
|---|---|---|---|---|
| Single-threaded, pipeline 1 | 93,295 | 99% | 71.4s | 5.8GB |
| Multi-threaded (10 threads), pipeline 1 | 93,332 | 137% | 71.5s | 5.8GB |
| Single-threaded, pipeline 10 | 274,499 | 96% | 36.1s | 5.8GB |
| Multi-threaded (4 threads), pipeline 10 | 344,589 | 161% | 32.4s | 5.8GB |
## Codecov Report
:x: Patch coverage is 85.91270% with 71 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 72.54%. Comparing base (e19ceb7) to head (b3a4516).
:warning: Report is 32 commits behind head on unstable.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/valkey-benchmark.c | 85.91% | 71 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## unstable #2823 +/- ##
============================================
+ Coverage 72.45% 72.54% +0.08%
============================================
Files 128 129 +1
Lines 70414 71020 +606
============================================
+ Hits 51019 51520 +501
- Misses 19395 19500 +105
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/valkey-benchmark.c | 67.89% <85.91%> (+6.18%) | :arrow_up: |
Nice. I appreciate the benchmark.md file. This is an excellent start and provides a place for future documentation.