[NEW] Add structured dataset support to valkey-benchmark
Currently, valkey-benchmark only supports synthetic data generation through placeholders like __rand_int__ and __data__. This limits realistic performance testing, since synthetic data doesn't reflect the real-world usage patterns, data distributions, or content characteristics that applications actually work with. We need this capability for our full-text search (FTS) work and believe it would benefit other use cases such as JSON operations, VSS, and general data modeling.
Proposed Solution
Add a --dataset option to valkey-benchmark that loads structured data from files and introduces field-based placeholders:
valkey-benchmark --dataset products.jsonl -n 50000 \
HSET product:__field:id__ name "__field:name__" price __field:price__
New Placeholder Syntax
__field:columnname__: Replaced with data from the specified column of the loaded dataset file.
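For example, if the selected row has id = 101 and name = Widget, product:__field:id__ expands to product:101 and __field:name__ to Widget. Below is a minimal sketch of how the column name could be pulled out of a command template; the function name and signature are illustrative, not existing valkey-benchmark code:

```c
#include <stddef.h>
#include <string.h>

/* Illustrative scanner for the proposed syntax (not existing
 * valkey-benchmark code): find the first "__field:<column>__" token
 * in a command template and copy <column> into `out`. Returns a
 * pointer to the start of the token, or NULL if none is present. */
static char *findFieldPlaceholder(char *cmd_template, char *out, size_t outlen) {
    char *start = strstr(cmd_template, "__field:");
    if (start == NULL) return NULL;
    char *name = start + strlen("__field:");
    char *end = strstr(name, "__");          /* closing "__" of the token */
    if (end == NULL) return NULL;
    size_t len = (size_t)(end - name);
    if (len >= outlen) len = outlen - 1;     /* clamp to the output buffer */
    memcpy(out, name, len);
    out[len] = '\0';
    return start;
}
```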
Supported file formats
- CSV: Header row defines the field names, e.g. title,content,category (a loading sketch follows this list)
- TSV: Tab-separated with a header row, e.g. title\tcontent\tcategory
- Parquet: Columnar binary format, useful for FTS (requires a parsing library)
- JSONL: Each line is a JSON object, e.g. {"title": "...", "content": "...", "embedding": [...]} (requires a parsing library)
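A minimal sketch of the CSV/TSV pre-loading, assuming hypothetical names (dataset_t, datasetLoad) and deliberately ignoring quoting/escaping: the header row supplies the field names, and every following line becomes one in-memory row.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory dataset: field names from the header row,
 * row values stored as one array of strings per row. */
typedef struct {
    char **fields;      /* column names from the header row */
    size_t num_fields;
    char ***rows;       /* rows[r][c] = value of column c in row r */
    size_t num_rows;
} dataset_t;

/* Split one line on `sep` (',' for CSV, '\t' for TSV) into at most
 * `max` strdup'ed tokens; returns the token count. Quoting and
 * escaping are deliberately ignored in this sketch. */
static size_t splitLine(char *line, char sep, char **out, size_t max) {
    size_t n = 0;
    char *start = line;
    for (char *p = line; ; p++) {
        if (*p == sep || *p == '\n' || *p == '\0') {
            char c = *p;
            *p = '\0';
            if (n < max) out[n++] = strdup(start);
            if (c == '\0' || c == '\n') break;
            start = p + 1;
        }
    }
    return n;
}

/* Load the whole file into memory up front, as proposed above. */
static dataset_t *datasetLoad(const char *path, char sep) {
    FILE *fp = fopen(path, "r");
    if (!fp) return NULL;

    dataset_t *ds = calloc(1, sizeof(*ds));
    char line[65536];
    char *tokens[256];

    /* First line is the header: it defines the field names. */
    if (fgets(line, sizeof(line), fp)) {
        ds->num_fields = splitLine(line, sep, tokens, 256);
        ds->fields = malloc(ds->num_fields * sizeof(char *));
        memcpy(ds->fields, tokens, ds->num_fields * sizeof(char *));
    }

    /* Remaining lines are data rows. */
    size_t cap = 1024;
    ds->rows = malloc(cap * sizeof(char **));
    while (fgets(line, sizeof(line), fp)) {
        if (ds->num_rows == cap) {
            cap *= 2;
            ds->rows = realloc(ds->rows, cap * sizeof(char **));
        }
        char **row = calloc(ds->num_fields, sizeof(char *));
        splitLine(line, sep, row, ds->num_fields);
        ds->rows[ds->num_rows++] = row;
    }
    fclose(fp);
    return ds;
}
```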
Details
- Pre-load the dataset into memory during initialization
- Thread-safe row selection using atomic counters (sketched below)
- Extends the existing placeholder system in valkey-benchmark.c
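Building on the dataset_t sketch above, the row selection and field lookup could look roughly like this with C11 atomics; the function names are illustrative, and the real change would hook into the existing placeholder machinery in valkey-benchmark.c:

```c
#include <stdatomic.h>
#include <string.h>

/* Global cursor shared by all benchmark threads; each command picks
 * the next row without taking a lock. */
static atomic_size_t dataset_cursor;

/* Pick the next row index, wrapping around when the dataset is
 * exhausted so the benchmark can issue more requests than rows. */
static size_t datasetNextRow(const dataset_t *ds) {
    size_t n = atomic_fetch_add_explicit(&dataset_cursor, 1,
                                         memory_order_relaxed);
    return n % ds->num_rows;
}

/* Resolve a __field:columnname__ placeholder against one row:
 * return the column's value, or NULL if the column is unknown. */
static const char *datasetFieldValue(const dataset_t *ds, size_t row,
                                     const char *column) {
    for (size_t c = 0; c < ds->num_fields; c++) {
        if (strcmp(ds->fields[c], column) == 0) return ds->rows[row][c];
    }
    return NULL;
}
```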
Use Cases
# FTS with real Wikipedia data
valkey-benchmark --dataset wikipedia.csv -n 100000 \
FT.SEARCH articles "@title:__field:title__"
# E-commerce product catalog
valkey-benchmark --dataset products.csv -n 50000 \
HSET product:__field:id__ name "__field:name__" category "__field:category__"
This seems like a good suggestion, but the exact requirement is not really clear. Could we start off with something limited that addresses the needs of valkey-search, instead of dealing with so many data formats? And regarding data insertion: from what I understand, search can index/query both hash and JSON data structures. Do you want to handle both forms of data loading?
@roshkhatri / @rainsupreme I believe you both were looking into supporting real-world workload scenarios in the automated benchmark framework. Did you have any thoughts on the data generation part?
Yes, I think this would be useful for benchmarking module commands in the framework, as we do not support these as of now.
Would this require variable-size replacements that are updated for every command? If so, I think that implies a fairly significant refactor of valkey-benchmark.
Currently, valkey-benchmark is written with the assumption that placeholders are replaced character-for-character for anything that changes from command to command, like the key. (Data size varies, but all commands contain an identical random string.)
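To illustrate that assumption (this is a simplified illustration, not the actual valkey-benchmark code): the existing __rand_int__ token is exactly 12 characters and is overwritten in place with a 12-digit zero-padded number, so neither the command buffer nor the RESP argument lengths ever change between requests.

```c
#include <stdio.h>
#include <string.h>

/* Simplified illustration of character-for-character replacement:
 * "__rand_int__" is 12 characters and is overwritten in place with a
 * 12-digit zero-padded number, so the buffer layout never changes. */
static void replaceRandPlaceholder(char *cmdbuf, long value) {
    char *p = strstr(cmdbuf, "__rand_int__");
    if (p == NULL) return;
    char digits[13];                       /* 12 digits + NUL */
    snprintf(digits, sizeof(digits), "%012ld", value);
    memcpy(p, digits, 12);                 /* same size in, same size out */
}
```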
Thank you for taking a look at the issue.
> Could we start off with something limited that addresses the needs of valkey-search, instead of dealing with so many data formats? And regarding data insertion: from what I understand, search can index/query both hash and JSON data structures. Do you want to handle both forms of data loading?
We want to start by supporting the CSV/TSV formats, since they require no parsing library and are enough for our search needs. Yes, we need both JSON and HASH support. I believe the existing command-line support lets us provide custom commands, so no additional changes (such as auto-generating templates for specific command types) are needed.
> Would this require variable-size replacements that are updated for every command?

Yes, if we go with a dynamic allocation approach, and that requires some refactoring. It comes with problems like memory limitations, fragmentation from dynamically allocating/freeing the loaded data, and thread contention, all of which would also bottleneck performance.
Alternatively, we can use fixed-size buffers (base size + a safety margin) and keep the current in-place replacement, sized to a maximum command size. We will research what maximum size we can support without bottlenecking benchmarking while still serving our testing needs, and report back.
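A rough sketch of that fixed-size idea, assuming a hypothetical slot width and helper name (the actual width, padding character, and truncation policy are exactly the open questions mentioned above): every field value is written into a constant-width slot, so the replacement stays character-for-character.

```c
#include <string.h>

/* Hypothetical fixed slot width per field placeholder
 * (base size + safety margin); the real value is still to be chosen. */
#define FIELD_SLOT_WIDTH 128

/* Write `value` into a fixed-width slot inside the command buffer:
 * truncate if it is too long, pad if it is too short, so the command
 * size never changes between requests. Note that the padding character
 * ends up in the stored value, which is one of the trade-offs to
 * evaluate. */
static void writeFieldSlot(char *slot, const char *value) {
    size_t len = strlen(value);
    if (len > FIELD_SLOT_WIDTH) len = FIELD_SLOT_WIDTH;
    memcpy(slot, value, len);
    memset(slot + len, ' ', FIELD_SLOT_WIDTH - len);
}
```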