
[NEW] Add structured dataset support to valkey-benchmark

Open • VoletiRam opened this issue on Oct 24 '25 • 5 comments

Currently, valkey-benchmark only supports synthetic data generation through placeholders like __rand_int__ and __data__. This limits realistic performance testing, since synthetic data doesn't reflect the real-world usage patterns, data distributions, or content characteristics that applications actually work with. We need this capability for our full-text search work and believe it would also benefit other use cases such as JSON operations, VSS, and general data modeling.

Proposed Solution

Add a --dataset option to valkey-benchmark that loads structured data from files and introduces field-based placeholders:


valkey-benchmark --dataset products.jsonl -n 50000 \
  HSET product:__field:id__ name "__field:name__" price __field:price__

New Placeholder Syntax

__field:columnname__: Replaced with data from the specified column of the dataset file.

Supported file structures (sample files shown below)

CSV: Header row defines field names - title,content,category

TSV: Tab-separated with header - title\tcontent\tcategory

Parquet: Columnar binary format (for FTS) (requires a parsing library)

JSONL: Each line is a JSON object - {"title": "...", "content": "...", "embedding": [...]} (requires a parsing library)
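
For illustration, a tiny products.csv and the equivalent products.jsonl might look like this (the field names here are hypothetical, chosen to match the examples above):

id,name,price,category
1001,Wireless Mouse,24.99,electronics
1002,Desk Lamp,39.50,home

{"id": "1001", "name": "Wireless Mouse", "price": "24.99", "category": "electronics"}
{"id": "1002", "name": "Desk Lamp", "price": "39.50", "category": "home"}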

Details

  • Pre-load the dataset into memory during initialization
  • Thread-safe row selection using atomic counters (see the sketch after this list)
  • Extends the existing placeholder system in valkey-benchmark.c
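
A minimal sketch of what this could look like (hypothetical names, not the actual valkey-benchmark code): the dataset is parsed once at startup, and each thread picks rows with an atomic fetch-add, so no locking is needed.

/* Sketch only: pre-loaded dataset with lock-free row selection.
 * All names (dataset, datasetNextRow, datasetField) are hypothetical. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char ***rows;    /* rows[r][c] -> field value for row r, column c */
    size_t nrows, ncols;
    const char **colnames; /* header row: one name per column */
    atomic_size_t cursor;  /* shared across all benchmark threads */
} dataset;

/* Round-robin row selection; the atomic fetch-add makes it thread-safe. */
static size_t datasetNextRow(dataset *ds) {
    return atomic_fetch_add(&ds->cursor, 1) % ds->nrows;
}

/* Resolve a __field:name__ placeholder against one row; NULL if unknown. */
static const char *datasetField(dataset *ds, size_t row, const char *name) {
    for (size_t c = 0; c < ds->ncols; c++)
        if (strcmp(ds->colnames[c], name) == 0) return ds->rows[row][c];
    return NULL;
}

int main(void) {
    const char *r0[] = {"1001", "Wireless Mouse"};
    const char *r1[] = {"1002", "Desk Lamp"};
    const char **rows[] = {r0, r1};
    const char *cols[] = {"id", "name"};
    dataset ds = {rows, 2, 2, cols, 0};
    for (int i = 0; i < 4; i++) {
        size_t r = datasetNextRow(&ds);
        printf("HSET product:%s name \"%s\"\n",
               datasetField(&ds, r, "id"), datasetField(&ds, r, "name"));
    }
    return 0;
}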

Use Cases

# FTS with real Wikipedia data
valkey-benchmark --dataset wikipedia.csv -n 100000 \
  FT.SEARCH articles "@title:__field:title__"

# E-commerce product catalog
valkey-benchmark --dataset products.csv -n 50000 \
  HSET product:__field:id__ name "__field:name__" category "__field:category__"

VoletiRam • Oct 24 '25

This seems like a good suggestion, but the exact requirement isn't really clear to me. Could we start off with something limited that addresses the needs of valkey-search, instead of dealing with so many data formats? And regarding data insertion, from what I understand search can index/query both the hash and JSON data structures. Do you want to handle both forms of data loading?

hpatro • Oct 28 '25

@roshkhatri / @rainsupreme I believe you both were looking into supporting real-world workload scenarios in the automated benchmark framework. Did you have any thoughts on the data generation part?

hpatro • Oct 28 '25

Yes, I think this would be useful for benchmarking module commands in the framework, since we do not support those as of now.

roshkhatri • Oct 28 '25

Would this require variable-size replacements that are updated for every command? If so, I think that implies a fairly significant refactor of valkey-benchmark.

Currently valkey-benchmark is written with the assumption that anything that changes from command to command, like the key, gets replaced character for character. (Data size can vary, but all commands contain an identical random string.)
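
For concreteness, a simplified illustration of that assumption (the idea only, not the actual valkey-benchmark code): the command template is built once, and each iteration overwrites a fixed-width region in place, so the buffer never needs resizing.

/* Simplified illustration of fixed-width in-place replacement. */
#include <stdio.h>
#include <string.h>

int main(void) {
    char cmd[] = "GET key:000000000000";      /* template built once */
    char *slot = strstr(cmd, "000000000000"); /* placeholder position */
    for (long i = 0; i < 3; i++) {
        char digits[13];
        snprintf(digits, sizeof(digits), "%012ld", i); /* always 12 chars */
        memcpy(slot, digits, 12);             /* overwrite, never resize */
        puts(cmd);
    }
    return 0;
}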

rainsupreme • Oct 29 '25

Thank you for taking a look at the issue.

Could we start off with something limited that addresses the needs of valkey-search, instead of dealing with so many data formats? And regarding data insertion, from what I understand search can index/query both the hash and JSON data structures. Do you want to handle both forms of data loading?

We want to start by supporting the CSV/TSV formats, since they require no parsing library and are enough for our search needs. Yes, we need both JSON and HASH support. I believe the existing command-line support lets us provide custom commands, so no additional changes (such as auto-generated templates for specific command types) are needed.

Would this require variable-size replacements that are updated for every command? Yes, if we go with a dynamic allocation approach. That requires some refactoring and comes with problems like memory limits, fragmentation from dynamically allocating/freeing the loaded data, and thread contention, all of which would also bottleneck performance.

Alternatively, we can use fixed-size buffers (base size plus a safety margin) and replace the current in-place placeholders up to a maximum command size. I will research what maximum size we can support that serves our testing needs without bottlenecking the benchmark, and report back.
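
A rough sketch of the fixed-size-buffer idea (the sizes and names here, like CMD_BASE and appendField, are assumptions, not a settled design): each client gets one buffer of base size plus margin, and field values are copied in with truncation so the command can never outgrow it.

/* Sketch of the fixed-size buffer alternative; names/sizes hypothetical. */
#include <stdio.h>
#include <string.h>

#define CMD_BASE   4096  /* assumed maximum template size */
#define CMD_MARGIN 1024  /* safety margin for expanded fields */

/* Append val to buf, truncating to the remaining capacity. */
static size_t appendField(char *buf, size_t len, size_t cap, const char *val) {
    size_t vlen = strlen(val);
    if (vlen > cap - len - 1) vlen = cap - len - 1; /* truncate, don't grow */
    memcpy(buf + len, val, vlen);
    buf[len + vlen] = '\0';
    return len + vlen;
}

int main(void) {
    char cmd[CMD_BASE + CMD_MARGIN]; /* allocated once, reused per command */
    size_t len = 0;
    len = appendField(cmd, len, sizeof(cmd), "HSET product:");
    len = appendField(cmd, len, sizeof(cmd), "1001");
    len = appendField(cmd, len, sizeof(cmd), " name \"Wireless Mouse\"");
    printf("%s (len=%zu of %zu)\n", cmd, len, sizeof(cmd));
    return 0;
}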

VoletiRam • Nov 03 '25