
source: support native (data chunk) format for benchmark purposes

BugenZhao opened this issue on Aug 10 '22

When benchmarking with the Nexmark source (only with a count(*) simple agg), we find that even if we increase the split number (and the parallel unit count) from 6x3 to 7x3, the throughput does not improve while the CPU usage increases by 100%. That is, the maximum throughput we can get from the Nexmark source is 750k rows per second; this becomes the bottleneck for some workloads.

The Nexmark messages are generated, serialized to JSON, and then deserialized by the source parser, which brings some overhead. It would be better if we could support a "native" format that directly passes in-memory data chunks or columns.


BugenZhao · Aug 10 '22 05:08

For example, 64% of the time of Nexmark bid generation is spent on formatting the timestamp. However, there seems to be no way to optimize this, e.g., by pre-compiling the formatter. :( cc @KeXiangWang

With the "native" format, this overhead can be avoided.

BugenZhao · Aug 10 '22 11:08
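To make the formatting overhead concrete, here is a toy sketch of the difference (assuming chrono 0.4.31+; the function names are illustrative, and this is not RisingWave's actual generator code). The JSON path renders every event time into a string that the source parser must later parse back, while a "native" path would just hand over the raw epoch milliseconds:

use chrono::{DateTime, Utc};

// JSON path: format each event time into a string such as
// "2022-08-10 05:08:00.000"; the parser later pays again to parse it back.
fn event_time_as_json_field(millis: i64) -> String {
    DateTime::<Utc>::from_timestamp_millis(millis)
        .expect("timestamp in range")
        .format("%Y-%m-%d %H:%M:%S%.3f")
        .to_string()
}

// Native path: pass the raw epoch milliseconds; no formatting or parsing at all.
fn event_time_as_native_field(millis: i64) -> i64 {
    millis
}

fn main() {
    let ms = 1_660_107_600_000;
    println!("json field:   {}", event_time_as_json_field(ms));
    println!("native field: {}", event_time_as_native_field(ms));
}

The JSON variant allocates and formats a string per event, which is exactly the kind of per-row cost the profiling above attributes to timestamp formatting.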

related: https://github.com/singularity-data/risingwave/issues/4144

skyzh · Aug 10 '22 13:08

Is there, equivalently, an observed high CPU consumption for deserializing the string timestamp back into our in-memory representation of timestamps (is that UTC time as u64)? If so, the problem would be that our code is slow, not the JSON serialization libraries, as suggested in https://github.com/risingwavelabs/risingwave/issues/5122.

jon-chuang · Sep 06 '22 06:09

> Is there, equivalently, an observed high CPU consumption for deserializing the string timestamp back into our in-memory representation of timestamps (is that UTC time as u64)? If so, the problem would be that our code is slow, not the JSON serialization libraries, as suggested in https://github.com/risingwavelabs/risingwave/issues/5122.

I have not observed that for deserialization. However, if we can avoid any JSON ser/de in the Nexmark path (it is actually not needed at all), we can directly pass the milliseconds as a u64, which could be much better.

BugenZhao · Sep 06 '22 13:09

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

github-actions[bot] · Nov 07 '22 02:11

This feature will facilitate performance testing of streaming operators by eliminating the cost of the Source Executor. Hope it can be done.

fuyufjh · Nov 09 '22 04:11

@TennyZhuang Can you work on this soon? Or should we reassign it?

fuyufjh · Nov 09 '22 04:11

> This feature will facilitate performance testing of streaming operators by eliminating the cost of the Source Executor. Hope it can be done.

To clarify, this feature is nice-to-have but not essential. We can do it as long as the complexity is controllable and it does not hurt the current design of connectors.

fuyufjh · Nov 09 '22 05:11

Just wondering if any design decision has been made; we can probably assign this to someone else if @wangrunji0408 is currently busy with both external functions and expression micro-benchmarks/optimizations.

From a quick check, the only way I can think of doing this without changing the code structure much is an unsafe one (and I'm still not sure it's possible):

  1. Directly generate a StreamChunk in the datagen generator.
  2. Unsafely reinterpret it as [u8].
  3. Have a special parser in ConnectorSourceReader receive the payload (i.e., the [u8]) and reconstruct the StreamChunk.

Hope I am wrong!

lmatz · Dec 14 '22 08:12
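For the record, a minimal sketch of the pointer-smuggling idea in steps 2-3 above might look like the following. `StreamChunk` is a placeholder struct here, not RisingWave's real type, and the trick is only sound if the payload stays within a single process and is consumed exactly once:

// Placeholder standing in for RisingWave's real StreamChunk type.
struct StreamChunk {
    rows: Vec<(i64, i64)>,
}

// Step 2: leak the chunk and encode its address as a byte payload.
fn chunk_to_payload(chunk: StreamChunk) -> Vec<u8> {
    let ptr = Box::into_raw(Box::new(chunk)) as usize;
    ptr.to_ne_bytes().to_vec()
}

// Step 3: the "special parser" reconstructs the chunk from the payload.
// Safety: the payload must come from `chunk_to_payload` in the same
// process and must be consumed exactly once.
unsafe fn payload_to_chunk(payload: &[u8]) -> StreamChunk {
    let mut bytes = [0u8; std::mem::size_of::<usize>()];
    bytes.copy_from_slice(payload);
    let ptr = usize::from_ne_bytes(bytes) as *mut StreamChunk;
    *Box::from_raw(ptr)
}

fn main() {
    let chunk = StreamChunk { rows: vec![(1, 2), (3, 4)] };
    let payload = chunk_to_payload(chunk);
    let restored = unsafe { payload_to_chunk(&payload) };
    assert_eq!(restored.rows, vec![(1, 2), (3, 4)]);
}

Nothing in the [u8] payload type prevents a reader from treating the bytes as ordinary data, so correctness rests entirely on conventions outside the type system, which is presumably why the comment above hopes to be wrong.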

I guess the current performance, after a series of optimizations, is good enough for micro-benchmarks: as long as we have sufficient CPU cores, we can almost always make the throughput of data-gen larger than that of the downstream executor. 🤔

BugenZhao · Dec 14 '22 08:12

> make the throughput of data-gen larger than that of the downstream executor.

I see, yes! It is not particularly urgent; it's useful when we try to get peak performance numbers 🤣 for demo purposes.

lmatz · Dec 14 '22 08:12

create source t1 (
    v1 BIGINT,
    v2 BIGINT,
    t1 timestamp,
    t2 timestamp,
    c1 varchar,
    c2 varchar
) with (
    connector = 'datagen',
    datagen.split.num = '1', -- one thread
    datagen.rows.per.second = '5000000'
) ROW FORMAT JSON;

create sink s1 as select * from t1 with ( connector = 'blackhole' );

Roughly 220K rows/s (going through JSON ser/de) -> 310K rows/s (generating rows directly).

lmatz · Dec 15 '22 15:12
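(That is, skipping the JSON round trip in this single-threaded setup improves throughput by roughly 310/220 ≈ 1.4x, before adding any parallelism.)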

Hi all! Is this feature still required to meet performance requirements? If not, can we close it?

wangrunji0408 · Jan 30 '23 05:01

> Hi all! Is this feature still required to meet performance requirements? If not, can we close it?

I think yes? cc. @lmatz

BTW, this would be easier after https://github.com/risingwavelabs/rfcs/pull/31 cc. @waruto210

fuyufjh · Jan 30 '23 05:01

> > Hi all! Is this feature still required to meet performance requirements? If not, can we close it?
>
> I think yes? cc. @lmatz
>
> BTW, this would be easier after risingwavelabs/rfcs#31 cc. @waruto210

We can start doing it after #7508 is merged.

waruto210 · Jan 30 '23 06:01

Done by #7621.

waruto210 · Feb 15 '23 08:02