dbeam
Improve throughput
- Encode the JDBC ResultSet into a ByteBuffer using `directBinaryEncoder`, then write it using `appendEncoded()`. This avoids some copying of bytes between buffers.
- Use a `BlockingQueue` to asynchronously read from JDBC and write to the file.

Early experiments show that this can improve throughput by ~15-30%.
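The asynchronous hand-off can be sketched with a plain `BlockingQueue` producer/consumer. This is a minimal, self-contained sketch, not dbeam's actual code: the class and sentinel names are hypothetical, and where this sketch decodes buffers to strings, dbeam's producer would serialize each row with `EncoderFactory.get().directBinaryEncoder(...)` and its consumer would call `DataFileWriter.appendEncoded(buffer)`:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueuedAvroWriter {

  // Empty buffer used as an end-of-stream sentinel (hypothetical convention).
  static final ByteBuffer END_OF_STREAM = ByteBuffer.allocate(0);

  // Consumer loop: takes encoded datums off the queue until the sentinel
  // arrives. In dbeam this is where DataFileWriter.appendEncoded(buffer)
  // would be called; here we decode to strings so the sketch is runnable.
  static List<String> drainTo(BlockingQueue<ByteBuffer> queue) throws InterruptedException {
    List<String> written = new ArrayList<>();
    while (true) {
      ByteBuffer buffer = queue.take();
      if (buffer == END_OF_STREAM) {
        return written;
      }
      written.add(StandardCharsets.UTF_8.decode(buffer).toString());
    }
  }

  public static void main(String[] args) throws Exception {
    // Bounded queue, so a slow writer applies back-pressure to the reader.
    BlockingQueue<ByteBuffer> queue = new ArrayBlockingQueue<>(1024);

    // Producer thread: in dbeam this would iterate the JDBC ResultSet and
    // serialize each row into a byte buffer with a direct binary encoder.
    Thread reader = new Thread(() -> {
      try {
        for (int i = 0; i < 3; i++) {
          byte[] encoded = ("row-" + i).getBytes(StandardCharsets.UTF_8);
          queue.put(ByteBuffer.wrap(encoded));
        }
        queue.put(END_OF_STREAM);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    reader.start();

    List<String> written = drainTo(queue);
    reader.join();
    System.out.println(written); // [row-0, row-1, row-2]
  }
}
```

The bounded capacity matters: if the Avro writer falls behind, `put()` blocks the JDBC reader instead of buffering unboundedly.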
master / #60 (https://travis-ci.org/spotify/dbeam/builds/520639708#L1545):
```
scenario    records  writeElapsedMs  msPerMillionRows  bytesWritten  kBps
deflate1t5  1000000  10288           10288             43722188      4249.823
deflate1t5  1000000  9884            9884              43722188      4423.531
deflate1t5  1000000  9819            9819              43722188      4452.814
||query     1000000  9888            8570              61220677      6191.411
||query     1000000  9835            8620              52475971      5335.635
||query     1000000  9874            8380              52475971      5314.56
```
#61 (just encode to binary, no multithreading) (https://travis-ci.org/spotify/dbeam/builds/520647554#L1524):

```
scenario    records  writeElapsedMs  msPerMillionRows  bytesWritten  kBps
deflate1t5  1000000  9674            9674              53403623      5520.324
deflate1t5  1000000  9407            9407              53403623      5677.008
deflate1t5  1000000  9396            9396              53403623      5683.655
||query     1000000  9484            8580              53411427      5631.74
||query     1000000  9596            8705              74772355      7792.033
||query     1000000  9800            9130              53411427      5450.145
```
This PR (https://travis-ci.org/spotify/dbeam/builds/520648264#L1539):
```
scenario    records  writeElapsedMs  msPerMillionRows  bytesWritten  kBps
deflate1t5  1000000  8464            8464              53406242      6309.811
deflate1t5  1000000  7791            7791              53406242      6854.863
deflate1t5  1000000  7741            7741              53406242      6899.139
||query     1000000  8110            7350              74773099      9219.864
||query     1000000  8510            7075              74773099      8786.498
||query     1000000  8204            7330              64093112      7812.422
```
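As a sanity check on the tables above, the `kBps` column is simply `bytesWritten / writeElapsedMs`: bytes per millisecond equals kilobytes per second with 1 kB = 1000 bytes. A tiny sketch (class name hypothetical), using the first `deflate1t5` row of the master/#60 run and of this PR's run:

```java
public class ThroughputCheck {

  // kBps = bytesWritten / writeElapsedMs, since bytes/ms == kB/s (1 kB = 1000 B).
  static double kBps(long bytesWritten, long writeElapsedMs) {
    return (double) bytesWritten / writeElapsedMs;
  }

  public static void main(String[] args) {
    System.out.printf("%.3f%n", kBps(43722188L, 10288L)); // ~4249.8 (master)
    System.out.printf("%.3f%n", kBps(53406242L, 8464L));  // ~6309.8 (this PR)
  }
}
```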
The results look promising.
Codecov Report
Merging #59 into master will decrease coverage by 0.66%. The diff coverage is 81.81%.
```diff
@@             Coverage Diff              @@
##             master      #59      +/-  ##
=============================================
- Coverage      89.69%   89.02%   -0.67%
- Complexity       177      181       +4
=============================================
  Files             22       23       +1
  Lines            679      711      +32
  Branches          51       53       +2
=============================================
+ Hits             609      633      +24
- Misses            47       54       +7
- Partials          23       24       +1
```
Codecov Report
Merging #59 into master will decrease coverage by 0.46%. The diff coverage is 84.61%.
```diff
@@             Coverage Diff              @@
##             master      #59      +/-  ##
=============================================
- Coverage      90.08%   89.62%   -0.47%
  Complexity      230      230
=============================================
  Files            25       26       +1
  Lines           908      925      +17
  Branches         65       65
=============================================
+ Hits            818      829      +11
- Misses           59       65       +6
  Partials         31       31
```
I don't remember, but did you also try separating out the read and writes into two different DoFns? Does that move the thread management to Beam?
I remember trying on earlier versions of DBeam to have two phases: read JDBC and write to Avro. The problem was that Beam waited for the read bundle to complete, serialized it, and then wrote to Avro, which was very inefficient. If we found a way to "stream" between different DoFns (including the fanout: one JDBC query fans out to millions of records), then we could rely on Beam for that.
Thanks, I remember now.