dbeam
Improve throughput
- Encode the JDBC ResultSet into a ByteBuffer using `directBinaryEncoder`, then write it using `appendEncoded()`. This avoids some copying of bytes between buffers.
- Use a `BlockingQueue` to asynchronously read from JDBC and write to the file.

Early experiments show that this can improve throughput by ~15-30%.
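The asynchronous hand-off can be sketched with a plain `BlockingQueue` producer/consumer. This is a minimal, self-contained sketch, not dbeam's actual code: the class and sentinel names are hypothetical, and where this sketch decodes buffers to strings, dbeam's producer would serialize each row with `EncoderFactory.get().directBinaryEncoder(...)` and its consumer would call `DataFileWriter.appendEncoded(buffer)`:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueuedAvroWriter {

  // Empty buffer used as an end-of-stream sentinel (hypothetical convention).
  static final ByteBuffer END_OF_STREAM = ByteBuffer.allocate(0);

  // Consumer loop: takes encoded datums off the queue until the sentinel
  // arrives. In dbeam this is where DataFileWriter.appendEncoded(buffer)
  // would be called; here we decode to strings so the sketch is runnable.
  static List<String> drainTo(BlockingQueue<ByteBuffer> queue) throws InterruptedException {
    List<String> written = new ArrayList<>();
    while (true) {
      ByteBuffer buffer = queue.take();
      if (buffer == END_OF_STREAM) {
        return written;
      }
      written.add(StandardCharsets.UTF_8.decode(buffer).toString());
    }
  }

  public static void main(String[] args) throws Exception {
    // Bounded queue, so a slow writer applies back-pressure to the reader.
    BlockingQueue<ByteBuffer> queue = new ArrayBlockingQueue<>(1024);

    // Producer thread: in dbeam this would iterate the JDBC ResultSet and
    // serialize each row into a byte buffer with a direct binary encoder.
    Thread reader = new Thread(() -> {
      try {
        for (int i = 0; i < 3; i++) {
          byte[] encoded = ("row-" + i).getBytes(StandardCharsets.UTF_8);
          queue.put(ByteBuffer.wrap(encoded));
        }
        queue.put(END_OF_STREAM);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    reader.start();

    List<String> written = drainTo(queue);
    reader.join();
    System.out.println(written); // [row-0, row-1, row-2]
  }
}
```

The bounded capacity matters: if the Avro writer falls behind, `put()` blocks the JDBC reader instead of buffering unboundedly.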
master / #60 (https://travis-ci.org/spotify/dbeam/builds/520639708#L1545):
```
scenario    records  writeElapsedMs  msPerMillionRows  bytesWritten  kBps
deflate1t5  1000000  10288           10288             43722188      4249.823
deflate1t5  1000000  9884            9884              43722188      4423.531
deflate1t5  1000000  9819            9819              43722188      4452.814
||query     1000000  9888            8570              61220677      6191.411
||query     1000000  9835            8620              52475971      5335.635
||query     1000000  9874            8380              52475971      5314.56
```
#61 (just encode to binary, no multithreading) (https://travis-ci.org/spotify/dbeam/builds/520647554#L1524):

```
scenario    records  writeElapsedMs  msPerMillionRows  bytesWritten  kBps
deflate1t5  1000000  9674            9674              53403623      5520.324
deflate1t5  1000000  9407            9407              53403623      5677.008
deflate1t5  1000000  9396            9396              53403623      5683.655
||query     1000000  9484            8580              53411427      5631.74
||query     1000000  9596            8705              74772355      7792.033
||query     1000000  9800            9130              53411427      5450.145
```
This PR (https://travis-ci.org/spotify/dbeam/builds/520648264#L1539):
```
scenario    records  writeElapsedMs  msPerMillionRows  bytesWritten  kBps
deflate1t5  1000000  8464            8464              53406242      6309.811
deflate1t5  1000000  7791            7791              53406242      6854.863
deflate1t5  1000000  7741            7741              53406242      6899.139
||query     1000000  8110            7350              74773099      9219.864
||query     1000000  8510            7075              74773099      8786.498
||query     1000000  8204            7330              64093112      7812.422
```
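As a sanity check on the tables above, the `kBps` column is simply `bytesWritten / writeElapsedMs`: bytes per millisecond equals kilobytes per second with 1 kB = 1000 bytes. A tiny sketch (class name hypothetical), using the first `deflate1t5` row of the master/#60 run and of this PR's run:

```java
public class ThroughputCheck {

  // kBps = bytesWritten / writeElapsedMs, since bytes/ms == kB/s (1 kB = 1000 B).
  static double kBps(long bytesWritten, long writeElapsedMs) {
    return (double) bytesWritten / writeElapsedMs;
  }

  public static void main(String[] args) {
    System.out.printf("%.3f%n", kBps(43722188L, 10288L)); // ~4249.8 (master)
    System.out.printf("%.3f%n", kBps(53406242L, 8464L));  // ~6309.8 (this PR)
  }
}
```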
The results look promising.
Codecov Report
Merging #59 into master will decrease coverage by 0.66%. The diff coverage is 81.81%.
```diff
@@             Coverage Diff              @@
##             master      #59      +/-  ##
=============================================
- Coverage      89.69%   89.02%   -0.67%
- Complexity       177      181       +4
=============================================
  Files             22       23       +1
  Lines            679      711      +32
  Branches          51       53       +2
=============================================
+ Hits             609      633      +24
- Misses            47       54       +7
- Partials          23       24       +1
```
Codecov Report
Merging #59 into master will decrease coverage by 0.46%. The diff coverage is 84.61%.
```diff
@@             Coverage Diff              @@
##             master      #59      +/-  ##
=============================================
- Coverage      90.08%   89.62%   -0.47%
  Complexity      230      230
=============================================
  Files            25       26       +1
  Lines           908      925      +17
  Branches         65       65
=============================================
+ Hits            818      829      +11
- Misses           59       65       +6
  Partials         31       31
```
I don't remember, but did you also try separating out the read and writes into two different DoFns? Does that move the thread management to Beam?
I remember trying on earlier versions of DBeam to have two phases: read JDBC and write to Avro. The problem was that Beam waited for the read bundle to complete, serialized it, and then wrote to Avro, which was very inefficient. If we found a way to "stream" between different DoFns (including the fanout: one JDBC query fans out to millions of records), then we could rely on Beam for that.
Thanks, I remember now.