
Track high watermark offsets

Open piochelepiotr opened this issue 1 year ago • 5 comments

What Does This Do

Track high watermark offsets alongside produce and commit offsets. Comparing a partition's high watermark offset with the consumer's committed offset gives the consumer's Kafka lag, so lag can now be computed by instrumenting only the consumer service, with no instrumentation on the producer side.
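As a rough illustration of the lag computation described above, here is a minimal sketch. It assumes lag is taken per partition as the high watermark offset minus the committed offset; the class name, method names, and partition keys are hypothetical and are not taken from dd-trace-java.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of dd-trace-java.
final class ConsumerLagSketch {

    // Lag for one partition: records written to the log but not yet committed
    // by the consumer.
    static long lag(long highWatermarkOffset, long committedOffset) {
        // The two offsets may be sampled at slightly different times, so clamp
        // at zero instead of reporting a negative lag.
        return Math.max(0L, highWatermarkOffset - committedOffset);
    }

    // Computes lag for every partition that has both offsets available.
    static Map<String, Long> lagByPartition(Map<String, Long> highWatermarks,
                                            Map<String, Long> committedOffsets) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> entry : committedOffsets.entrySet()) {
            Long highWatermark = highWatermarks.get(entry.getKey());
            if (highWatermark != null) {
                result.put(entry.getKey(), lag(highWatermark, entry.getValue()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> highWatermarks = new HashMap<>();
        highWatermarks.put("orders-0", 120L);
        Map<String, Long> committedOffsets = new HashMap<>();
        committedOffsets.put("orders-0", 100L);
        System.out.println(lagByPartition(highWatermarks, committedOffsets));
    }
}
```

Both inputs are observable from the consumer side, which is why no producer instrumentation is needed in this scheme.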

Motivation

Jira ticket: [PROJ-IDENT]

piochelepiotr avatar Jan 18 '24 18:01 piochelepiotr

Benchmarks

Startup

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master piotr-wolski/add-high-watermark
git_commit_date 1705694656 1705724754
git_commit_sha fcb4a55f20 c25937a4cc
release_version 1.29.0-SNAPSHOT~fcb4a55f20 1.28.0-SNAPSHOT~c25937a4cc
See matching parameters
Baseline Candidate
application insecure-bank insecure-bank
ci_job_date 1705727774 1705727774
ci_job_id 413836885 413836885
ci_pipeline_id 26879332 26879332
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
module Agent Agent
parent None None
variant iast iast

Summary

Found 1 performance improvements and 3 performance regressions! Performance is the same for 40 metrics, 10 unstable metrics.

scenario Δ mean execution_time candidate mean execution_time baseline mean execution_time
scenario:startup:insecure-bank:iast_TELEMETRY_OFF:AppSec better [-7.728ms; -3.445ms] or [-13.989%; -6.236%] 49.653ms 55.239ms
scenario:startup:insecure-bank:tracing:GlobalTracer worse [+8.638ms; +17.397ms] or [+2.919%; +5.878%] 308.978ms 295.961ms
scenario:startup:petclinic:appsec:GlobalTracer worse [+9.887ms; +19.982ms] or [+3.341%; +6.752%] 310.863ms 295.928ms
scenario:startup:petclinic:tracing:GlobalTracer worse [+6.462ms; +13.744ms] or [+2.173%; +4.622%] 307.460ms 297.357ms
Startup time reports for petclinic
gantt
    title petclinic - global startup overhead: candidate=1.28.0-SNAPSHOT~c25937a4cc, baseline=1.29.0-SNAPSHOT~fcb4a55f20

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.06 s) : 0, 1059919
Total [baseline] (9.471 s) : 0, 9470649
Agent [candidate] (1.049 s) : 0, 1049455
Total [candidate] (9.339 s) : 0, 9338601
section appsec
Agent [baseline] (1.153 s) : 0, 1152902
Total [baseline] (9.48 s) : 0, 9480437
Agent [candidate] (1.162 s) : 0, 1162253
Total [candidate] (9.503 s) : 0, 9503410
section iast
Agent [baseline] (1.195 s) : 0, 1194624
Total [baseline] (9.647 s) : 0, 9646992
Agent [candidate] (1.179 s) : 0, 1178642
Total [candidate] (9.649 s) : 0, 9649215
section profiling
Agent [baseline] (1.277 s) : 0, 1277287
Total [baseline] (9.603 s) : 0, 9603415
Agent [candidate] (1.272 s) : 0, 1272488
Total [candidate] (9.594 s) : 0, 9593981
  • baseline results
Module Variant Duration Δ tracing
Agent tracing 1.06 s -
Agent appsec 1.153 s 92.983 ms (8.8%)
Agent iast 1.195 s 134.705 ms (12.7%)
Agent profiling 1.277 s 217.368 ms (20.5%)
Total tracing 9.471 s -
Total appsec 9.48 s 9.787 ms (0.1%)
Total iast 9.647 s 176.342 ms (1.9%)
Total profiling 9.603 s 132.765 ms (1.4%)
  • candidate results
Module Variant Duration Δ tracing
Agent tracing 1.049 s -
Agent appsec 1.162 s 112.798 ms (10.7%)
Agent iast 1.179 s 129.187 ms (12.3%)
Agent profiling 1.272 s 223.034 ms (21.3%)
Total tracing 9.339 s -
Total appsec 9.503 s 164.809 ms (1.8%)
Total iast 9.649 s 310.614 ms (3.3%)
Total profiling 9.594 s 255.38 ms (2.7%)
gantt
    title petclinic - break down per module: candidate=1.28.0-SNAPSHOT~c25937a4cc, baseline=1.29.0-SNAPSHOT~fcb4a55f20

    dateFormat X
    axisFormat %s
section tracing
BytebuddyAgent [baseline] (669.455 ms) : 0, 669455
BytebuddyAgent [candidate] (649.218 ms) : 0, 649218
GlobalTracer [baseline] (297.357 ms) : 0, 297357
GlobalTracer [candidate] (307.46 ms) : 0, 307460
AppSec [baseline] (50.736 ms) : 0, 50736
AppSec [candidate] (50.643 ms) : 0, 50643
Remote Config [baseline] (663.158 µs) : 0, 663
Remote Config [candidate] (674.114 µs) : 0, 674
Telemetry [baseline] (7.242 ms) : 0, 7242
Telemetry [candidate] (7.249 ms) : 0, 7249
section appsec
BytebuddyAgent [baseline] (666.638 ms) : 0, 666638
BytebuddyAgent [candidate] (659.107 ms) : 0, 659107
GlobalTracer [baseline] (295.928 ms) : 0, 295928
GlobalTracer [candidate] (310.863 ms) : 0, 310863
AppSec [baseline] (148.475 ms) : 0, 148475
AppSec [candidate] (149.932 ms) : 0, 149932
Remote Config [baseline] (643.743 µs) : 0, 644
Remote Config [candidate] (658.353 µs) : 0, 658
Telemetry [baseline] (6.9 ms) : 0, 6900
Telemetry [candidate] (7.009 ms) : 0, 7009
section iast
BytebuddyAgent [baseline] (788.024 ms) : 0, 788024
BytebuddyAgent [candidate] (776.13 ms) : 0, 776130
GlobalTracer [baseline] (291.033 ms) : 0, 291033
GlobalTracer [candidate] (288.0 ms) : 0, 288000
AppSec [baseline] (52.92 ms) : 0, 52920
AppSec [candidate] (49.782 ms) : 0, 49782
Remote Config [baseline] (634.999 µs) : 0, 635
Remote Config [candidate] (567.755 µs) : 0, 568
Telemetry [baseline] (7.545 ms) : 0, 7545
Telemetry [candidate] (6.512 ms) : 0, 6512
IAST [baseline] (19.486 ms) : 0, 19486
IAST [candidate] (23.129 ms) : 0, 23129
section profiling
BytebuddyAgent [baseline] (662.93 ms) : 0, 662930
BytebuddyAgent [candidate] (660.825 ms) : 0, 660825
GlobalTracer [baseline] (376.465 ms) : 0, 376465
GlobalTracer [candidate] (375.424 ms) : 0, 375424
AppSec [baseline] (51.28 ms) : 0, 51280
AppSec [candidate] (51.022 ms) : 0, 51022
Remote Config [baseline] (988.611 µs) : 0, 989
Remote Config [candidate] (1.009 ms) : 0, 1009
Telemetry [baseline] (7.288 ms) : 0, 7288
Telemetry [candidate] (7.167 ms) : 0, 7167
ProfilingAgent [baseline] (123.86 ms) : 0, 123860
ProfilingAgent [candidate] (122.801 ms) : 0, 122801
Profiling [baseline] (123.887 ms) : 0, 123887
Profiling [candidate] (122.827 ms) : 0, 122827
Startup time reports for insecure-bank
gantt
    title insecure-bank - global startup overhead: candidate=1.28.0-SNAPSHOT~c25937a4cc, baseline=1.29.0-SNAPSHOT~fcb4a55f20

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.055 s) : 0, 1055285
Total [baseline] (8.742 s) : 0, 8741766
Agent [candidate] (1.057 s) : 0, 1056648
Total [candidate] (8.728 s) : 0, 8727608
section iast
Agent [baseline] (1.178 s) : 0, 1178318
Total [baseline] (9.314 s) : 0, 9314329
Agent [candidate] (1.172 s) : 0, 1172436
Total [candidate] (9.275 s) : 0, 9275058
section iast_TELEMETRY_OFF
Agent [baseline] (1.165 s) : 0, 1165245
Total [baseline] (9.239 s) : 0, 9239481
Agent [candidate] (1.17 s) : 0, 1170140
Total [candidate] (9.224 s) : 0, 9224129
  • baseline results
Module Variant Duration Δ tracing
Agent tracing 1.055 s -
Agent iast 1.178 s 123.033 ms (11.7%)
Agent iast_TELEMETRY_OFF 1.165 s 109.96 ms (10.4%)
Total tracing 8.742 s -
Total iast 9.314 s 572.563 ms (6.5%)
Total iast_TELEMETRY_OFF 9.239 s 497.715 ms (5.7%)
  • candidate results
Module Variant Duration Δ tracing
Agent tracing 1.057 s -
Agent iast 1.172 s 115.788 ms (11.0%)
Agent iast_TELEMETRY_OFF 1.17 s 113.492 ms (10.7%)
Total tracing 8.728 s -
Total iast 9.275 s 547.45 ms (6.3%)
Total iast_TELEMETRY_OFF 9.224 s 496.521 ms (5.7%)
gantt
    title insecure-bank - break down per module: candidate=1.28.0-SNAPSHOT~c25937a4cc, baseline=1.29.0-SNAPSHOT~fcb4a55f20

    dateFormat X
    axisFormat %s
section tracing
BytebuddyAgent [baseline] (666.412 ms) : 0, 666412
BytebuddyAgent [candidate] (654.125 ms) : 0, 654125
GlobalTracer [baseline] (295.961 ms) : 0, 295961
GlobalTracer [candidate] (308.978 ms) : 0, 308978
AppSec [baseline] (50.61 ms) : 0, 50610
AppSec [candidate] (51.185 ms) : 0, 51185
Remote Config [baseline] (671.163 µs) : 0, 671
Remote Config [candidate] (676.966 µs) : 0, 677
Telemetry [baseline] (7.286 ms) : 0, 7286
Telemetry [candidate] (7.239 ms) : 0, 7239
section iast
BytebuddyAgent [baseline] (774.269 ms) : 0, 774269
BytebuddyAgent [candidate] (772.091 ms) : 0, 772091
GlobalTracer [baseline] (288.777 ms) : 0, 288777
GlobalTracer [candidate] (286.808 ms) : 0, 286808
AppSec [baseline] (53.177 ms) : 0, 53177
AppSec [candidate] (52.32 ms) : 0, 52320
IAST [baseline] (19.664 ms) : 0, 19664
IAST [candidate] (19.725 ms) : 0, 19725
Remote Config [baseline] (618.73 µs) : 0, 619
Remote Config [candidate] (568.245 µs) : 0, 568
Telemetry [baseline] (7.427 ms) : 0, 7427
Telemetry [candidate] (6.49 ms) : 0, 6490
section iast_TELEMETRY_OFF
BytebuddyAgent [baseline] (764.968 ms) : 0, 764968
BytebuddyAgent [candidate] (769.086 ms) : 0, 769086
GlobalTracer [baseline] (285.618 ms) : 0, 285618
GlobalTracer [candidate] (287.847 ms) : 0, 287847
AppSec [baseline] (55.239 ms) : 0, 55239
AppSec [candidate] (49.653 ms) : 0, 49653
IAST [baseline] (18.242 ms) : 0, 18242
IAST [candidate] (21.237 ms) : 0, 21237
Remote Config [baseline] (599.923 µs) : 0, 600
Remote Config [candidate] (1.299 ms) : 0, 1299
Telemetry [baseline] (6.349 ms) : 0, 6349
Telemetry [candidate] (6.51 ms) : 0, 6510

Load

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
end_time 2024-01-20T04:55:19 2024-01-20T05:11:57
git_branch master piotr-wolski/add-high-watermark
git_commit_date 1705694656 1705724754
git_commit_sha fcb4a55f20 c25937a4cc
release_version 1.29.0-SNAPSHOT~fcb4a55f20 1.28.0-SNAPSHOT~c25937a4cc
start_time 2024-01-20T04:55:06 2024-01-20T05:11:44
See matching parameters
Baseline Candidate
application insecure-bank insecure-bank
ci_job_date 1705727774 1705727774
ci_job_id 413836885 413836885
ci_pipeline_id 26879332 26879332
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
variant iast iast

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 8 metrics, 14 unstable metrics.

Request duration reports for petclinic
gantt
    title petclinic - request duration [CI 0.99] : candidate=1.28.0-SNAPSHOT~c25937a4cc, baseline=1.29.0-SNAPSHOT~fcb4a55f20
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.344 ms) : 1325, 1363
.   : milestone, 1344,
appsec (1.787 ms) : 1762, 1813
.   : milestone, 1787,
iast (1.524 ms) : 1500, 1549
.   : milestone, 1524,
profiling (1.515 ms) : 1490, 1540
.   : milestone, 1515,
tracing (1.495 ms) : 1471, 1520
.   : milestone, 1495,
section candidate
no_agent (1.352 ms) : 1333, 1371
.   : milestone, 1352,
appsec (1.779 ms) : 1753, 1804
.   : milestone, 1779,
iast (1.514 ms) : 1489, 1538
.   : milestone, 1514,
profiling (1.522 ms) : 1497, 1547
.   : milestone, 1522,
tracing (1.512 ms) : 1487, 1537
.   : milestone, 1512,
  • baseline results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 1.344 ms [1.325 ms, 1.363 ms] -
appsec 1.787 ms [1.762 ms, 1.813 ms] 443.634 µs (33.0%)
iast 1.524 ms [1.5 ms, 1.549 ms] 180.406 µs (13.4%)
profiling 1.515 ms [1.49 ms, 1.54 ms] 171.289 µs (12.7%)
tracing 1.495 ms [1.471 ms, 1.52 ms] 151.374 µs (11.3%)
  • candidate results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 1.352 ms [1.333 ms, 1.371 ms] -
appsec 1.779 ms [1.753 ms, 1.804 ms] 426.859 µs (31.6%)
iast 1.514 ms [1.489 ms, 1.538 ms] 161.881 µs (12.0%)
profiling 1.522 ms [1.497 ms, 1.547 ms] 170.339 µs (12.6%)
tracing 1.512 ms [1.487 ms, 1.537 ms] 159.932 µs (11.8%)
Request duration reports for insecure-bank
gantt
    title insecure-bank - request duration [CI 0.99] : candidate=1.28.0-SNAPSHOT~c25937a4cc, baseline=1.29.0-SNAPSHOT~fcb4a55f20
    dateFormat X
    axisFormat %s
section baseline
no_agent (361.449 µs) : 342, 381
.   : milestone, 361,
iast (477.627 µs) : 457, 498
.   : milestone, 478,
iast_FULL (548.129 µs) : 527, 569
.   : milestone, 548,
iast_INACTIVE (451.63 µs) : 430, 473
.   : milestone, 452,
iast_TELEMETRY_OFF (469.503 µs) : 449, 490
.   : milestone, 470,
tracing (443.754 µs) : 423, 465
.   : milestone, 444,
section candidate
no_agent (370.102 µs) : 350, 390
.   : milestone, 370,
iast (480.12 µs) : 459, 501
.   : milestone, 480,
iast_FULL (550.931 µs) : 531, 571
.   : milestone, 551,
iast_INACTIVE (453.108 µs) : 431, 475
.   : milestone, 453,
iast_TELEMETRY_OFF (476.512 µs) : 455, 498
.   : milestone, 477,
tracing (441.924 µs) : 421, 463
.   : milestone, 442,
  • baseline results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 361.449 µs [341.874 µs, 381.024 µs] -
iast 477.627 µs [457.353 µs, 497.901 µs] 116.179 µs (32.1%)
iast_FULL 548.129 µs [527.056 µs, 569.202 µs] 186.68 µs (51.6%)
iast_INACTIVE 451.63 µs [430.146 µs, 473.113 µs] 90.181 µs (24.9%)
iast_TELEMETRY_OFF 469.503 µs [449.082 µs, 489.923 µs] 108.054 µs (29.9%)
tracing 443.754 µs [422.997 µs, 464.51 µs] 82.305 µs (22.8%)
  • candidate results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 370.102 µs [350.323 µs, 389.881 µs] -
iast 480.12 µs [459.304 µs, 500.935 µs] 110.017 µs (29.7%)
iast_FULL 550.931 µs [530.542 µs, 571.32 µs] 180.829 µs (48.9%)
iast_INACTIVE 453.108 µs [431.291 µs, 474.925 µs] 83.005 µs (22.4%)
iast_TELEMETRY_OFF 476.512 µs [455.231 µs, 497.794 µs] 106.41 µs (28.8%)
tracing 441.924 µs [420.938 µs, 462.911 µs] 71.822 µs (19.4%)

pr-commenter[bot] avatar Jan 18 '24 19:01 pr-commenter[bot]

Kafka / producer-benchmark

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master piotr-wolski/add-high-watermark
git_commit_date 1704464857 1705724754
git_commit_sha 260cceba3999a7c1f7bf1ccc2c4023556dca8463 c25937a4cc00aa1acad35023f12760da77f25cff
See matching parameters
Baseline Candidate
ci_job_date 1705726286 1705726286
ci_job_id 413836886 413836886
ci_pipeline_id 26879332 26879332
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
jdkVersion 11.0.21 11.0.21
jmhVersion 1.36 1.36
jvm /usr/lib/jvm/java-11-openjdk-amd64/bin/java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
jvmArgs -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/producer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/producer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
vmName OpenJDK 64-Bit Server VM OpenJDK 64-Bit Server VM
vmVersion 11.0.21+9-post-Ubuntu-0ubuntu122.04 11.0.21+9-post-Ubuntu-0ubuntu122.04

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics.

See unchanged results
scenario Δ mean throughput
scenario:not-instrumented/KafkaProduceBenchmark.benchProduce unsure [-44690.245op/s; -3140.556op/s] or [-2.455%; -0.173%]
scenario:only-tracing-dsm-disabled-benchmarks/KafkaProduceBenchmark.benchProduce same
scenario:only-tracing-dsm-enabled-benchmarks/KafkaProduceBenchmark.benchProduce same

pr-commenter[bot] avatar Jan 18 '24 19:01 pr-commenter[bot]

Kafka / consumer-benchmark

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master piotr-wolski/add-high-watermark
git_commit_date 1704464857 1705724754
git_commit_sha 260cceba3999a7c1f7bf1ccc2c4023556dca8463 c25937a4cc00aa1acad35023f12760da77f25cff
See matching parameters
Baseline Candidate
ci_job_date 1705726325 1705726325
ci_job_id 413836887 413836887
ci_pipeline_id 26879332 26879332
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
jdkVersion 11.0.21 11.0.21
jmhVersion 1.36 1.36
jvm /usr/lib/jvm/java-11-openjdk-amd64/bin/java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
jvmArgs -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/consumer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/consumer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
vmName OpenJDK 64-Bit Server VM OpenJDK 64-Bit Server VM
vmVersion 11.0.21+9-post-Ubuntu-0ubuntu122.04 11.0.21+9-post-Ubuntu-0ubuntu122.04

Summary

Found 0 performance improvements and 1 performance regressions! Performance is the same for 2 metrics, 0 unstable metrics.

scenario Δ mean throughput
scenario:only-tracing-dsm-enabled-benchmarks/KafkaConsumerBenchmark.benchConsume worse [-19781.802op/s; -6371.470op/s] or [-6.405%; -2.063%]
See unchanged results
scenario Δ mean throughput
scenario:not-instrumented/KafkaConsumerBenchmark.benchConsume same
scenario:only-tracing-dsm-disabled-benchmarks/KafkaConsumerBenchmark.benchConsume same

pr-commenter[bot] avatar Jan 19 '24 05:01 pr-commenter[bot]

But I didn't find a way to hook into a place that is updated regularly.

Did you find a place but were not able to hook into it? Or were you not able to find the right place?

PerfectSlayer avatar Jan 22 '24 13:01 PerfectSlayer

But I didn't find a way to hook into a place that is updated regularly.

Did you find a place but were not able to hook into it? Or were you not able to find the right place?

Ah, @kr-igor suggested using reflection to access the high watermark offset. So I did that, and added the instrumentation in the same place where we capture commit offsets. The benefit is that computing lag requires both commit offsets and high watermark offsets, and they are now captured in the same place.
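For readers unfamiliar with the reflection approach, here is a minimal, self-contained sketch of reading a private field reflectively. `PartitionState` and its `highWatermark` field are hypothetical stand-ins, not real Kafka client classes, and this is not necessarily how the PR does it.

```java
import java.lang.reflect.Field;

// Illustration only: the class and field names below are assumptions.
final class ReflectionSketch {

    // Hypothetical stand-in for an internal client object holding the offset.
    static final class PartitionState {
        private Long highWatermark;

        PartitionState(Long highWatermark) {
            this.highWatermark = highWatermark;
        }
    }

    // Reads the private "highWatermark" field reflectively. Returns null when
    // the field is missing, inaccessible, or not a Long, so a change in the
    // target class's layout degrades to "no data" rather than an exception.
    static Long readHighWatermark(Object state) {
        try {
            Field field = state.getClass().getDeclaredField("highWatermark");
            field.setAccessible(true);
            Object value = field.get(state);
            return (value instanceof Long) ? (Long) value : null;
        } catch (ReflectiveOperationException | RuntimeException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(readHighWatermark(new PartitionState(42L)));
    }
}
```

Failing soft matters here because internal fields can change between library versions; instrumentation built this way should treat a missing value as normal.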

piochelepiotr avatar Jan 22 '24 15:01 piochelepiotr

Closing for now in favor of: https://github.com/DataDog/integrations-core/pull/16889

piochelepiotr avatar Apr 01 '24 20:04 piochelepiotr