
[WIP][exporter/elasticsearch] Use exporterhelper/batchsender for reliability

Open · carsonip opened this pull request 1 year ago

Description: <Describe what has changed.>

WIP.

Blocked by:

  • batch by byte in https://github.com/open-telemetry/opentelemetry-collector/issues/8122
    • batch by byte is an existing capability of the elasticsearch exporter; not supporting it would be a breaking change, since it removes a feature.
  • https://github.com/open-telemetry/opentelemetry-collector/issues/9952
  • https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32359

Link to tracking Issue: Fixes #32377

Testing: <Describe what testing was performed and which tests were added.>

Documentation: <Describe the documentation added.>

carsonip avatar Apr 23 '24 10:04 carsonip

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar May 08 '24 05:05 github-actions[bot]

Expecting a lot of README merge conflicts with #33367

carsonip avatar Jun 10 '24 17:06 carsonip

Still needs a bit of work to make the benchmarks work, because the benchmarks measure generated count instead of exported doc count. Since sending_queue is enabled, the rate of Consume* calls will effectively be infinite until the sending queue is full.
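For context, a minimal, self-contained Go sketch (illustrative only, not the exporter's actual code) of why measuring throughput at the Consume* boundary overstates export throughput once a sending queue decouples the two:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	queue := make(chan string, 1024) // stands in for sending_queue

	// A single consumer drains the queue slowly, simulating batching and export latency.
	go func() {
		for doc := range queue {
			_ = doc
			time.Sleep(time.Millisecond)
		}
	}()

	start := time.Now()
	for i := 0; i < 1024; i++ {
		// "Consume" returns as soon as the item is enqueued, so until the
		// queue is full this loop runs at in-memory speed, not export speed.
		queue <- fmt.Sprintf("doc-%d", i)
	}
	fmt.Printf("accepted 1024 docs in %s, most of them not exported yet\n", time.Since(start))
}
```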

carsonip avatar Jun 13 '24 19:06 carsonip

@carsonip should we disable the queue for benchmarks, so they're focused just on the ES exporter perf?

axw avatar Jun 13 '24 23:06 axw

@carsonip should we disable the queue for benchmarks, so they're focused just on the ES exporter perf?

I believe it will still break the benchmarks, because the batch sender will block the Consume* call, and the bench is designed to call Consume* serially instead of in parallel, resulting in only 1 request in each batch flushed every flush_interval (batcher.flush_timeout). Either way, the benchmarks will need to be updated.
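To make the difference concrete, a hedged sketch (driveSerial and driveParallel are illustrative helpers, not the actual benchmark harness) of a serial driver versus a RunParallel-style driver:

```go
package integrationtest

import (
	"context"
	"testing"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
)

// driveSerial mirrors a benchmark that calls ConsumeLogs from a single
// goroutine: with a blocking batch sender, each call waits for a flush,
// so at most one request contributes to each batch per flush_timeout.
func driveSerial(b *testing.B, exp consumer.Logs, logs plog.Logs) {
	for i := 0; i < b.N; i++ {
		if err := exp.ConsumeLogs(context.Background(), logs); err != nil {
			b.Fatal(err)
		}
	}
}

// driveParallel keeps several ConsumeLogs calls pending at once, letting the
// batch sender accumulate a full batch before its timeout fires.
func driveParallel(b *testing.B, exp consumer.Logs, logs plog.Logs) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			if err := exp.ConsumeLogs(context.Background(), logs); err != nil {
				b.Error(err)
			}
		}
	})
}
```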

carsonip avatar Jun 14 '24 10:06 carsonip

the benchmarks measure generated count instead of exported doc count

IIRC, I did this because I wanted to avoid decoding the bulk request on the ES side, which would pollute the benchmark profiling. The benchmarks assume that all other tests pass (i.e. the exporter is in working condition) and calculate events/s from the total events generated and the total bench time.
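As a rough sketch of that calculation (assuming a helper in the benchmark harness, not the actual code):

```go
package integrationtest

import "testing"

// reportEventsPerSec derives throughput from the number of generated events
// and the elapsed benchmark time, so nothing needs to be decoded on the fake
// Elasticsearch side while profiling.
func reportEventsPerSec(b *testing.B, generatedEvents int) {
	b.ReportMetric(float64(generatedEvents)/b.Elapsed().Seconds(), "events/s")
}
```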

the rate of Consume* calls will effectively be infinite until the sending queue is full

Will it be possible to monitor the state of the queue? If we are able to wait until the queue is empty at the end of the benchmark, then we should be golden.

should we disable the queue for benchmarks, so they're focused just on the ES exporter perf?

+1 on this. I think we should have benchmarks for both cases (with and without the queue), given that we can make the queue benchmark report correct numbers by waiting for the queue to be empty as the signal to end the benchmark.
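A hedged sketch of that drain-then-stop idea (queueSize is a hypothetical accessor; in practice it would need to be wired to the exporter's queue metrics):

```go
package integrationtest

import (
	"testing"
	"time"
)

// waitForQueueDrain polls a queue-depth accessor until the sending queue is
// empty, so timing only stops once everything has actually been exported.
// queueSize is hypothetical and stands in for a real queue-size metric.
func waitForQueueDrain(b *testing.B, queueSize func() int, timeout time.Duration) {
	deadline := time.Now().Add(timeout)
	for queueSize() > 0 {
		if time.Now().After(deadline) {
			b.Fatalf("sending queue not drained within %s", timeout)
		}
		time.Sleep(10 * time.Millisecond)
	}
}
```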

lahsivjar avatar Jun 17 '24 09:06 lahsivjar

Benchmark results compared to main with #33621:

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/elasticsearchexporter/integrationtest
cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
                                     │   main.out   │            batchsender.out            │
                                     │    sec/op    │    sec/op      vs base                │
Exporter/logs/none/small_batch-16      31.65µ ±  1%    67.06µ ± 31%  +111.89% (p=0.002 n=6)
Exporter/logs/none/medium_batch-16     309.0µ ±  5%    670.9µ ± 57%  +117.11% (p=0.002 n=6)
Exporter/logs/none/large_batch-16      3.012m ±  6%   10.727m ±  2%  +256.17% (p=0.002 n=6)
Exporter/logs/none/xlarge_batch-16     34.74m ±  2%   107.16m ±  1%  +208.47% (p=0.002 n=6)
Exporter/logs/ecs/small_batch-16       28.29µ ±  3%    47.77µ ± 48%   +68.89% (p=0.002 n=6)
Exporter/logs/ecs/medium_batch-16      278.8µ ± 11%    824.9µ ± 43%  +195.86% (p=0.002 n=6)
Exporter/logs/ecs/large_batch-16       2.811m ±  7%    8.381m ±  2%  +198.11% (p=0.002 n=6)
Exporter/logs/ecs/xlarge_batch-16      33.28m ±  3%    82.66m ±  3%  +148.39% (p=0.002 n=6)
Exporter/logs/raw/small_batch-16       32.07µ ±  6%    61.46µ ± 18%   +91.62% (p=0.002 n=6)
Exporter/logs/raw/medium_batch-16      306.6µ ±  4%    971.4µ ± 45%  +216.82% (p=0.002 n=6)
Exporter/logs/raw/large_batch-16       2.987m ±  5%    9.450m ±  3%  +216.43% (p=0.002 n=6)
Exporter/logs/raw/xlarge_batch-16      34.73m ±  2%    95.05m ±  2%  +173.73% (p=0.002 n=6)
Exporter/traces/none/small_batch-16    31.44µ ±  8%    69.13µ ± 37%  +119.91% (p=0.002 n=6)
Exporter/traces/none/medium_batch-16   316.8µ ± 10%    689.0µ ± 63%  +117.46% (p=0.002 n=6)
Exporter/traces/none/large_batch-16    3.181m ±  5%   11.444m ±  2%  +259.81% (p=0.002 n=6)
Exporter/traces/none/xlarge_batch-16   33.99m ±  3%   113.22m ±  4%  +233.06% (p=0.002 n=6)
Exporter/traces/ecs/small_batch-16     31.39µ ±  3%    67.03µ ± 22%  +113.56% (p=0.002 n=6)
Exporter/traces/ecs/medium_batch-16    299.5µ ±  6%    687.8µ ± 55%  +129.61% (p=0.002 n=6)
Exporter/traces/ecs/large_batch-16     3.054m ±  3%   11.368m ±  2%  +272.20% (p=0.002 n=6)
Exporter/traces/ecs/xlarge_batch-16    33.63m ±  2%   112.89m ±  4%  +235.67% (p=0.002 n=6)
Exporter/traces/raw/small_batch-16     30.31µ ±  2%    62.30µ ± 48%  +105.58% (p=0.002 n=6)
Exporter/traces/raw/medium_batch-16    287.2µ ±  3%    620.2µ ± 74%  +115.94% (p=0.002 n=6)
Exporter/traces/raw/large_batch-16     3.176m ±  9%   10.662m ±  8%  +235.67% (p=0.002 n=6)
Exporter/traces/raw/xlarge_batch-16    34.15m ±  4%   106.13m ±  4%  +210.80% (p=0.002 n=6)
geomean                                988.4µ          2.631m        +166.17%

                                     │   main.out    │           batchsender.out            │
                                     │    docs/s     │    docs/s      vs base               │
Exporter/logs/none/small_batch-16       316.0k ±  1%   149.1k ±  24%  -52.80% (p=0.002 n=6)
Exporter/logs/none/medium_batch-16      323.6k ±  5%   123.9k ±  67%  -61.72% (p=0.002 n=6)
Exporter/logs/none/large_batch-16      332.04k ±  6%   55.22k ±   1%  -83.37% (p=0.002 n=6)
Exporter/logs/none/xlarge_batch-16     287.87k ±  2%   48.53k ±   8%  -83.14% (p=0.002 n=6)
Exporter/logs/ecs/small_batch-16        353.5k ±  3%   209.5k ±  33%  -40.74% (p=0.002 n=6)
Exporter/logs/ecs/medium_batch-16      358.69k ± 10%   70.51k ± 135%  -80.34% (p=0.002 n=6)
Exporter/logs/ecs/large_batch-16       355.72k ±  6%   58.47k ±   2%  -83.56% (p=0.002 n=6)
Exporter/logs/ecs/xlarge_batch-16      300.55k ±  3%   56.49k ±  17%  -81.21% (p=0.002 n=6)
Exporter/logs/raw/small_batch-16        312.1k ±  7%   162.7k ±  22%  -47.87% (p=0.002 n=6)
Exporter/logs/raw/medium_batch-16      326.14k ±  3%   60.88k ± 160%  -81.33% (p=0.002 n=6)
Exporter/logs/raw/large_batch-16       334.88k ±  5%   59.23k ±  45%  -82.31% (p=0.002 n=6)
Exporter/logs/raw/xlarge_batch-16      287.97k ±  2%   53.11k ±   2%  -81.56% (p=0.002 n=6)
Exporter/traces/none/small_batch-16     318.1k ±  7%   144.7k ±  27%  -54.52% (p=0.002 n=6)
Exporter/traces/none/medium_batch-16   315.62k ±  9%   98.08k ±  61%  -68.93% (p=0.002 n=6)
Exporter/traces/none/large_batch-16    314.46k ±  5%   42.12k ±   5%  -86.60% (p=0.002 n=6)
Exporter/traces/none/xlarge_batch-16   294.19k ±  3%   45.93k ±   7%  -84.39% (p=0.002 n=6)
Exporter/traces/ecs/small_batch-16      318.6k ±  3%   149.3k ±  22%  -53.14% (p=0.002 n=6)
Exporter/traces/ecs/medium_batch-16    333.84k ±  5%   97.21k ±  39%  -70.88% (p=0.002 n=6)
Exporter/traces/ecs/large_batch-16     327.41k ±  3%   41.61k ±   2%  -87.29% (p=0.002 n=6)
Exporter/traces/ecs/xlarge_batch-16    297.34k ±  2%   44.76k ±   9%  -84.95% (p=0.002 n=6)
Exporter/traces/raw/small_batch-16      330.0k ±  2%   160.5k ±  32%  -51.35% (p=0.002 n=6)
Exporter/traces/raw/medium_batch-16     348.2k ±  3%   113.8k ±  61%  -67.32% (p=0.002 n=6)
Exporter/traces/raw/large_batch-16     315.00k ± 10%   45.11k ±   8%  -85.68% (p=0.002 n=6)
Exporter/traces/raw/xlarge_batch-16    292.86k ±  4%   40.81k ±   4%  -86.07% (p=0.002 n=6)
geomean                                 320.0k         76.98k         -75.94%

                                     │   main.out    │           batchsender.out           │
                                     │   events/s    │   events/s    vs base               │
Exporter/logs/none/small_batch-16       316.0k ±  1%   149.1k ± 24%  -52.80% (p=0.002 n=6)
Exporter/logs/none/medium_batch-16      323.6k ±  5%   149.3k ± 36%  -53.87% (p=0.002 n=6)
Exporter/logs/none/large_batch-16      332.04k ±  6%   93.23k ±  2%  -71.92% (p=0.002 n=6)
Exporter/logs/none/xlarge_batch-16     287.87k ±  2%   93.32k ±  1%  -67.58% (p=0.002 n=6)
Exporter/logs/ecs/small_batch-16        353.5k ±  3%   209.5k ± 33%  -40.74% (p=0.002 n=6)
Exporter/logs/ecs/medium_batch-16       358.7k ± 10%   121.2k ± 76%  -66.20% (p=0.002 n=6)
Exporter/logs/ecs/large_batch-16        355.7k ±  6%   119.3k ±  2%  -66.46% (p=0.002 n=6)
Exporter/logs/ecs/xlarge_batch-16       300.6k ±  3%   121.0k ±  3%  -59.75% (p=0.002 n=6)
Exporter/logs/raw/small_batch-16        312.1k ±  7%   162.7k ± 22%  -47.87% (p=0.002 n=6)
Exporter/logs/raw/medium_batch-16       326.1k ±  3%   102.9k ± 81%  -68.44% (p=0.002 n=6)
Exporter/logs/raw/large_batch-16        334.9k ±  5%   105.8k ±  3%  -68.40% (p=0.002 n=6)
Exporter/logs/raw/xlarge_batch-16       288.0k ±  2%   105.2k ±  2%  -63.47% (p=0.002 n=6)
Exporter/traces/none/small_batch-16     318.1k ±  7%   144.7k ± 27%  -54.52% (p=0.002 n=6)
Exporter/traces/none/medium_batch-16    315.6k ±  9%   146.0k ± 39%  -53.74% (p=0.002 n=6)
Exporter/traces/none/large_batch-16    314.46k ±  5%   87.38k ±  2%  -72.21% (p=0.002 n=6)
Exporter/traces/none/xlarge_batch-16   294.19k ±  3%   88.32k ±  4%  -69.98% (p=0.002 n=6)
Exporter/traces/ecs/small_batch-16      318.6k ±  3%   149.3k ± 22%  -53.14% (p=0.002 n=6)
Exporter/traces/ecs/medium_batch-16     333.8k ±  5%   145.4k ± 35%  -56.45% (p=0.002 n=6)
Exporter/traces/ecs/large_batch-16     327.41k ±  3%   87.97k ±  2%  -73.13% (p=0.002 n=6)
Exporter/traces/ecs/xlarge_batch-16    297.34k ±  2%   88.59k ±  4%  -70.21% (p=0.002 n=6)
Exporter/traces/raw/small_batch-16      330.0k ±  2%   160.5k ± 32%  -51.35% (p=0.002 n=6)
Exporter/traces/raw/medium_batch-16     348.2k ±  3%   161.3k ± 42%  -53.69% (p=0.002 n=6)
Exporter/traces/raw/large_batch-16     315.00k ± 10%   93.79k ±  9%  -70.22% (p=0.002 n=6)
Exporter/traces/raw/xlarge_batch-16    292.86k ±  4%   94.24k ±  4%  -67.82% (p=0.002 n=6)
geomean                                 320.0k         120.3k        -62.42%

                                     │   main.out   │           batchsender.out           │
                                     │     B/op     │     B/op      vs base               │
Exporter/logs/none/small_batch-16      61.69Ki ± 0%   71.60Ki ± 1%  +16.07% (p=0.002 n=6)
Exporter/logs/none/medium_batch-16     610.5Ki ± 0%   728.1Ki ± 4%  +19.26% (p=0.002 n=6)
Exporter/logs/none/large_batch-16      5.974Mi ± 0%   7.079Mi ± 0%  +18.50% (p=0.002 n=6)
Exporter/logs/none/xlarge_batch-16     59.48Mi ± 0%   69.75Mi ± 1%  +17.28% (p=0.002 n=6)
Exporter/logs/ecs/small_batch-16       53.24Ki ± 0%   61.58Ki ± 2%  +15.65% (p=0.002 n=6)
Exporter/logs/ecs/medium_batch-16      526.2Ki ± 0%   642.3Ki ± 4%  +22.07% (p=0.002 n=6)
Exporter/logs/ecs/large_batch-16       5.135Mi ± 0%   6.440Mi ± 0%  +25.42% (p=0.002 n=6)
Exporter/logs/ecs/xlarge_batch-16      51.26Mi ± 0%   63.91Mi ± 7%  +24.67% (p=0.002 n=6)
Exporter/logs/raw/small_batch-16       60.29Ki ± 0%   69.71Ki ± 2%  +15.62% (p=0.002 n=6)
Exporter/logs/raw/medium_batch-16      598.1Ki ± 0%   704.8Ki ± 3%  +17.84% (p=0.002 n=6)
Exporter/logs/raw/large_batch-16       5.849Mi ± 0%   6.836Mi ± 2%  +16.87% (p=0.002 n=6)
Exporter/logs/raw/xlarge_batch-16      58.32Mi ± 0%   67.95Mi ± 0%  +16.50% (p=0.002 n=6)
Exporter/traces/none/small_batch-16    63.12Ki ± 0%   77.48Ki ± 2%  +22.75% (p=0.002 n=6)
Exporter/traces/none/medium_batch-16   625.5Ki ± 0%   786.8Ki ± 2%  +25.80% (p=0.002 n=6)
Exporter/traces/none/large_batch-16    6.120Mi ± 0%   7.693Mi ± 1%  +25.71% (p=0.002 n=6)
Exporter/traces/none/xlarge_batch-16   61.18Mi ± 0%   78.36Mi ± 1%  +28.07% (p=0.002 n=6)
Exporter/traces/ecs/small_batch-16     63.16Ki ± 0%   76.90Ki ± 2%  +21.75% (p=0.002 n=6)
Exporter/traces/ecs/medium_batch-16    625.5Ki ± 0%   796.8Ki ± 4%  +27.38% (p=0.002 n=6)
Exporter/traces/ecs/large_batch-16     6.136Mi ± 0%   7.705Mi ± 0%  +25.56% (p=0.002 n=6)
Exporter/traces/ecs/xlarge_batch-16    61.12Mi ± 0%   77.64Mi ± 2%  +27.04% (p=0.002 n=6)
Exporter/traces/raw/small_batch-16     62.26Ki ± 1%   75.88Ki ± 3%  +21.89% (p=0.002 n=6)
Exporter/traces/raw/medium_batch-16    616.2Ki ± 0%   770.6Ki ± 2%  +25.04% (p=0.002 n=6)
Exporter/traces/raw/large_batch-16     6.014Mi ± 0%   7.534Mi ± 2%  +25.26% (p=0.002 n=6)
Exporter/traces/raw/xlarge_batch-16    60.22Mi ± 0%   74.45Mi ± 0%  +23.62% (p=0.002 n=6)
geomean                                1.856Mi        2.261Mi       +21.83%

                                     │  main.out   │          batchsender.out          │
                                     │  allocs/op  │  allocs/op   vs base              │
Exporter/logs/none/small_batch-16       350.0 ± 0%    342.0 ± 0%  -2.29% (p=0.002 n=6)
Exporter/logs/none/medium_batch-16     3.417k ± 0%   3.404k ± 0%  -0.37% (p=0.002 n=6)
Exporter/logs/none/large_batch-16      34.07k ± 0%   34.03k ± 0%  -0.11% (p=0.002 n=6)
Exporter/logs/none/xlarge_batch-16     340.8k ± 0%   340.3k ± 0%  -0.14% (p=0.002 n=6)
Exporter/logs/ecs/small_batch-16        190.0 ± 0%    182.0 ± 0%  -4.21% (p=0.002 n=6)
Exporter/logs/ecs/medium_batch-16      1.816k ± 0%   1.804k ± 0%  -0.66% (p=0.002 n=6)
Exporter/logs/ecs/large_batch-16       18.07k ± 0%   18.03k ± 0%  -0.26% (p=0.002 n=6)
Exporter/logs/ecs/xlarge_batch-16      180.8k ± 0%   180.2k ± 0%  -0.29% (p=0.002 n=6)
Exporter/logs/raw/small_batch-16        290.0 ± 0%    282.0 ± 0%  -2.76% (p=0.002 n=6)
Exporter/logs/raw/medium_batch-16      2.814k ± 0%   2.797k ± 0%  -0.60% (p=0.002 n=6)
Exporter/logs/raw/large_batch-16       28.05k ± 0%   27.95k ± 0%  -0.33% (p=0.002 n=6)
Exporter/logs/raw/xlarge_batch-16      280.7k ± 0%   280.1k ± 0%  -0.21% (p=0.002 n=6)
Exporter/traces/none/small_batch-16     390.0 ± 0%    382.0 ± 0%  -2.05% (p=0.002 n=6)
Exporter/traces/none/medium_batch-16   3.817k ± 0%   3.805k ± 0%  -0.31% (p=0.002 n=6)
Exporter/traces/none/large_batch-16    38.08k ± 0%   38.04k ± 0%  -0.11% (p=0.002 n=6)
Exporter/traces/none/xlarge_batch-16   380.8k ± 0%   380.4k ± 0%  -0.10% (p=0.002 n=6)
Exporter/traces/ecs/small_batch-16      390.0 ± 0%    382.0 ± 0%  -2.05% (p=0.002 n=6)
Exporter/traces/ecs/medium_batch-16    3.817k ± 0%   3.805k ± 0%  -0.31% (p=0.002 n=6)
Exporter/traces/ecs/large_batch-16     38.08k ± 0%   38.04k ± 0%  -0.10% (p=0.002 n=6)
Exporter/traces/ecs/xlarge_batch-16    380.8k ± 0%   380.4k ± 0%  -0.11% (p=0.002 n=6)
Exporter/traces/raw/small_batch-16      370.0 ± 0%    362.0 ± 0%  -2.16% (p=0.002 n=6)
Exporter/traces/raw/medium_batch-16    3.616k ± 0%   3.605k ± 0%  -0.30% (p=0.002 n=6)
Exporter/traces/raw/large_batch-16     36.08k ± 0%   36.04k ± 0%  -0.13% (p=0.002 n=6)
Exporter/traces/raw/xlarge_batch-16    360.8k ± 0%   360.3k ± 0%  -0.14% (p=0.002 n=6)
geomean                                9.916k        9.832k       -0.84%

Also an important point to note: with this PR the docs/s is not always equal to events/s, which may indicate a bug. Will dig into it later.

carsonip avatar Jun 18 '24 11:06 carsonip

Also an important point to note: with this PR the docs/s is not always equal to events/s, which may indicate a bug. Will dig into it later.

There was an uncaught "http request body too large" error caused by the receiver's default max request bytes setting. Fixed by https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32632/commits/4006cd9699ba054a2ee35e80f9e808511a9c6872

carsonip avatar Jun 24 '24 23:06 carsonip

Benchmark results compared to main with https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/33621:

I looked into the perf regression and figured out its root cause. On main, events are first encoded, then added to the bulk indexer for batching. When run with RunParallel in #33621, events in different Consume* calls are encoded in parallel. With this batch sender PR, on the other hand, events are first batched, then encoded by a single goroutine. This under-utilizes the CPU when GOMAXPROCS > 1.
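A self-contained Go sketch of the two flows (illustrative stand-ins only; encode and the slice-based "bulk indexer" are not the exporter's real code):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// encode stands in for the CPU-heavy document encoding step.
func encode(event int) string { return fmt.Sprintf(`{"event":%d}`, event) }

// onMain: each Consume* call encodes its own events before handing the docs
// to the bulk indexer, so concurrent Consume* calls encode in parallel.
func onMain(batches [][]int) []string {
	var (
		mu   sync.Mutex
		docs []string
		wg   sync.WaitGroup
	)
	for _, events := range batches {
		wg.Add(1)
		go func(events []int) {
			defer wg.Done()
			for _, e := range events {
				d := encode(e)
				mu.Lock()
				docs = append(docs, d) // bulk indexer batches already-encoded docs
				mu.Unlock()
			}
		}(events)
	}
	wg.Wait()
	return docs
}

// withBatchSender: Consume* only merges events into a batch; one flush
// goroutine then encodes the whole merged batch, leaving other CPUs idle.
func withBatchSender(batches [][]int) []string {
	var merged []int
	for _, events := range batches {
		merged = append(merged, events...) // batching happens before encoding
	}
	docs := make([]string, 0, len(merged))
	for _, e := range merged {
		docs = append(docs, encode(e)) // single-goroutine encoding
	}
	return docs
}

func main() {
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	batches := [][]int{{1, 2}, {3, 4}}
	fmt.Println(len(onMain(batches)), len(withBatchSender(batches)))
}
```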

carsonip avatar Jun 25 '24 13:06 carsonip

playing devil's advocate: do we want to fully use CPUs in the exporter side, or is there an advantage to tying up only one goroutine on encode? Are there benefits to leaving CPU on the table for other processors etc, or reduced memory usage by consolidating encoding?

codefromthecrypt avatar Jun 26 '24 00:06 codefromthecrypt

Sorry, ignore the question about reduced memory usage, as the bench already reports bytes and allocs.

codefromthecrypt avatar Jun 26 '24 00:06 codefromthecrypt

playing devil's advocate: do we want to fully use CPUs in the exporter side, or is there an advantage to tying up only one goroutine on encode? Are there benefits to leaving CPU on the table for other processors etc, or reduced memory usage by consolidating encoding?

@codefromthecrypt That's actually a very good question. I could have squeezed out every bit of CPU and performance by encoding each log record / span in a separate goroutine in parallel, which would "fix" the "perf regression" mentioned above. But the reality is that the encoding work itself is inefficient, and that is the part that actually needs to be fixed, before we try to mask the inefficiency by parallelizing inefficient encoding work.

That's also why, after discussing with the team, this PR will stay as is, and the perf improvements will come later as follow-up PRs.

carsonip avatar Jun 26 '24 10:06 carsonip

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Jul 31 '24 05:07 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions[bot] avatar Aug 15 '24 05:08 github-actions[bot]

Superseded by #34238

carsonip avatar Aug 15 '24 09:08 carsonip