[BanyanDB] Expose Performance Monitoring Metrics for BanyanDB Server

Open hanahmily opened this issue 1 year ago • 10 comments

Description

This issue is created to outline tasks associated with exposing metrics for monitoring the performance of BanyanDB. These metrics are organized into several categories:

  • [x] 1. Runtime: Includes the number of goroutines, heap allocations and objects, heap idle and in-use memory, garbage collection metrics (e.g., NumGC, PauseTotalNs, PauseNs, and LastGC), and the Go version.
  • [ ] 2. System: Covers CPU usage and information, virtual and swap memory statistics, disk I/O and usage details, network statistics (bytes sent/received, packets, errors, etc.), process information (PID, name, status), host information (hostname, OS, platform, etc.), and system load averages.
  • [x] 3. Traffic: Encompasses latency, throughput, and error occurrences of the gRPC server.
  • [ ] 4. Query: Features latency, throughput, error occurrences, alive tasks, metadata access statistics (scanning operations and term matching count), and index access statistics (latency, throughput, and error occurrences of the index).
  • [x] 5. Write: Consists of latency, throughput, and error occurrences.
  • [ ] 6. Storage: Involves buffer idle and in-use memory, flush metrics (e.g., flush failures, latency, and flush count), LSM alive compaction tasks and the number of gets, LSM bloom filter hits, local/global index build counts, and metadata database statistics (series count, etc.).
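As an illustration of category 1, the runtime values listed above can all be read from Go's standard `runtime` package. The sketch below is not BanyanDB code: `RuntimeSnapshot` and `CollectRuntime` are hypothetical names chosen for this example, and how the values would be exported (metric names, registry) is left open.

```go
package main

import (
	"fmt"
	"runtime"
)

// RuntimeSnapshot is a hypothetical container (not a BanyanDB type) for the
// runtime values named in category 1.
type RuntimeSnapshot struct {
	Goroutines   int
	HeapAlloc    uint64 // bytes of allocated heap objects
	HeapObjects  uint64 // number of allocated heap objects
	HeapIdle     uint64 // bytes in idle heap spans
	HeapInuse    uint64 // bytes in in-use heap spans
	NumGC        uint32 // completed GC cycles
	PauseTotalNs uint64 // cumulative GC pause time
	LastPauseNs  uint64 // most recent GC pause (0 if no GC has run yet)
	LastGC       uint64 // end time of the last GC, ns since the Unix epoch
	GoVersion    string
}

// CollectRuntime reads the current values from the Go runtime.
// Note that runtime.ReadMemStats briefly stops the world, so it should be
// called on a scrape interval rather than on a hot path.
func CollectRuntime() RuntimeSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return RuntimeSnapshot{
		Goroutines:   runtime.NumGoroutine(),
		HeapAlloc:    m.HeapAlloc,
		HeapObjects:  m.HeapObjects,
		HeapIdle:     m.HeapIdle,
		HeapInuse:    m.HeapInuse,
		NumGC:        m.NumGC,
		PauseTotalNs: m.PauseTotalNs,
		// PauseNs is a circular buffer; this index holds the latest pause.
		LastPauseNs: m.PauseNs[(m.NumGC+255)%256],
		LastGC:      m.LastGC,
		GoVersion:   runtime.Version(),
	}
}

func main() {
	fmt.Printf("%+v\n", CollectRuntime())
}
```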

The latency metric should utilize a histogram as the metric type.

hanahmily avatar Apr 04 '23 12:04 hanahmily

About <4>, it would be interesting if we could build tracing for queries that includes the latency of every phase of the query execution plan.

wu-sheng avatar Apr 04 '23 12:04 wu-sheng

@lujiajing1126 Do you have any thoughts, particularly regarding the Query category?

hanahmily avatar Apr 04 '23 12:04 hanahmily

More specifically, this could be internal-only (distributed) tracing. Going further, we could build tracing from the OAP (persistent worker only) to the BanyanDB cluster.

wu-sheng avatar Apr 04 '23 12:04 wu-sheng

> About <4>, it would be interesting if we could build tracing for queries that includes the latency of every phase of the query execution plan.

#10561 proposes introducing a tracing system to dive into the query process, which may overlap with the current issue. However, I suggest adding some basic metrics here. The goal of this particular issue is to support the first phase of stress testing, which focuses on write performance and reliability.

Once the server becomes stable, we can consider using the internal tracing path. Relying on an unstable BanyanDB server to diagnose its own problems is not feasible.

hanahmily avatar Apr 04 '23 12:04 hanahmily

> @lujiajing1126 Do you have any thoughts, particularly regarding the Query category?

In https://github.com/apache/skywalking/issues/10561, we will introduce fine-grained metrics for every stage of the query.

If necessary, for this issue, we may add some coarse-grained metrics for the whole execution. Percentiles (e.g., P99) are sufficient to provide an overview of query performance.

lujiajing1126 avatar Apr 04 '23 13:04 lujiajing1126

Can I work on a particular subtask from this issue? I am thinking of going with the 1st one, as I recently worked with the Go runtime package.

achintya-7 avatar May 20 '23 09:05 achintya-7

> Can I work on a particular subtask from this issue? I am thinking of going with the 1st one, as I recently worked with the Go runtime package.

Nice to see you are interested in this task. You could pick up the query-relevant metrics.

hanahmily avatar May 20 '23 10:05 hanahmily

> > Can I work on a particular subtask from this issue? I am thinking of going with the 1st one, as I recently worked with the Go runtime package.
>
> Nice to see you are interested in this task. You could pick up the query-relevant metrics.

Sure, I can see the system subtask is also unmarked. Can I take a look at that as well?

achintya-7 avatar May 20 '23 15:05 achintya-7

@achintya-7 Do you have any updates to share?

hanahmily avatar Jun 07 '23 03:06 hanahmily

> @achintya-7 Do you have any updates to share?

I am a bit busy with some work. I'll take a look and ask to be assigned the issue when I'm free. Sorry for any confusion.

achintya-7 avatar Jun 08 '23 15:06 achintya-7