[BanyanDB] Expose Performance Monitoring Metrics for BanyanDB Server
Description
This issue outlines the tasks for exposing metrics to monitor the performance of BanyanDB. The metrics are organized into the following categories:
- [x] 1. Runtime: Includes the number of goroutines, heap allocations and objects, heap idle and in-use memory, garbage collection metrics (e.g., NumGC, PauseTotalNs, PauseNs, and LastGC), and the Go version.
- [ ] 2. System: Covers CPU usage and information, virtual and swap memory statistics, disk I/O and usage details, network statistics (bytes sent/received, packets, errors, etc.), process information (PID, name, status), host information (hostname, OS, platform, etc.), and system load averages.
- [x] 3. Traffic: Encompasses latency, throughput, and error occurrences of the gRPC server.
- [ ] 4. Query: Features latency, throughput, error occurrences, alive tasks, metadata access statistics (scanning operations and term matching count), and index access statistics (latency, throughput, and error occurrences of the index).
- [x] 5. Write: Consists of latency, throughput, and error occurrences.
- [ ] 6. Storage: Involves buffer idle and in-use memory, flush metrics (e.g., flush failures, latency, and flush count), LSM alive compaction tasks and the number of gets, LSM bloom filter hits, local/global index build counts, and metadata database statistics (series count, etc.).
Latency metrics should use a histogram as the metric type.
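As a starting point for the Runtime category and the histogram-based latency metric, here is a minimal sketch. It assumes, purely for illustration, a Prometheus-style exposition via github.com/prometheus/client_golang; BanyanDB's own observability module may expose these differently, and the metric names below are hypothetical.

```go
package observability

import (
	"net/http"
	"runtime"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric names; BanyanDB's meter module may use different ones.
var (
	goroutines = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "banyandb_runtime_goroutines", Help: "Number of goroutines.",
	})
	heapAlloc = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "banyandb_runtime_heap_alloc_bytes", Help: "Bytes of allocated heap objects.",
	})
	heapInuse = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "banyandb_runtime_heap_inuse_bytes", Help: "Bytes of heap in use.",
	})
	gcPauseTotal = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "banyandb_runtime_gc_pause_total_ns", Help: "Cumulative GC pause time.",
	})
	writeLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "banyandb_write_latency_seconds",
		Help:    "Write latency, recorded as a histogram.",
		Buckets: prometheus.DefBuckets,
	})
)

func init() {
	prometheus.MustRegister(goroutines, heapAlloc, heapInuse, gcPauseTotal, writeLatency)
}

// collectRuntime samples the Go runtime on a fixed interval.
func collectRuntime(interval time.Duration) {
	for range time.Tick(interval) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		goroutines.Set(float64(runtime.NumGoroutine()))
		heapAlloc.Set(float64(m.HeapAlloc))
		heapInuse.Set(float64(m.HeapInuse))
		gcPauseTotal.Set(float64(m.PauseTotalNs))
	}
}

// ObserveWrite records one write's latency into the histogram.
func ObserveWrite(start time.Time) {
	writeLatency.Observe(time.Since(start).Seconds())
}

// Serve exposes the metrics endpoint, e.g. on :2121/metrics.
func Serve(addr string) error {
	go collectRuntime(15 * time.Second)
	http.Handle("/metrics", promhttp.Handler())
	return http.ListenAndServe(addr, nil)
}
```

Note that client_golang's built-in collectors.NewGoCollector() already covers most of the Runtime category out of the box, so a hand-rolled sampler like the one above would only be needed if the project sticks to its own meter abstraction.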
About <4>, it would be interesting if we could build tracing for queries, including the latency of every phase of the query execution plan.
@lujiajing1126 do you have any thoughts, particularly regarding the Query catalog?
More specifically, this tracing could be internal-only (distributed) tracing. Going further, we could also build tracing from the OAP (persistent worker only) into the BanyanDB cluster.
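To make the per-phase idea concrete, here is a minimal, hypothetical sketch of recording the latency of each phase of an execution plan; the actual tracing design is the subject of #10561, and none of the type or function names below are existing BanyanDB APIs.

```go
package query

import "time"

// phaseSpan records how long a single stage of the execution plan took.
// All names here are hypothetical and only illustrate the idea.
type phaseSpan struct {
	Name     string
	Duration time.Duration
}

// trace accumulates the spans of one query execution.
type trace struct {
	Spans []phaseSpan
}

// startPhase returns a closure that finishes the phase when called.
func (t *trace) startPhase(name string) func() {
	start := time.Now()
	return func() {
		t.Spans = append(t.Spans, phaseSpan{Name: name, Duration: time.Since(start)})
	}
}

// Example usage inside a query executor:
//
//	var t trace
//	done := t.startPhase("index-scan")
//	// ... run the index scan ...
//	done()
```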
#10561 proposes introducing a tracing system to dive into the query process, which may overlap with the current issue. However, I suggest adding some basic metrics here. The goal of this particular issue is to support the first phase of stress testing, which focuses on write performance and reliability.
When the server becomes stable, we can consider using the internal tracing path, because relying on an unstable BanyanDB server for problem diagnosis is not feasible.
In https://github.com/apache/skywalking/issues/10561, we will introduce fine-grained metrics for every stage of the query.
If necessary, for this issue, we may add some coarse-grained metrics for the whole execution. Percentiles (e.g. P99) are sufficient to provide an overview of query performance.
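A minimal sketch of such a coarse-grained, whole-execution metric is shown below. It assumes, for illustration only, a Prometheus-style summary from github.com/prometheus/client_golang with pre-configured quantiles (including P99); the metric name and the wiring into the query path are hypothetical.

```go
package query

import "github.com/prometheus/client_golang/prometheus"

// queryLatency tracks the end-to-end latency of a whole query execution and
// reports pre-computed quantiles (P50/P95/P99). The metric name is hypothetical.
var queryLatency = prometheus.NewSummary(prometheus.SummaryOpts{
	Name:       "banyandb_query_latency_seconds",
	Help:       "End-to-end latency of a query execution.",
	Objectives: map[float64]float64{0.5: 0.05, 0.95: 0.01, 0.99: 0.001},
})

func init() { prometheus.MustRegister(queryLatency) }

// timeQuery wraps an execution function and observes its duration.
func timeQuery(exec func() error) error {
	timer := prometheus.NewTimer(queryLatency)
	defer timer.ObserveDuration()
	return exec()
}
```

If a histogram is preferred instead (as suggested above for latency metrics), the same percentiles can be derived from the buckets at query time, e.g. with PromQL's histogram_quantile.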
Can I work on a particular subtask from this issue? I am thinking of going with the first one, as I recently worked with the Go runtime package.
Nice to see you are interested in this task. You could pick up the query-relevant metrics.
Sure, I can see the system subtask is also unmarked. Can I take a look at that as well?
@achintya-7 Do you have any updates to share?
I am a bit busy with some work. I'll take a look and ask for assignment of the issue when I'm free. Sorry for any confusion.