
[Bug] Incorrect Timestamp used in BanyanDB

Open wu-sheng opened this issue 2 years ago • 5 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

Apache SkyWalking Component

BanyanDB (apache/skywalking-banyandb)

What happened

BanyanDB uses the measure/stream receiving timestamp as the data timestamp. At the query stage, there are several cases in which you cannot read the data correctly.

  1. OAP has a 20-30s delay when building bulk data, so a T1-T10 metrics query can never read the T10 metrics correctly, because the latest T10 value is only written at T10 + N seconds.
  2. Similar to <1>, traces and logs are persisted with a delay, so you can never get the correct result. More importantly, since the UI shows the start time of the traces/logs, you would see traces and logs that fall outside the query condition. This would be treated as a bug.

What you expected to happen

The OAP persistence kernel should assign the timestamp used for persistence, rather than relying on the default (receiving) timestamp.

How to reproduce

Generate a trace/metric/log, write it into BanyanDB several minutes later, and then query within the trace's running duration.

Anything else

No response

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

wu-sheng · Nov 23 '22 01:11

BanyanDB leverages the following timestamp fields, defined in the writing spec, to identify the data.

  1. The metrics use a "time bucket" to set this field (see the sketch at the end of this comment).
  2. The trace segment's timestamp should be picked up.

We need more investigation into this issue.
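For illustration, here is a minimal Go sketch of what deriving the data timestamp from the time bucket could look like, assuming the usual SkyWalking minute-bucket layout (yyyyMMddHHmm); the helper is hypothetical and not actual OAP/BanyanDB code.

```go
package main

import (
	"fmt"
	"time"
)

// timeBucketToMillis converts a SkyWalking-style minute time bucket
// (layout yyyyMMddHHmm, e.g. 202211230105) into epoch milliseconds,
// which could then be carried by the data point instead of the
// measure/stream receiving time. Illustrative helper only.
func timeBucketToMillis(bucket int64) (int64, error) {
	t, err := time.Parse("200601021504", fmt.Sprintf("%012d", bucket))
	if err != nil {
		return 0, err
	}
	return t.UnixMilli(), nil
}

func main() {
	ms, err := timeBucketToMillis(202211230105)
	if err != nil {
		panic(err)
	}
	fmt.Println(ms) // timestamp taken from the bucket, not from the write time
}
```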

hanahmily · Nov 23 '22 02:11

There are at least two conclusions from today's discussion:

  1. A timestamp should be added as a field of Record in the OAP. All record implementations should provide an accurate timestamp. @wankai123
  2. The BanyanDB server should maintain a snowflake-style ID generator. A record ID should be composed of record name + timestamp (not time bucket) + server-generated suffix. The server-generated suffix should be held per sharding ID within each time window of several seconds/minutes (a sketch follows at the end of this comment).

The reason for using a rolling time window is to avoid conflicts when a new record carries an old timestamp and the suffix ID has just rolled back to 0.

  • Trace ID-1, Service A, Timestamp-1, suffix-0
  • Trace ID-2, Service A, Timestamp-2, suffix-1
  • Trace ID-3, Service A, Timestamp-3, suffix-2 ...
  • Trace ID-1000, Service A, Timestamp-1000, suffix-1000
  • Trace ID-1001, Service A, Timestamp-1, suffix-0

Trace ID-1 would be unexpectedly overridden by Trace ID-1001. The suffix generator should make sure it does not roll back to zero within the given time window. Also, the time window should not be too small; otherwise, holding many generators creates too much overhead.
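A minimal Go sketch of the suffix-generator idea under these assumptions: one generator per shard and an illustrative five-minute window; the names and ID layout are hypothetical, not the actual BanyanDB design.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// suffixGenerator sketches the proposal above: within one time window the
// suffix only moves forward, so a late record that repeats an old timestamp
// can never be handed suffix 0 again and silently overwrite an earlier
// record with the same name + timestamp.
type suffixGenerator struct {
	mu          sync.Mutex
	windowStart int64 // unix seconds at which the current window began
	windowLen   int64 // window length in seconds; should exceed the worst write delay
	counter     uint64
}

func (g *suffixGenerator) next(now time.Time) uint64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	if now.Unix()-g.windowStart >= g.windowLen {
		// Only after a whole window has elapsed is it safe to restart from 0,
		// because no record older than the window is expected to arrive.
		g.windowStart = now.Unix()
		g.counter = 0
	}
	g.counter++
	return g.counter
}

// recordID composes "record name + timestamp + server-generated suffix".
func recordID(name string, ts time.Time, suffix uint64) string {
	return fmt.Sprintf("%s-%d-%d", name, ts.UnixMilli(), suffix)
}

func main() {
	gen := &suffixGenerator{windowLen: 300} // 5-minute window, illustrative
	now := time.Now()
	fmt.Println(recordID("segment", now, gen.next(now)))
	fmt.Println(recordID("segment", now, gen.next(now)))
}
```

The design point is simply that the counter resets only at a window boundary, never while late writers may still replay old timestamps.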

wu-sheng · Nov 23 '22 06:11

Now, the column-based engine has fixed the problem mentioned above in a different way:

The Measure engine now assigns a version number to each part, so every bulk flush from OAP is identified by this version. The version number for T10 will be larger than the one for T1; in case of duplicate data points, T10 takes precedence and T1 is ignored. Deduplication happens during querying and merging.
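A rough sketch of that version-based deduplication rule, with simplified types; this is illustrative only, not the actual BanyanDB merge code.

```go
package main

import "fmt"

// dataPoint is a simplified stand-in for a measure data point read from one
// part during query/merge; version identifies the bulk flush that wrote it.
type dataPoint struct {
	seriesID  string
	timestamp int64
	version   int64
	value     float64
}

type seriesTime struct {
	seriesID  string
	timestamp int64
}

// dedupLatest keeps, for each (seriesID, timestamp), the point coming from
// the part with the highest version, so a later flush (e.g. T10) wins over
// an earlier one (T1).
func dedupLatest(points []dataPoint) []dataPoint {
	latest := make(map[seriesTime]dataPoint)
	for _, p := range points {
		k := seriesTime{p.seriesID, p.timestamp}
		if cur, ok := latest[k]; !ok || p.version > cur.version {
			latest[k] = p
		}
	}
	out := make([]dataPoint, 0, len(latest))
	for _, p := range latest {
		out = append(out, p)
	}
	return out
}

func main() {
	merged := dedupLatest([]dataPoint{
		{"service_cpm", 1000, 1, 10}, // written by the earlier flush
		{"service_cpm", 1000, 2, 12}, // rewritten by a later flush; higher version wins
	})
	fmt.Println(merged)
}
```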

The stream engine now has the capability to store duplicated data. If two elements with identical timestamps are present in the same series, they will both be saved and can be retrieved by the client. Currently, we do not support any deduplication technology (although some other database systems do offer this feature during querying). However, we may consider adding support for it in the future.

@wu-sheng Would you review the solution?

hanahmily · Jan 03 '24 22:01

@hanahmily I think it is fine to pick the latest data for a specific data point. Is this solution mainly for measure data? Repeated writes for a given data point should only happen for measure data, right?

wu-sheng · Jan 04 '24 00:01

@hanahmily I think it is fine to pick the latest data for a specific data point. Is this solution mainly for measure data? Repeated writes for a given data point should only happen for measure data, right?

Exactly. We only deduplicate data for measures for now. As far as I know, repeated data only happens when measure data is flushed within the same time bucket.

hanahmily · Jan 04 '24 00:01

Since the column-based engine has been merged, I am closing this issue.

hanahmily · Feb 24 '24 05:02