[cluster-trace-microservices-v2022] corruptions and inconsistencies in the data
Hello, I am experiencing some issues with the [cluster-trace-microservices-v2022] data. Some are less critical than others, but I want to report them as thoroughly as possible.
I will keep each point brief here and follow up tomorrow with more detail on each one (the approaches I have explored to resolve them, and whether or not they have been successful).
1. CallGraph: some lines have two fields for rpc_id instead of one.
2. CallGraph: some lines are duplicated with respect to the key (traceid, service, rpc_id).
3. CallGraph: some lines have rt = None.
4. Some files, especially the MSRTMCR files, contain a lot of duplicate lines.
5. MSRTMCR: many msinstanceid values do not have their msname as a prefix.
6. MSRTMCR: some msinstanceid prefixes do not even exist as an msname anywhere in the data.
7. MSRTMCR: some pods are reported on two nodes.
8. Many pods are not present in all file types.
   (coincidence_0.csv is a file I generate that indicates, for each pod present in at least one of MSMetrics, MSRTMCR or CallGraph over a 200-minute window, i.e. MSMetrics_0 together with MSRTMCR and CallGraph files 0 to 9, which of these sources it appears in; a minimal sketch of how such a table can be built follows this list.)
9. Inconsistency in the consumption data.
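For reference, a minimal sketch of how such a coincidence table can be built (the file paths and the uminstanceid/dminstanceid column names are placeholders and may need adjusting to the actual CSV layout):

```python
import pandas as pd

# Placeholder paths and column names; adjust to the actual layout of the trace files.
msmetrics = pd.read_csv("MSMetrics/MSMetrics_0.csv", usecols=["msinstanceid"])
msrtmcr = pd.concat(
    pd.read_csv(f"MSRTMCR/MSRTMCR_{i}.csv", usecols=["msinstanceid"]) for i in range(10)
)
callgraph = pd.concat(
    pd.read_csv(f"CallGraph/CallGraph_{i}.csv", usecols=["uminstanceid", "dminstanceid"])
    for i in range(10)
)

# Which pods each source knows about.
presence = {
    "MSMetrics": set(msmetrics["msinstanceid"].dropna()),
    "MSRTMCR": set(msrtmcr["msinstanceid"].dropna()),
    "CallGraph": set(callgraph["uminstanceid"].dropna()) | set(callgraph["dminstanceid"].dropna()),
}

# One row per pod, one boolean column per source.
all_pods = sorted(set().union(*presence.values()))
coincidence = pd.DataFrame(
    {source: [pod in seen for pod in all_pods] for source, seen in presence.items()},
    index=all_pods,
)
coincidence.index.name = "msinstanceid"
coincidence.to_csv("coincidence_0.csv")
```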
Some of these problems are not very complicated to fix. For example, lines with rt = None always have their other fields set to UNKNOW or USER, and for the duplicated rpc_id it seems most consistent to ignore the line completely if both rpc_ids have the same depth, and otherwise to keep only the deeper one.
However, some are much more critical for my use, such as the inconsistencies in consumption, and the fact that most containers have their data in CallGraph, MSRTMCR or MSResource but rarely in all three.
I am trying, on my end, to correct what I can in order to make the data usable for my case. If I could get information about the causes of these problems, either from someone at Alibaba who contributed to producing this data or from someone external who has run into the same issues, it would be a great help.
Going into more detail on each point, in the same order as above:
1. There are two types of cases in which a line carries two rpc_ids.
The first is when both rpc_ids have the same depth and the call communicates with the UNKNOWN_POD_6387478 pod, which appears to be an artefact of the trace system. In this case I recommend ignoring the line completely, as this is the most consistent way to avoid duplicated rpc_ids within the same trace.
The second case is when the second rpc_id is deeper than the first; there I recommend keeping only the second, deeper one, which again avoids duplicating an rpc_id within the same trace.
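A minimal sketch of that rule, assuming the depth of an rpc_id is simply its number of dot-separated segments:

```python
def rpc_depth(rpc_id: str) -> int:
    # Depth of an rpc_id such as "0.1.2.1" is its number of dot-separated segments.
    return len(rpc_id.split("."))


def resolve_double_rpc_id(first: str, second: str) -> str | None:
    """Return the rpc_id to keep, or None if the whole line should be dropped."""
    if rpc_depth(first) == rpc_depth(second):
        return None  # same depth: ignore the line completely
    return max(first, second, key=rpc_depth)  # otherwise keep only the deeper one


# Examples: resolve_double_rpc_id("0.1", "0.1.1") returns "0.1.1",
# while resolve_double_rpc_id("0.1", "0.2") returns None.
```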
2. To manage ‘identical’ lines (same service, trace and rpc_id, very similar timestamp, etc.), you can either merge them naively or keep everything, depending on how you want to use the data.
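A naive merge could look like the sketch below, which keeps the first occurrence per (traceid, service, rpc_id) in timestamp order (the file path is a placeholder):

```python
import pandas as pd

callgraph = pd.read_csv("CallGraph/CallGraph_0.csv")  # placeholder path

# Naive merge: keep a single line per (traceid, service, rpc_id),
# here the earliest occurrence by timestamp.
deduplicated = (
    callgraph.sort_values("timestamp")
             .drop_duplicates(subset=["traceid", "service", "rpc_id"], keep="first")
)
```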
3. Lines with rt = None contain very little information, just the rpc_id, service, trace and timestamp, so they should likewise be ignored in most cases.
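Continuing from the previous sketch, dropping them is a simple filter; depending on how the file is parsed, rt may come out as NaN or as the literal string "None", so both are checked:

```python
# Keep only the lines that carry an actual rt value.
cleaned = deduplicated[
    deduplicated["rt"].notna() & (deduplicated["rt"].astype(str) != "None")
]
```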
4. Line redundancy is not critical for processing, although removing the duplicates would reduce the size of the trace downloads and the space they take up on Alibaba's servers.
5. See point 6.
6. At first it seemed logical to correct the msname based on the prefix of the msinstanceid, but given the problems that followed, I am far from certain that this approach is the most accurate.
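One way to implement that correction is sketched below; it makes no assumption about the exact separator inside msinstanceid and simply looks for a known msname that is a prefix of it, which is precisely where point 6 gets in the way:

```python
def msname_from_instanceid(msinstanceid: str, known_msnames: set[str]) -> str | None:
    """Return the longest known msname prefixing msinstanceid, or None if there is none (point 6)."""
    candidates = [name for name in known_msnames if msinstanceid.startswith(name)]
    return max(candidates, key=len) if candidates else None
```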
7. For pods that appear on two nodes in MSRTMCR, we can rely on MSMetrics to choose the node, as MSMetrics does not have this double-node problem. However, since some pods are present in MSRTMCR but not in MSMetrics, this approach is limited.
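A sketch of that limited correction; it assumes both files expose msinstanceid and a node column (called nodeid here, as a placeholder) and that MSMetrics reports a single node per pod:

```python
import pandas as pd

msmetrics = pd.read_csv("MSMetrics/MSMetrics_0.csv", usecols=["msinstanceid", "nodeid"])
msrtmcr = pd.read_csv("MSRTMCR/MSRTMCR_0.csv")

# One authoritative node per pod according to MSMetrics.
pod_to_node = msmetrics.drop_duplicates("msinstanceid").set_index("msinstanceid")["nodeid"]

# Override the node in MSRTMCR when MSMetrics knows the pod; otherwise leave it untouched.
known = msrtmcr["msinstanceid"].isin(pod_to_node.index)
msrtmcr.loc[known, "nodeid"] = msrtmcr.loc[known, "msinstanceid"].map(pod_to_node)
```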
8. See point 9.
9. These two issues (points 8 and 9) are the ones holding me back and currently preventing me from using the data correctly. I do not see any way to correct the data so that most pods are represented in the three file types concerned, nor how to correct the consumption figures so that they are consistent. If anyone has any information on this subject, please reply to this ticket.