Change instance specific dimensions in aggregated metrics
Currently, instance-specific dimensions are part of the transaction metrics dimensions, and they can lead to a time series explosion.
To reduce the number of transaction metric time series, we remove the instance-specific dimensions and introduce a new metric set that has the instance-specific dimensions but not transaction.name.
Required changes:
- review dimensions and propose which ones classify as instance-specific
- remove the instance dimensions from transaction metrics
- create a new metric set with instance dimensions
- update documentation
- update apmpackage
Existing implementation
Transaction metrics currently contain the dimensions of ServiceAggregationKey + ServiceInstanceAggregationKey + TransactionAggregationKey:
```go
type ServiceAggregationKey struct {
	Timestamp           time.Time
	ServiceName         string
	ServiceEnvironment  string
	ServiceLanguageName string
	AgentName           string
}

type ServiceInstanceAggregationKey struct {
	GlobalLabelsStr string
}

type TransactionAggregationKey struct {
	TraceRoot              bool
	ContainerID            string
	KubernetesPodName      string
	ServiceVersion         string
	ServiceNodeName        string
	ServiceRuntimeName     string
	ServiceRuntimeVersion  string
	ServiceLanguageVersion string
	HostHostname           string
	HostName               string
	HostOSPlatform         string
	EventOutcome           string
	TransactionName        string
	TransactionType        string
	TransactionResult      string
	FAASColdstart          nullable.Bool
	FAASID                 string
	FAASName               string
	FAASVersion            string
	FAASTriggerType        string
	CloudProvider          string
	CloudRegion            string
	CloudAvailabilityZone  string
	CloudServiceName       string
	CloudAccountID         string
	CloudAccountName       string
	CloudMachineType       string
	CloudProjectID         string
	CloudProjectName       string
}

type TransactionMetrics struct {
	Histogram *hdrhistogram.HistogramRepresentation
}
```
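To make the cardinality problem concrete, here is a minimal sketch (simplified, with hypothetical key fields and helper names, not the actual apm-server code) of how combining instance-specific and transaction dimensions in a single key multiplies the number of time series:

```go
package main

import "fmt"

// aggregationKey is a simplified stand-in for the combined
// ServiceAggregationKey + ServiceInstanceAggregationKey +
// TransactionAggregationKey (hypothetical, not the real code).
type aggregationKey struct {
	ServiceName     string
	ContainerID     string // instance-specific dimension
	TransactionName string
}

// countSeries returns how many distinct time series result when
// every container reports every transaction group.
func countSeries(containers, txNames []string) int {
	series := make(map[aggregationKey]int)
	for _, c := range containers {
		for _, tx := range txNames {
			series[aggregationKey{"my-svc", c, tx}]++
		}
	}
	return len(series)
}

func main() {
	// Two containers x two transaction groups -> 4 series instead of 2.
	fmt.Println(countSeries(
		[]string{"c-1", "c-2"},
		[]string{"GET /users", "GET /orders"},
	))
}
```

Every new instance dimension value multiplies the series count for the whole transaction group set, which is the explosion this proposal removes.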
Proposal
ServiceAggregationKey and ServiceInstanceAggregationKey remain unchanged. Remove the instance-specific dimensions from TransactionAggregationKey and introduce a new metric set called "InstanceMetrics" (name TBD; "ServiceInstanceMetrics" would fit better, but that name is already used internally in code and would need to be renamed first).
New TransactionAggregationKey & TransactionMetrics:
```go
type TransactionAggregationKey struct {
	TraceRoot         bool
	EventOutcome      string
	TransactionName   string
	TransactionType   string
	TransactionResult string
}

type TransactionMetrics struct {
	Histogram *hdrhistogram.HistogramRepresentation
}
```
Newly introduced InstanceAggregationKey & InstanceMetrics:
```go
type InstanceAggregationKey struct {
	ContainerID            string
	KubernetesPodName      string
	ServiceVersion         string
	ServiceNodeName        string
	ServiceRuntimeName     string
	ServiceRuntimeVersion  string
	ServiceLanguageVersion string
	HostHostname           string
	HostName               string
	HostOSPlatform         string
	EventOutcome           string
	FAASColdstart          nullable.Bool
	FAASID                 string
	FAASName               string
	FAASVersion            string
	FAASTriggerType        string
	CloudProvider          string
	CloudRegion            string
	CloudAvailabilityZone  string
	CloudServiceName       string
	CloudAccountID         string
	CloudAccountName       string
	CloudMachineType       string
	CloudProjectID         string
	CloudProjectName       string
}

type InstanceMetrics struct {
	Histogram *hdrhistogram.HistogramRepresentation
}
```
Questions:
- EventOutcome exists as a dimension in both keys. Does that sound right?
- WDYT about the split?
- WDYT about the name "instance metrics" / "service instance metrics"?
- We could remove the existing internal ServiceInstance* code as we deprecate global labels as dimensions; then we could rename this to "service instance metrics" if that's preferred.
cc @simitt @axw @dgieselaar
I think all the faas.* fields should be on transaction metrics only. @AlexanderWert would you agree with that?
I believe we still need transaction.type on the instance metrics, as the transaction type dropdown on the service overview page of the UI affects the transaction metrics of the Instances table.
Given this, I'd say the name ought to be service_instance_transaction (service_node_transaction?) metrics; they're essentially the same as service_transaction, but with instance/node dimensions.
> EventOutcome exists as a dimension in both keys. Does it sound right?
If we consider the new metrics as being the same as service_transaction but with additional instance dimensions, then I think it would make sense to instead set event.success_count and leave event.outcome out of the dimensions. We would continue to have event.outcome as a dimension for transaction metrics.
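A minimal sketch of that idea (hypothetical names; any real change would live in the aggregation code): instead of keying on event.outcome, each event's outcome folds into success/failure counters on a single series per key.

```go
package main

import "fmt"

// sitMetrics mirrors the proposed service instance transaction metrics:
// event.outcome is not a dimension; outcomes fold into counters instead.
// (Sketch only; field names follow the proposal in this thread.)
type sitMetrics struct {
	successCount float64
	failureCount float64
}

// record accumulates one transaction event's outcome, so all outcomes
// share a single time series per aggregation key.
func record(m *sitMetrics, outcome string) {
	switch outcome {
	case "success":
		m.successCount++
	case "failure":
		m.failureCount++
	}
	// Events with outcome "unknown" contribute to neither counter.
}

func main() {
	var m sitMetrics
	for _, o := range []string{"success", "success", "failure", "unknown"} {
		record(&m, o)
	}
	// The counters would be published as event.success_count.
	fmt.Println(m.successCount, m.failureCount) // 2 1
}
```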
> I think all the faas.* fields should be on transaction metrics only. @AlexanderWert would you agree with that?
Yes, that's true.
I remember there were some oddities with FAAS data where some of it was handled differently than regular transactions. I don't immediately remember what it was, but maybe @AlexanderWert knows? The reason why I'm bringing that up is if we want to align FAAS data with regular transaction data, maybe this is an opportunity to do so?
Shouldn't the Service* fields be a dimension on both metrics?
> I remember there were some oddities with FAAS data where some of it was handled differently than regular transactions. I don't immediately remember what it was, but maybe @AlexanderWert knows? The reason why I'm bringing that up is if we want to align FAAS data with regular transaction data, maybe this is an opportunity to do so?
For FAAS, all the metrics shown under the Metrics tab (as described here) are coming from separate FAAS metrics that are being collected by the Lambda extension (those are NOT derived from transactions).
Apart from that, we show the ColdstartRateChart instead of the Time spent by span type chart for all usual occurrences of the latter (for that only the faas.coldstart field is required on the transaction metrics).
Apart from the above:
- For FAAS scenarios it's NOT really useful to filter / group data by the instance (i.e. `service.node.name`), as the instances come and go (and are mostly transparent to the user anyway).
- Depending on how customers configure their Lambda extension/agent, the fields `faas.id` and `faas.name` may correspond to the `transaction.name`.
To sum up, I think:
- All the above-mentioned `faas.*` fields SHOULD be part of the `TransactionAggregationKey` (in terms of cardinality they correspond one-to-one to either the `service.name` cardinality or the `transaction.name` cardinality, depending on the user's configuration).
- `faas.id`, `faas.name`, and `faas.version` MUST NOT be part of the `InstanceAggregationKey` (because of the initially stated goal of "... introduce a new metric set that has instance specific dimensions but not transaction.name", and `faas.id`, `faas.name`, `faas.version` potentially corresponding to the transaction name).
- Other `faas.*` fields wouldn't hurt on the `InstanceAggregationKey`, but at the same time I don't see value in having them there (for example, `faas.coldstart` would only be true once per instance, at the very startup of the instance). So I'd propose to NOT include any `faas.*` fields on the `InstanceAggregationKey`.
Revised proposal (updated after Andrew's comment)
Existing transaction metrics will be updated to have no instance-specific dimensions, and the instance-specific dimensions will go into a new metric set called service_instance_transaction metrics.
List of changes vs last proposal:
- Structs are renamed from `InstanceAggregation*` to `ServiceInstanceTransactionAggregation*`
- `Service*` fields are added back to TransactionAggregationKey, so that both metric sets contain them
- `FAAS*` fields are moved back to TransactionAggregationKey
- ServiceInstanceTransaction does not have event.outcome as a dimension, but will aggregate successCount and failureCount to produce event.success_count
- TransactionType field is added to ServiceInstanceTransactionAggregationKey
Updated transaction metrics
```go
type TransactionAggregationKey struct {
	TraceRoot              bool
	EventOutcome           string
	ServiceVersion         string
	ServiceRuntimeName     string
	ServiceRuntimeVersion  string
	ServiceLanguageVersion string
	FAASColdstart          nullable.Bool
	FAASID                 string
	FAASName               string
	FAASVersion            string
	FAASTriggerType        string
	TransactionName        string
	TransactionType        string
	TransactionResult      string
}

type TransactionMetrics struct {
	Histogram *hdrhistogram.HistogramRepresentation
}
```
Service instance transaction metrics
```go
type ServiceInstanceTransactionAggregationKey struct {
	ContainerID            string
	KubernetesPodName      string
	ServiceVersion         string
	ServiceNodeName        string
	ServiceRuntimeName     string
	ServiceRuntimeVersion  string
	ServiceLanguageVersion string
	HostHostname           string
	HostName               string
	HostOSPlatform         string
	CloudProvider          string
	CloudRegion            string
	CloudAvailabilityZone  string
	CloudServiceName       string
	CloudAccountID         string
	CloudAccountName       string
	CloudMachineType       string
	CloudProjectID         string
	CloudProjectName       string
	TransactionType        string
}

type ServiceInstanceTransactionMetrics struct {
	Histogram    *hdrhistogram.HistogramRepresentation
	failureCount float64
	successCount float64
}
```
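To illustrate the split, here is a sketch (simplified, with a hypothetical subset of fields and helper names) of how one transaction event would fan out into the two aggregation keys under this revised proposal: transaction.name goes only into the transaction key, instance dimensions only into the service instance transaction key, and transaction.type into both.

```go
package main

import "fmt"

// txEvent carries a simplified subset of the event fields involved.
type txEvent struct {
	TransactionName string
	TransactionType string
	ServiceNodeName string
}

// txKey stands in for the revised TransactionAggregationKey.
type txKey struct {
	TransactionName string
	TransactionType string
}

// sitKey stands in for ServiceInstanceTransactionAggregationKey;
// TransactionType is kept so the UI's type dropdown still filters it.
type sitKey struct {
	ServiceNodeName string
	TransactionType string
}

// keysFor derives both aggregation keys from one event, so the event
// contributes to one series in each metric set.
func keysFor(e txEvent) (txKey, sitKey) {
	return txKey{e.TransactionName, e.TransactionType},
		sitKey{e.ServiceNodeName, e.TransactionType}
}

func main() {
	tk, sk := keysFor(txEvent{"GET /users", "request", "node-1"})
	fmt.Printf("%+v %+v\n", tk, sk)
}
```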
ServiceNodeName is specific to a service instance (service node == service instance), so it should only be in ServiceInstanceTransactionAggregationKey.
Otherwise I think this looks good, thanks @carsonip
I would like to note that the changes to aggregation will be made in apm-aggregation, not the existing aggregation code in apm-server. apm-server will use apm-aggregation after #11117, which is also scheduled to go into 8.10. This task is therefore effectively blocked by it, although development will happen in parallel.
What's the rationale behind not having the cloud metadata in transaction metrics? Are we expecting the combination of transaction name and regions/availability zones to create too high cardinality?
If users are filtering by AZ, for example, would the UI then use a combination of raw transaction docs for the transaction table and service instance transaction metrics for the instance table?
@dgieselaar @gbamparop could you please review @carsonip 's proposal and address @felixbarny 's question so we can start with the implementation.
> What's the rationale behind not having the cloud metadata in transaction metrics? Are we expecting the combination of transaction name and regions/availability zones to create too high cardinality?
I didn't really think that through and assumed that cloud metadata is "instance-specific". Are you proposing that Cloud* dimensions should exist in both metrics?
The cloud metadata seems to be a gray area between instance-specific and not instance-specific.
I'd expect these pieces of metadata to be fairly static for the same application:
```go
CloudProvider    string
CloudServiceName string
CloudAccountID   string
CloudAccountName string
CloudProjectID   string
CloudProjectName string
```
Furthermore, I'd expect these to be low cardinality:
```go
CloudRegion           string
CloudAvailabilityZone string
```
I don't know what to expect of this:
```go
CloudMachineType string
```
I think we should add the pieces of cloud metadata that turn out to be static, and consider adding cloud metadata that stays low cardinality even when a service has lots of instances. If any piece of metadata is potentially medium-to-high cardinality, we shouldn't add it to the aggregation key for transaction metrics.
LGTM. Personally I feel more comfortable leaving off cloud metadata. I agree it is likely low cardinality, but conceptually it feels like it belongs on service instance transaction metrics. For simplicity's sake, I also would not want to have some of the cloud metadata fields on regular transaction metrics but not others.
I'd also be more comfortable if we leave cloud metadata out for now, unless there's a clear use case for it. @felixbarny did you have any use cases in mind?
I'm good with that. It's always easier to add something to the aggregation key than to remove it.
A potential use case is comparing the performance of a particular transaction group across different regions. Probably not critical: we'll still be able to do that by falling back to transaction events (although with less retention, and unless tail-based sampling is used), and we can also compare across regions with metrics aggregated across all transaction groups.
Thanks for confirming. I'll push forward with the revised proposal in my previous comment.
I'd like confirmation on an implementation detail of the service instance transaction metrics overflow bucket. Are we also using transaction.type: _other as the overflow key for service instance transaction metrics overflow? @dgieselaar WDYT?
@achyutjhunjhunwala do you know the answer to ^ by any chance?
> Would like a confirmation about an implementation detail about service instance transaction metrics overflow bucket. Are we also using transaction.type: _other as overflow key for service instance transaction metrics overflow?
I am not sure if I understand the question correctly. We were using transaction.name: _other and service.name: _other for transaction and service metrics. @carsonip Are you proposing to use transaction.type: _other as the overflow key for this new metric, given we won't have transaction.name?
> I am not sure if I understand the question correctly. We were using transaction.name: _other and service.name: _other for transaction and service metrics. @carsonip Are you proposing to use transaction.type: _other as the overflow key for this new metric given we won't have name?
Correct, that's just a proposal. Wanted to see if you have any problem with it.
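To make the proposal concrete, here is a sketch (with a tiny made-up limit and a hypothetical helper, not the apm-aggregation implementation) of the overflow behaviour being discussed: once the group limit is reached, new service instance transaction groups collapse into a shared bucket with transaction.type set to _other.

```go
package main

import "fmt"

const maxGroups = 2 // tiny limit for illustration only

// sitKey is a simplified stand-in for the service instance
// transaction aggregation key.
type sitKey struct {
	ServiceNodeName string
	TransactionType string
}

// bucketFor returns the key to aggregate under: the event's own key
// while it already exists or capacity remains, otherwise the shared
// "_other" overflow bucket, as proposed above.
func bucketFor(groups map[sitKey]int, k sitKey) sitKey {
	if _, ok := groups[k]; ok || len(groups) < maxGroups {
		return k
	}
	return sitKey{ServiceNodeName: k.ServiceNodeName, TransactionType: "_other"}
}

func main() {
	groups := make(map[sitKey]int)
	for _, t := range []string{"request", "messaging", "task"} {
		groups[bucketFor(groups, sitKey{"node-1", t})]++
	}
	// The third transaction type overflowed into the "_other" bucket.
	fmt.Println(groups[sitKey{"node-1", "_other"}]) // 1
}
```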
@carsonip I am not 100% confident that's the right key to use for overflow, because transaction.type is a low-cardinality field, which means the chance of overflow is very low.
@gbamparop mentioned in https://github.com/elastic/kibana/issues/162392#issuecomment-1662364252 that the APM UI counterpart may miss 8.10.
@carsonip is this the final version that has been implemented?
> @carsonip is this the final version that has been implemented?
@gbamparop The changes never got into 8.10 / main. They are still in a now-outdated draft PR: https://github.com/elastic/apm-aggregation/pull/43
The PR will have to be updated (most likely re-implemented due to all the conflicting changes) when the work is scheduled again.
Update: opened https://github.com/elastic/apm-aggregation/pull/103 to reimplement service instance transaction metrics
Discovered an issue with overflow metrics. Should be fixed by https://github.com/elastic/apm-aggregation/pull/106
As seen in the change here by @kruskall , transaction.type: _other is used to denote the overflow bucket of service instance transaction. @achyutjhunjhunwala I know we haven't completed the discussion above, so just bringing it up again to make sure APM UI is fine with this decision.
The overflow issue is fixed and the broken test is resolved. The PR is working and ready for review: https://github.com/elastic/apm-server/pull/11614
@carsonip I am not 100% sure if transaction.type: _other is the way to go. As I mentioned previously, transaction.type is not a high-cardinality field.
We will be querying these metrics to display the Instances table on the Service Overview page.
This makes me think: should the overflow not happen on HostName instead?
cc: @dgieselaar