apm-server Change instance specific dimensions in aggregated metrics

Currently, instance specific metrics are part of the transaction metrics dimensions, and can lead to time series explosion.

To reduce the number of transaction metric timeseries, we remove the instance-specific dimensions and introduce a new metric set that has instance specific dimensions but not transaction.name.

Required changes:

review dimensions and propose which ones classify as instance-specific
remove the instance dimensions from transaction metrics
create a new metric set with instance dimensions
update documentation
update apmpackage

Jul 24 '23 07:07 simitt

Existing implementation

Transaction metrics contains the dimensions of ServiceAggregationKey + ServiceInstanceAggregationKey + TransactionAggregationKey.

type ServiceAggregationKey struct {
	Timestamp           time.Time
	ServiceName         string
	ServiceEnvironment  string
	ServiceLanguageName string
	AgentName           string
}

type ServiceInstanceAggregationKey struct {
	GlobalLabelsStr string
}

type TransactionAggregationKey struct {
	TraceRoot bool

	ContainerID       string
	KubernetesPodName string

	ServiceVersion  string
	ServiceNodeName string

	ServiceRuntimeName     string
	ServiceRuntimeVersion  string
	ServiceLanguageVersion string

	HostHostname   string
	HostName       string
	HostOSPlatform string

	EventOutcome string

	TransactionName   string
	TransactionType   string
	TransactionResult string

	FAASColdstart   nullable.Bool
	FAASID          string
	FAASName        string
	FAASVersion     string
	FAASTriggerType string

	CloudProvider         string
	CloudRegion           string
	CloudAvailabilityZone string
	CloudServiceName      string
	CloudAccountID        string
	CloudAccountName      string
	CloudMachineType      string
	CloudProjectID        string
	CloudProjectName      string
}

type TransactionMetrics struct {
	Histogram *hdrhistogram.HistogramRepresentation
}

Proposal

ServiceAggregationKey and ServiceInstanceAggregationKey remains unchanged. Remove instance-specific dimensions from TransactionAggregationKey and introduce a new metric called "InstanceMetrics" (TBD, it is actually better called ServiceInstanceMetrics but it is already used in code internally or it will need to be renamed)

New TransactionAggregationKey & TransactionMetrics:

type TransactionAggregationKey struct {
	TraceRoot bool

	EventOutcome string

	TransactionName   string
	TransactionType   string
	TransactionResult string
}

type TransactionMetrics struct {
	Histogram *hdrhistogram.HistogramRepresentation
}

Newly introduced InstanceAggregationKey & InstanceMetrics:

type InstanceAggregationKey struct {
	ContainerID       string
	KubernetesPodName string

	ServiceVersion  string
	ServiceNodeName string

	ServiceRuntimeName     string
	ServiceRuntimeVersion  string
	ServiceLanguageVersion string

	HostHostname   string
	HostName       string
	HostOSPlatform string

	EventOutcome string

	FAASColdstart   nullable.Bool
	FAASID          string
	FAASName        string
	FAASVersion     string
	FAASTriggerType string

	CloudProvider         string
	CloudRegion           string
	CloudAvailabilityZone string
	CloudServiceName      string
	CloudAccountID        string
	CloudAccountName      string
	CloudMachineType      string
	CloudProjectID        string
	CloudProjectName      string
}

type InstanceMetrics struct {
	Histogram *hdrhistogram.HistogramRepresentation
}

Questions:

EventOutcome exists as a dimension in both keys. Does it sound right?
WDYT about the split?
WDYT about the name "instance metrics" / "service instance metrics"?
We could remove the existing internal ServiceInstance* in code as we deprecate global labels as dimensions, then we can rename this as "service instance metrics" if that's preferred.

cc @simitt @axw @dgieselaar

Jul 25 '23 15:07 carsonip

I think all the faas.* fields should be on transaction metrics only. @AlexanderWert would you agree with that?

I believe we still need transaction.type on the instance metrics, as the transaction type dropdown on the service overview page of the UI affects the transaction metrics of the Instances table:

Given this, I'd say the name ought to be service_instance_transaction (service_node_transaction?) metrics; they're essentially the same as service_transaction, but with instance/node dimensions.

EventOutcome exists as a dimension in both keys. Does it sound right?

If we consider the new metrics as being the same as service_transaction but with additional instance dimensions, then I think it would make sense to instead set event.success_count and leave event.outcome out of the dimensions. We would continue to have event.outcome as a dimension for transaction metrics.

Jul 26 '23 03:07 axw

I think all the faas.* fields should be on transaction metrics only. @AlexanderWert would you agree with that?

Yes, that's true.

Jul 26 '23 05:07 AlexanderWert

I remember there were some oddities with FAAS data where some of it was handled differently than regular transactions. I don't immediately remember what it was, but maybe @AlexanderWert knows? The reason why I'm bringing that up is if we want to align FAAS data with regular transaction data, maybe this is an opportunity to do so?

Jul 26 '23 06:07 dgieselaar

Shouldn't the Service* fields be a dimension on both metrics?

Jul 26 '23 06:07 simitt

I remember there were some oddities with FAAS data where some of it was handled differently than regular transactions. I don't immediately remember what it was, but maybe @AlexanderWert knows? The reason why I'm bringing that up is if we want to align FAAS data with regular transaction data, maybe this is an opportunity to do so?

For FAAS, all the metrics shown under the Metrics tab (as described here) are coming from separate FAAS metrics that are being collected by the Lambda extension (those are NOT derived from transactions).

Apart from that, we show the ColdstartRateChart instead of the Time spent by span type chart for all usual occurrences of the latter (for that only the faas.coldstart field is required on the transaction metrics).

Apart from the above:

for FAAS scenarios it's NOT really useful to filter / group data by the instance (i.e. service.node.name) as the instances come and go (and are mostly transparent to the user anyways).
depending on how customers configure their lambda extension/agent, the fields faas.id and faas.name may correspond to the transaction.name.

To sum up, I think:

all the above-mentioned faas.* fields SHOULD be part of the TransactionAggregationKey (in terms of cardinality those either correspond one to one to the service.name cardinality or the transaction.name cardinality, depending on the user's configuration)
faas.id, faas.name, faas.version MUST NOT be part of the InstanceAggregationKey (because of the initially stated goal of: "... introduce a new metric set that has instance specific dimensions but not transaction.name" and faas.id, faas.name, faas.version potentially corresponding to the transaction name)
Other faas.* fields wouldn't hurt on the InstanceAggregationKey, but at the same time I don't see value in having them there (for example, faas.coldstart would be only once true for each of the instances at the very startup of the instance). So, I'd propose to NOT include any faas.* fields on the InstanceAggregationKey.

Jul 26 '23 07:07 AlexanderWert

Revised proposal (updated after Andrew's comment)

Existing transaction metrics will be updated to have no instance-specific dimensions and instance-specific dimensions will go into a new metrics, which is called service_instance_transaction metrics.

List of changes vs last proposal:

Renaming struct name from InstanceAggregation* to ServiceInstanceTransactionAggregation*
Service* fields are added back to TransactionAggregationKey such that both metrics contain them
FAAS* fields are moved back to TransactionAggregationKey
ServiceInstanceTransaction does not have event.outcome as a dimension but will aggregate on successCount and failureCount to produce event.success_count
TransactionType field is added to ServiceInstanceTransactionAggregationKey

Updated transaction metrics

type TransactionAggregationKey struct {
	TraceRoot bool

	EventOutcome string

	ServiceVersion  string

	ServiceRuntimeName     string
	ServiceRuntimeVersion  string
	ServiceLanguageVersion string

	FAASColdstart   nullable.Bool
	FAASID          string
	FAASName        string
	FAASVersion     string
	FAASTriggerType string

	TransactionName   string
	TransactionType   string
	TransactionResult string
}

type TransactionMetrics struct {
	Histogram *hdrhistogram.HistogramRepresentation
}

Service instance transaction metrics

type ServiceInstanceTransactionAggregationKey struct {
	ContainerID       string
	KubernetesPodName string

	ServiceVersion  string
	ServiceNodeName string

	ServiceRuntimeName     string
	ServiceRuntimeVersion  string
	ServiceLanguageVersion string

	HostHostname   string
	HostName       string
	HostOSPlatform string

	CloudProvider         string
	CloudRegion           string
	CloudAvailabilityZone string
	CloudServiceName      string
	CloudAccountID        string
	CloudAccountName      string
	CloudMachineType      string
	CloudProjectID        string
	CloudProjectName      string

        TransactionType   string
}

type ServiceInstanceTransactionMetrics struct {
	Histogram *hdrhistogram.HistogramRepresentation
	failureCount        float64
	successCount        float64
}

Jul 26 '23 10:07 carsonip

ServiceNodeName is specific to a service instance (service node == service instance), so should only be in ServiceInstanceTransactionAggregationKey

Otherwise I think this looks good, thanks @carsonip

Jul 27 '23 01:07 axw

I would like to note that the changes to aggregation will be made in apm-aggregation, not existing aggregation code in apm-server. apm-server will use apm-aggregation after #11117 which is also scheduled to go into 8.10. Therefore, this task will be effectively blocked by it, although the development will happen in parallel.

Jul 27 '23 14:07 carsonip

What's the rationale behind not having the cloud metadata in transaction metrics? Are we expecting the combination of transaction name and regions/availability zones to create too high cardinality?

If users are filtering by AZ, for example, would the UI then use a combination of raw transaction docs for the transaction table and service instance transaction metrics for the instance table?

Jul 27 '23 15:07 felixbarny

@dgieselaar @gbamparop could you please review @carsonip 's proposal and address @felixbarny 's question so we can start with the implementation.

Jul 27 '23 15:07 simitt

What's the rationale behind not having the cloud metadata in transaction metrics? Are we expecting the combination of transaction name and regions/availability zones to create too high cardinality?

I didn't really think through that and assumed that cloud metadata are "instance-specific". Are you proposing that Cloud* dimensions should exist in both metrics?

Jul 27 '23 16:07 carsonip

The cloud metadata seems to be some gray area between instance specific and not instance specific.

I'd expect these pieces of metadata to be fairly static for the same application:

CloudProvider         string
CloudServiceName      string
CloudAccountID        string
CloudAccountName      string
CloudProjectID        string
CloudProjectName      string

Furthermore, I'd expect these to be low cardinality:

CloudRegion           string
CloudAvailabilityZone string

I don't know what to expect of this:

CloudMachineType      string

I think we should add the pieces of cloud metadata that turns out to be static. We should consider adding cloud metadata that is low cardinality, even if a service has lots of instances. If any piece of metadata is potentially medium to high cardinality, we shouldn't add it to the aggregation key for transaction metrics.

Jul 28 '23 05:07 felixbarny

LGTM. Personally I feel more comfortable leaving off cloud metadata. Agree that it is likely low cardinality but conceptually it feels like it belongs on service instance transaction metrics. For simplicity's sake I would not want some of the cloud metadata fields on regular tx metrics, but not all.

Jul 30 '23 05:07 dgieselaar

I'd also be more comfortable if we leave cloud metadata out for now, unless there's a clear use case for it. @felixbarny did you have any use cases in mind?

Jul 31 '23 02:07 axw

I'm good with that. It's always easier to add something to the aggregation key than to remove it.

A potential use case is if you want to compare the performance of a particular transaction group across different regions. Probably not critical. We'll still be able to do that by falling back to transaction events (although with less retention and unless tail-based sampling is uses). But I don't think that's critical. We can also compare across regions with metrics across all transaction groups.

Jul 31 '23 07:07 felixbarny

Thanks for confirming. I'll push forward with the revised proposal in my previous comment.

Jul 31 '23 07:07 carsonip

Would like a confirmation about an implementation detail about service instance transaction metrics overflow bucket. Are we also using transaction.type: _other as overflow key for service instance transaction metrics overflow? @dgieselaar WDYT?

Aug 01 '23 13:08 carsonip

@achyutjhunjhunwala do you know the answer to ^ by any chance?

Aug 01 '23 14:08 dgieselaar

Would like a confirmation about an implementation detail about service instance transaction metrics overflow bucket. Are we also using transaction.type: _other as overflow key for service instance transaction metrics overflow?

I am not sure if i understand the question correctly. We were using transaction.name: _other and service.name: _other for transaction and service metrics. @carsonip Are you proposing to use transaction.type: _other as the overflow key for this new metric given we won't have name?

Aug 01 '23 15:08 achyutjhunjhunwala

I am not sure if i understand the question correctly. We were using transaction.name: _other and service.name: _other for transaction and service metrics. @carsonip Are you proposing to use transaction.type: _other as the overflow key for this new metric given we won't have name?

Correct, that's just a proposal. Wanted to see if you have any problem with it.

Aug 01 '23 15:08 carsonip

@carsonip I am not 100% confident if that's the right key to use for overflow, because transaction.type is a low cardinality field which means chances of overflow is very narrow.

Aug 02 '23 08:08 achyutjhunjhunwala

@gbamparop mentioned in https://github.com/elastic/kibana/issues/162392#issuecomment-1662364252 that the APM UI counterpart may miss 8.10.

Aug 02 '23 16:08 carsonip

@carsonip is this the final version that has been implemented?

Aug 30 '23 13:08 gbamparop

@carsonip is this the final version that has been implemented?

@gbamparop The changes never got into 8.10 / main. They are still in a now-outdated draft PR: https://github.com/elastic/apm-aggregation/pull/43

The PR will have to be updated (most likely re-implemented due to all the conflicting changes) when the work is scheduled again.

Aug 30 '23 14:08 carsonip

Update: opened https://github.com/elastic/apm-aggregation/pull/103 to reimplement service instance transaction metrics

Sep 06 '23 13:09 kruskall

Discovered an issue with overflow metrics. Should be fixed by https://github.com/elastic/apm-aggregation/pull/106

Sep 12 '23 00:09 kruskall

As seen in the change here by @kruskall , transaction.type: _other is used to denote the overflow bucket of service instance transaction. @achyutjhunjhunwala I know we haven't completed the discussion above, so just bringing it up again to make sure APM UI is fine with this decision.

Sep 12 '23 10:09 carsonip

Overflow issue fixed and the broken test is solved. The PR is working and ready for review: https://github.com/elastic/apm-server/pull/11614

Sep 12 '23 18:09 kruskall

@carsonip I am not 100% sure if transaction.type: _other is the way to go. As i mentioned previously, transaction.type is not a high cardinality field.

We will be querying this metrics to display the Instance table on the Service Overview page like this

This makes me think, should overflow not happen at HostName ?

cc: @dgieselaar

Sep 13 '23 08:09 achyutjhunjhunwala