apm Support array of strings as a value for tags/labels

Description of the issue

Currently tags/labels are stored as key/value pairs in ES, however, we have seen users expecting tags to be stored as an array of strings, this is consistent with ECS-style tags.

I'm proposing to support array of strings for adding tags/labels both on the APM server as well as the agent APIs.

The alternative to this approach is to have both labels for key/values and tags for arrays but since supporting just labels (with support for arrays) is more generic, IMO, we should implement the proposed solution.

An example for JavaScript:

apm.addTags({'teams': ['team1','team2']})

What we are voting on

Is this an acceptable solution for agents/apm server? cc @elastic/apm-agent-devs , @elastic/apm-server

Vote

Team	Yes	No	Indifferent	N/A
.NET	[ ]	[ ]	[ ]	[ ]
Go	[x]	[ ]	[ ]	[ ]
Java	[ ]	[ ]	[ ]	[ ]
Node.js	[ ]	[ ]	[ ]	[ ]
Python	[ ]	[ ]	[ ]	[ ]
Ruby	[ ]	[ ]	[ ]	[ ]
RUM	[ ]	[ ]	[ ]	[ ]
APM Server	[ ]	[ ]	[ ]	[ ]

Apr 30 '19 10:04 hmdhk

I don't think that this is compatible with ECS:

{
  "tags": {
    "teams": ["team1","team2"]
  }
}

Not sure about that version:

{
  "labels": {
    "teams": ["team1","team2"]
  }
}

I would argue it's not because it's not a key/value pair. But it's probably debatable if non-scalar values are allowed.

However, this version would be compatible with ECS:

{
  "tags": ["team1","team2"]
}

This version would also solve the specific use case this has been requested for, which is to consistently tag documents (not only in APM) with team names to make access rules based on that.

Apr 30 '19 11:04 felixbarny

@felixbarny , I think this is the same to what I called "alternative approach" in the issue description. So to be clear the second approach is to support both labels for key/values and tags for string arrays, is that what you meant?

It seems redundant to me to support both, but we definitely can do that as well!

Apr 30 '19 11:04 hmdhk

@ruflin Do you know about prior art regarding labels with multiple values? Is that something you'd consider legal from an ECS perspective?

Apr 30 '19 13:04 felixbarny

@felixbarny I would consider the following "legal" from an ECS perspective:

{
  "labels": {
    "teams": ["team1","team2"]
  }
}

The reason is that any field in Elasticsearch (Lucene) can be/is an array. Even user.id: [12, 17] would be valid. It does not change the mapping type.

What I would consider "illegal" (😆 ) from an ECS perspective is:

{
  "tags": {
    "teams": ["team1","team2"]
  }
}

This changes the mapping type of ECS.

Apr 30 '19 14:04 ruflin

I agree with @felixbarny and @ruflin that the suggested way of storing tags is not ECS compatible. When the argument is consistency with ECS, I don't see why having both, labels and tags in the Elasticsearch docs should be an issue. Afaik the problem is that currently the agents offer add_tags (potentially deprecated) for what we treat as labels in the ES documents (from 7.0 on).

May 03 '19 11:05 simitt

I'm assuming that we will at some point rename anything with "tags" to "labels" on the agent side, i.e. apm.addTags({'teams': ['team1','team2']}) -> apm.addLabels({'teams': ['team1','team2']}) as discussed here. This would generate the following payload which is valid from ECS perspective:

{
  "labels": {
    "teams": ["team1","team2"]
  }
}

I'm ok with supporting both Tags and Labels as well, however, supporting ECS style tags is a breaking (if we keep the same name e.g. addTags) since the agent API should accept a different data type.

We can take the following path: We add labels to the APM server intake spec with array type support (this is not a breaking change). Agents can use labels in their payloads after that's released, this is a breaking change for the agent since the new agent won't work with older APM servers. APM server would change the tags definition for the intake to array of strings since this is a breaking change for the APM server I assume we can only do this in version 8.0, after this point agents can start accepting/sending ECS style tags to APM server.

May 07 '19 09:05 hmdhk

If I recall correctly, the reason why we renamed tags to labels was that tags in ECS were just an array of values with no keys, whereas our definition of tags was "key: value". For that format, the labels in ECS was a better fit - hence the rename.

If labels support that one label key can have multiple values stored as an array, we could technically add support for it, but I'm not sure what the use-case for our users would be? I find having multiple label values for the same label key confusing 🤔

May 07 '19 09:05 watson

No suggestion yet, but let's separate terminology and review how agents, server and ES behave at the moment.

Agents and Server currently only process and index key:value pairs where values can be of type string, boolean, integer. Those key:value pairs are currently referred to as tags, in some cases also as labels:

Agents nodejs:

apm.setTag, apm.addTags (deprecated)
apm.setLabel, apm.addLabels

python:

elasticapm.tag, elasticapm.capture_span(tags={})

ruby:

ElasticAPM.set_tag

rum:

apm.addTags

go

transaction.Context.SetTag

java

transaction.addTag, span.addTag (deprecated)
transaction.addLabel, span.addLabel

.Net

transaction.Tags, span.Tags

APM Server Intake API

*.context.tags
metadata.labels

ES

labels, according to top level labels in ECS

This means that the nodejs and java agent currently offer adding labels while all other agents have an API to add tags. The server only uses the term tags within context, but labels for metadata, describing the same data structure. In ES >=7.0 we store the data as labels, in 6.x as tags.

May 22 '19 07:05 simitt

@simitt According to #42, it was my understanding that all agents were supposed to rename tags to labels. This is why we did it in the Node.js agent. The old tags API is just an alias for the new labels API and only kept around for backward compatible reasons (will be removed in the next major)

May 22 '19 07:05 watson

What @watson said (replace Node.js with Java)

May 22 '19 07:05 eyalkoren

I meant to give a neutral summary of where we are right now, as I found the arguments on what is used where and what would be a breaking change hard to follow. (No need for explanation or defense for already changing to labels or sticking to tags).

May 22 '19 10:05 simitt

If labels support that one label key can have multiple values stored as an array, we could technically add support for it, but I'm not sure what the use-case for our users would be? I find having multiple label values for the same label key confusing

Maybe this helps: We have a Java web application which is based on the quartz job engine and those jobs read data from different sources and based on that data generates Jobs to transfer them to different target systems. For performance reasons some usecases do not process single jobs - instead we process batches of jobs. To be able to trace performance problems we are storing the information about the jobs in labels: The label "job_id" contains a single numeric Job ID for jobs which process one job at a time. The label "job_list" contains the list of Job IDs for jobs with batch features. As it is currently not possible to store arrays of values we are currently using a String value with the comma separated list of job ids.

Having an array here would be a cleaner solution.

Jul 26 '19 12:07 wolframhaussig

@webmat We discussed this issue in our agents meeting yesterday. We were wondering if you could comment on the compatibility of the proposed format for labels with ECS?

@simitt Would supporting the format mentioned in this comment entail a lot of work on the server side?

Jan 29 '20 13:01 estolfo

I concur with @ruflin, nesting keys under tags doesn't work with ECS' mapping, which identifies the tags field as keyword. So "tags": ["foo", "bar"] should be the format.

In recent months there's been an increasing demand to clearly identify which ECS fields are expected to be arrays. Even if it's not necessary for Elasticsearch itself (there's no way to identify that in a mapping), it's helpful to clarify that expectation for consumers of the data, as well as for libraries that are mapping the schema to various programming languages (see https://github.com/elastic/ecs-logging). A PR for this is in progress (elastic/ecs#727), and labels hadn't come up yet. I'm on the fence on marking labels as one of the array fields, as that would make it "mandatory" that all subfields should be normalized and expected to be arrays. This would be surprising to users. On the other hand, not marking it as an array field still lets users cram arrays in there, and Elasticsearch will deal well with this. Since they're by definition custom fields, it's on users to know which of their custom labels fields are arrays and which aren't.

Sorry for the unexpected wall of text here :-) But long story short, supporting arrays in labels fields sounds fine to me, but I'm not convinced we should make that official for all subfields via the above ECS PR. Does that make sense?

Another point: the mapping for labels is object with object_type: keyword (the type of the subfields). This forces ES to interpret whatever we throw at it as keyword/strings. Was APM leaving this mapping dynamic?

Jan 30 '20 16:01 webmat

Makes sense Mat.

In APM, label values can be of type string, Boolean, or number. This allows, for example, to create graphs based on label values.

Any plans to support that with ECS?

Jan 30 '20 17:01 felixbarny

The current approach of having labels be object_type: keyword predates my involvement in the project. I always assumed it forced the mapping type of subkeys. Looking at the ES object datatype, it doesn't have an object_type property.

The reality is that both the sample ECS Elasticsearch template and the Beats templates I've looked at have this field as

"labels" : {
    "type" : "object"
}

So I confirmed the behaviour of subkeys of labels, when passing in numerics or even an object. And it's indeed a dynamic mapping. You can test this out via one of the beats mappings this way:

POST _bulk
{ "index" : { "_index" : "filebeat-7.5.1-labels-test" } }
{ "labels": { "label1": 42 } }
{ "index" : { "_index" : "filebeat-7.5.1-labels-test" } }
{ "labels": { "label2": { "foo": "bar" } } }

The mapping generated is

        "labels" : {
          "properties" : {
            "label1" : {
              "type" : "long"
            },
            "label2" : {
              "properties" : {
                "foo" : {
                  "type" : "keyword"
                }
              }
            }
          }
        },

So in effect, the label values can already be numerics or booleans (or even nested objects), provided the values the right format, in the document's labels section. 🤦‍♂ (or 🎉 ?)

Jan 31 '20 14:01 webmat

FWIW, this is the labels definition form APM Server:

https://github.com/elastic/apm-server/blob/2359a46408b558ae88d07dff3f4eb73618c2e0d2/_meta/fields.common.yml#L121-L131

Which gets converted to this dynamic template:

"dynamic_templates" : [
  {
    "labels" : {
      "path_match" : "labels.*",
      "mapping" : {
        "type" : "keyword"
      },
      "match_mapping_type" : "string"
    }
  },
  {
    "labels" : {
      "path_match" : "labels.*",
      "mapping" : {
        "type" : "boolean"
      },
      "match_mapping_type" : "boolean"
    }
  },
  {
    "labels" : {
      "path_match" : "labels.*",
      "mapping" : {
        "scaling_factor" : 1000000,
        "type" : "scaled_float"
      },
      "match_mapping_type" : "*"
    }
  },
  ...
]

The APM Server also makes sure that the labels values can only be of type string, boolean or number, by validating against a JSON schema. See https://github.com/elastic/apm-server/blob/38a5aa3fe20b8eb828d751e81bcf0b07d4a06d7a/docs/spec/metadata.json#L32-L34 and https://github.com/elastic/apm-server/blob/38a5aa3fe20b8eb828d751e81bcf0b07d4a06d7a/docs/spec/tags.json#L6-L11

Jan 31 '20 15:01 felixbarny

@simitt Would supporting the format mentioned in this comment entail a lot of work on the server side?

Sorry for the late reply. Adding support for arrays of string, boolean and numbers on the APM Server side would be relatively straight forward.

Are agent devs considering to release this support as major version change? If an agent sending an array of values the request would be rejected by an older APM Server. So this is not a breaking change for the APM Server, as it still works with older agents, but I think it would need to be for the agents. The server cannot simply ignore payloads with arrays, as the field and the validation against it already exists in older versions.

Feb 03 '20 08:02 simitt

When agents check the version of the APM Server there is no Breaking change. But for the RUM agent, that might not be possible.

Feb 03 '20 09:02 felixbarny

@felixbarny what would you do if the user sets an array for a label, but the server version doesn't support it? Only sending the first value seems dangerous e.g. for the security use case lined out above, as is completely ignoring that label.

Feb 03 '20 09:02 beniwohli

There's currently an API for Span#addLabel(String key, String value). I wouldn't even change the API to accommodate for multiple values. Users can just call addLabel with the same key multiple times. Depending on the APM Server version, only the last one or all of the values would be sent.

If it's critical for users that all values get sent, they just have to ensure that they are running on a certain version of APM Server. We can also log a warning when discarding values due to a version mismatch.

Feb 03 '20 13:02 felixbarny

Hi @felixbarny,

I am not sure, if the following should work in the python client,

from elasticapm import label as apmLabel
apmLabel(teams='team1')
apmLabel(teams='team2')
apmLabel(teams='team3')

I am using python apm client version ==6.0.0 and the apm server version 7.10.2 But after executing this, I get

...
"labels" : {
            "teams" : "team3"
          },
...

An I doing something wrong, or it is not working as intended .... Thanks for your time and support.

Mar 03 '21 08:03 gk-patel

I found a work-around, by adding an extra APM pipeline, which splits required fields. So from the python code, I send "team1,team2,team3" and then the pipeline splits it into an array before storing it.

Mar 05 '21 07:03 gk-patel