apm
apm copied to clipboard
Support array of strings as a value for tags/labels
Description of the issue
Currently tags/labels are stored as key/value pairs in ES, however, we have seen users expecting tags to be stored as an array of strings, this is consistent with ECS-style tags.
I'm proposing to support array of strings for adding tags/labels both on the APM server as well as the agent APIs.
The alternative to this approach is to have both labels for key/values and tags for arrays but since supporting just labels (with support for arrays) is more generic, IMO, we should implement the proposed solution.
An example for JavaScript:
apm.addTags({'teams': ['team1','team2']})
What we are voting on
Is this an acceptable solution for agents/apm server? cc @elastic/apm-agent-devs , @elastic/apm-server
Vote
Team | Yes | No | Indifferent | N/A | Link to issue |
---|---|---|---|---|---|
.NET |
|
|
|
|
|
Go |
|
|
|
|
|
Java |
|
|
|
|
|
Node.js |
|
|
|
|
|
Python |
|
|
|
|
|
Ruby |
|
|
|
|
|
RUM |
|
|
|
|
|
APM Server |
|
|
|
|
I don't think that this is compatible with ECS:
{
"tags": {
"teams": ["team1","team2"]
}
}
Not sure about that version:
{
"labels": {
"teams": ["team1","team2"]
}
}
I would argue it's not because it's not a key/value pair. But it's probably debatable if non-scalar values are allowed.
However, this version would be compatible with ECS:
{
"tags": ["team1","team2"]
}
This version would also solve the specific use case this has been requested for, which is to consistently tag
documents (not only in APM) with team names to make access rules based on that.
@felixbarny , I think this is the same to what I called "alternative approach" in the issue description. So to be clear the second approach is to support both labels for key/values and tags for string arrays, is that what you meant?
It seems redundant to me to support both, but we definitely can do that as well!
@ruflin Do you know about prior art regarding labels with multiple values? Is that something you'd consider legal from an ECS perspective?
@felixbarny I would consider the following "legal" from an ECS perspective:
{
"labels": {
"teams": ["team1","team2"]
}
}
The reason is that any field in Elasticsearch (Lucene) can be/is an array. Even user.id: [12, 17]
would be valid. It does not change the mapping type.
What I would consider "illegal" (😆 ) from an ECS perspective is:
{
"tags": {
"teams": ["team1","team2"]
}
}
This changes the mapping type of ECS.
I agree with @felixbarny and @ruflin that the suggested way of storing tags is not ECS compatible. When the argument is consistency with ECS, I don't see why having both, labels
and tags
in the Elasticsearch docs should be an issue.
Afaik the problem is that currently the agents offer add_tags
(potentially deprecated) for what we treat as labels
in the ES documents (from 7.0 on).
I'm assuming that we will at some point rename anything with "tags" to "labels" on the agent side, i.e. apm.addTags({'teams': ['team1','team2']})
-> apm.addLabels({'teams': ['team1','team2']})
as discussed here. This would generate the following payload which is valid from ECS perspective:
{
"labels": {
"teams": ["team1","team2"]
}
}
I'm ok with supporting both Tags and Labels as well, however, supporting ECS style tags is a breaking (if we keep the same name e.g. addTags
) since the agent API should accept a different data type.
We can take the following path: We add labels
to the APM server intake spec with array type support (this is not a breaking change). Agents can use labels
in their payloads after that's released, this is a breaking change for the agent since the new agent won't work with older APM servers. APM server would change the tags
definition for the intake to array of strings since this is a breaking change for the APM server I assume we can only do this in version 8.0, after this point agents can start accepting/sending ECS style tags to APM server.
If I recall correctly, the reason why we renamed tags to labels was that tags in ECS were just an array of values with no keys, whereas our definition of tags was "key: value". For that format, the labels in ECS was a better fit - hence the rename.
If labels support that one label key can have multiple values stored as an array, we could technically add support for it, but I'm not sure what the use-case for our users would be? I find having multiple label values for the same label key confusing 🤔
No suggestion yet, but let's separate terminology and review how agents, server and ES behave at the moment.
Agents and Server currently only process and index key:value
pairs where values can be of type string, boolean, integer
. Those key:value pairs are currently referred to as tags
, in some cases also as labels
:
Agents nodejs:
-
apm.setTag
,apm.addTags
(deprecated) -
apm.setLabel
,apm.addLabels
python:
-
elasticapm.tag
,elasticapm.capture_span(tags={})
ruby:
-
ElasticAPM.set_tag
rum:
-
apm.addTags
go
-
transaction.Context.SetTag
java
-
transaction.addTag
,span.addTag
(deprecated) -
transaction.addLabel
,span.addLabel
.Net
-
transaction.Tags
,span.Tags
APM Server Intake API
-
*.context.tags
-
metadata.labels
ES
-
labels
, according to top level labels in ECS
This means that the nodejs
and java
agent currently offer adding labels
while all other agents have an API to add tags
. The server only uses the term tags
within context
, but labels
for metadata
, describing the same data structure. In ES >=7.0
we store the data as labels
, in 6.x
as tags
.
@simitt According to #42, it was my understanding that all agents were supposed to rename tags to labels. This is why we did it in the Node.js agent. The old tags API is just an alias for the new labels API and only kept around for backward compatible reasons (will be removed in the next major)
What @watson said (replace Node.js
with Java
)
I meant to give a neutral summary of where we are right now, as I found the arguments on what is used where and what would be a breaking change hard to follow. (No need for explanation or defense for already changing to labels
or sticking to tags
).
If labels support that one label key can have multiple values stored as an array, we could technically add support for it, but I'm not sure what the use-case for our users would be? I find having multiple label values for the same label key confusing
Maybe this helps: We have a Java web application which is based on the quartz job engine and those jobs read data from different sources and based on that data generates Jobs to transfer them to different target systems. For performance reasons some usecases do not process single jobs - instead we process batches of jobs. To be able to trace performance problems we are storing the information about the jobs in labels: The label "job_id" contains a single numeric Job ID for jobs which process one job at a time. The label "job_list" contains the list of Job IDs for jobs with batch features. As it is currently not possible to store arrays of values we are currently using a String value with the comma separated list of job ids.
Having an array here would be a cleaner solution.
@webmat We discussed this issue in our agents meeting yesterday. We were wondering if you could comment on the compatibility of the proposed format for labels
with ECS?
@simitt Would supporting the format mentioned in this comment entail a lot of work on the server side?
I concur with @ruflin, nesting keys under tags
doesn't work with ECS' mapping, which identifies the tags
field as keyword
. So "tags": ["foo", "bar"]
should be the format.
In recent months there's been an increasing demand to clearly identify which ECS fields are expected to be arrays. Even if it's not necessary for Elasticsearch itself (there's no way to identify that in a mapping), it's helpful to clarify that expectation for consumers of the data, as well as for libraries that are mapping the schema to various programming languages (see https://github.com/elastic/ecs-logging). A PR for this is in progress (elastic/ecs#727), and labels
hadn't come up yet. I'm on the fence on marking labels
as one of the array fields, as that would make it "mandatory" that all subfields should be normalized and expected to be arrays. This would be surprising to users. On the other hand, not marking it as an array field still lets users cram arrays in there, and Elasticsearch will deal well with this. Since they're by definition custom fields, it's on users to know which of their custom labels fields are arrays and which aren't.
Sorry for the unexpected wall of text here :-) But long story short, supporting arrays in labels
fields sounds fine to me, but I'm not convinced we should make that official for all subfields via the above ECS PR. Does that make sense?
Another point: the mapping for labels
is object
with object_type: keyword
(the type of the subfields). This forces ES to interpret whatever we throw at it as keyword/strings. Was APM leaving this mapping dynamic?
Makes sense Mat.
In APM, label values can be of type string, Boolean, or number. This allows, for example, to create graphs based on label values.
Any plans to support that with ECS?
The current approach of having labels be object_type: keyword
predates my involvement in the project. I always assumed it forced the mapping type of subkeys. Looking at the ES object
datatype, it doesn't have an object_type
property.
The reality is that both the sample ECS Elasticsearch template and the Beats templates I've looked at have this field as
"labels" : {
"type" : "object"
}
So I confirmed the behaviour of subkeys of labels
, when passing in numerics or even an object. And it's indeed a dynamic mapping. You can test this out via one of the beats mappings this way:
POST _bulk
{ "index" : { "_index" : "filebeat-7.5.1-labels-test" } }
{ "labels": { "label1": 42 } }
{ "index" : { "_index" : "filebeat-7.5.1-labels-test" } }
{ "labels": { "label2": { "foo": "bar" } } }
The mapping generated is
"labels" : {
"properties" : {
"label1" : {
"type" : "long"
},
"label2" : {
"properties" : {
"foo" : {
"type" : "keyword"
}
}
}
}
},
So in effect, the label values can already be numerics or booleans (or even nested objects), provided the values the right format, in the document's labels
section. 🤦♂ (or 🎉 ?)
FWIW, this is the labels
definition form APM Server:
https://github.com/elastic/apm-server/blob/2359a46408b558ae88d07dff3f4eb73618c2e0d2/_meta/fields.common.yml#L121-L131
Which gets converted to this dynamic template:
"dynamic_templates" : [
{
"labels" : {
"path_match" : "labels.*",
"mapping" : {
"type" : "keyword"
},
"match_mapping_type" : "string"
}
},
{
"labels" : {
"path_match" : "labels.*",
"mapping" : {
"type" : "boolean"
},
"match_mapping_type" : "boolean"
}
},
{
"labels" : {
"path_match" : "labels.*",
"mapping" : {
"scaling_factor" : 1000000,
"type" : "scaled_float"
},
"match_mapping_type" : "*"
}
},
...
]
The APM Server also makes sure that the labels
values can only be of type string, boolean or number, by validating against a JSON schema. See https://github.com/elastic/apm-server/blob/38a5aa3fe20b8eb828d751e81bcf0b07d4a06d7a/docs/spec/metadata.json#L32-L34 and https://github.com/elastic/apm-server/blob/38a5aa3fe20b8eb828d751e81bcf0b07d4a06d7a/docs/spec/tags.json#L6-L11
@simitt Would supporting the format mentioned in this comment entail a lot of work on the server side?
Sorry for the late reply. Adding support for arrays of string, boolean and numbers on the APM Server side would be relatively straight forward.
Are agent devs considering to release this support as major version change? If an agent sending an array of values the request would be rejected by an older APM Server. So this is not a breaking change for the APM Server, as it still works with older agents, but I think it would need to be for the agents. The server cannot simply ignore payloads with arrays, as the field and the validation against it already exists in older versions.
When agents check the version of the APM Server there is no Breaking change. But for the RUM agent, that might not be possible.
@felixbarny what would you do if the user sets an array for a label, but the server version doesn't support it? Only sending the first value seems dangerous e.g. for the security use case lined out above, as is completely ignoring that label.
There's currently an API for Span#addLabel(String key, String value)
. I wouldn't even change the API to accommodate for multiple values. Users can just call addLabel
with the same key multiple times. Depending on the APM Server version, only the last one or all of the values would be sent.
If it's critical for users that all values get sent, they just have to ensure that they are running on a certain version of APM Server. We can also log a warning when discarding values due to a version mismatch.
Hi @felixbarny,
I am not sure, if the following should work in the python client,
from elasticapm import label as apmLabel
apmLabel(teams='team1')
apmLabel(teams='team2')
apmLabel(teams='team3')
I am using python apm client version ==6.0.0 and the apm server version 7.10.2 But after executing this, I get
...
"labels" : {
"teams" : "team3"
},
...
An I doing something wrong, or it is not working as intended .... Thanks for your time and support.
I found a work-around, by adding an extra APM pipeline, which splits required fields. So from the python code, I send "team1,team2,team3" and then the pipeline splits it into an array before storing it.