Index span.db.statement as text
The db statement is really useful, but it would be even more useful if we could search on it.
So by adding this mapping:
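For reference, this is the mapping I mean — the same one applied later in this thread (index name copied from the search below; substitute your own):

```json
PUT apm-7.15.1-span-000001/_mapping
{
  "properties": {
    "span": {
      "properties": {
        "db": {
          "properties": {
            "statement": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}
```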

We should be able to search for spans which have a certain db query structure.

One gotcha I seem to be hitting in this test is that I can't search for jvm or heap. Seems like maybe I'm running into an "ignore_above", but I didn't think those were supposed to kick in on text fields.
Either way, I imagine it'd be quite useful to do things like highlight traces that involved specific query structures (indices, tables, etc).
@matschaffer thanks for opening.
Originally this field was intentionally not indexed, to avoid incurring high storage cost. I would like to revisit this decision, ideally with some data to show the cost is worthwhile.
Would match_only_text be reasonable for your needs?
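Something like this (just a sketch; index name copied from the search above):

```json
PUT apm-7.15.1-span-000001/_mapping
{
  "properties": {
    "span": {
      "properties": {
        "db": {
          "properties": {
            "statement": {
              "type": "match_only_text"
            }
          }
        }
      }
    }
  }
}
```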
For the purpose I had above, definitely.
Though I could imagine cases where the ordering of words in the statement is important. For example, if it were a SQL query, you might be interested in whether the table appeared before or after the join.
Just conjecture on that though.
As for the cost, have we gotten hard numbers on with/without? Seems like maybe we could add the mapping to a large index and check before/after. I'm sure ESS internal clusters have lots of large indices we could borrow for such a purpose.
As for the cost, have we gotten hard numbers on with/without?
No, that's really the only thing holding us back at the moment.
Cool. I was able to simply add the mapping to get the functionality. I didn't check on-disk size before/after, but I could. Is there a particular disk number you think would be most useful? Is the store size from https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html sufficient, do you think?
I think indexing and store are most interesting. It would also be helpful to know the proportion of docs that have a span.db.statement field.
I did a little poking at this today. I think we can probably get the ratio with:
GET apm-7.15.1-span-000001/_search
{
  "size": 0,
  "aggs": {
    "wrapper": {
      "terms": {
        "field": "_index",
        "size": 10
      },
      "aggs": {
        "with_db_statement": {
          "filter": {
            "exists": {
              "field": "span.db.statement"
            }
          }
        },
        "total": {
          "filter": {
            "match": {
              "_index": "*"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "numerator": "with_db_statement>_count",
              "denominator": "total>_count"
            },
            "script": "params.numerator / params.denominator"
          }
        }
      }
    }
  }
}
And the store sizing info with:
GET apm-7.15.1-span-000001/_stats/indexing,store
Here's one from the us-east-1 staging metrics cluster.
Before:
GET apm-7.15.0-span-000048/_stats/indexing,store
{
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "store" : {
        "size_in_bytes" : 3665198005,
        "total_data_set_size_in_bytes" : 3665198005,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    },
    "total" : {
      "store" : {
        "size_in_bytes" : 7330877145,
        "total_data_set_size_in_bytes" : 7330877145,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    }
  },
  "indices" : {
    "apm-7.15.0-span-000048" : {
      "uuid" : "KMHvD2_mQ9OYg2Eli4cq9w",
      "primaries" : {
        "store" : {
          "size_in_bytes" : 3665198005,
          "total_data_set_size_in_bytes" : 3665198005,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      },
      "total" : {
        "store" : {
          "size_in_bytes" : 7330877145,
          "total_data_set_size_in_bytes" : 7330877145,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      }
    }
  }
}
Mapping:
PUT apm-7.15.0-span-000048/_mapping
{
  "properties": {
    "span": {
      "properties": {
        "db": {
          "properties": {
            "statement": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}
After:
GET apm-7.15.0-span-000048/_stats/indexing,store
{
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "store" : {
        "size_in_bytes" : 3665198005,
        "total_data_set_size_in_bytes" : 3665198005,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    },
    "total" : {
      "store" : {
        "size_in_bytes" : 7330877145,
        "total_data_set_size_in_bytes" : 7330877145,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    }
  },
  "indices" : {
    "apm-7.15.0-span-000048" : {
      "uuid" : "KMHvD2_mQ9OYg2Eli4cq9w",
      "primaries" : {
        "store" : {
          "size_in_bytes" : 3665198005,
          "total_data_set_size_in_bytes" : 3665198005,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      },
      "total" : {
        "store" : {
          "size_in_bytes" : 7330877145,
          "total_data_set_size_in_bytes" : 7330877145,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      }
    }
  }
}
Ratio:
GET apm-7.15.0-span-000048/_search
Request:
{
  "size": 0,
  "aggs": {
    "wrapper": {
      "terms": {
        "field": "_index",
        "size": 10
      },
      "aggs": {
        "with_db_statement": {
          "filter": {
            "exists": {
              "field": "span.db.statement"
            }
          }
        },
        "total": {
          "filter": {
            "match": {
              "_index": "*"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "numerator": "with_db_statement>_count",
              "denominator": "total>_count"
            },
            "script": "params.numerator / params.denominator"
          }
        }
      }
    }
  }
}
Sadly it looks like this index had no db statement spans 😆
I'll see if I can dig around for a sample apm span index with more db statements around.
Wonder if it might make sense to use @dgieselaar 's apm synthtrace to just generate a bunch of random db statements as part of the span data. Not as nice as "real" but looks like our biggest "real" data source doesn't record db statements at all today.
Sadly it looks like this index had no db statement spans 😆
@matschaffer oh, bummer. Thanks for digging into it anyway.
Wonder if it might make sense to use @dgieselaar 's apm synthtrace to just generate a bunch of random db statements as part of the span data. Not as nice as "real" but looks like our biggest "real" data source doesn't record db statements at all today.
++
I'd like to use synthtrace (some kind of generator) to generate APM events, pass them through apm-server, and then use the resulting docs in Rally (https://github.com/elastic/apm-server/issues/6115). Then we can measure the impact of mapping changes like this one, while keeping the results in sync with changes in how apm-server structures docs.
Btw, from 7.15 you can get the actual disk usage per field, which is really useful for testing mapping changes. https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-disk-usage.html
Also wanted to note, match_only_text does support phrase queries, it'll just be slower since it needs to look at _source to verify ordering.
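For example, a phrase query like this would still work against a match_only_text field, just with the slower _source-based position verification (sketch; index name and phrase are illustrative):

```json
GET apm-7.15.1-span-000001/_search
{
  "query": {
    "match_phrase": {
      "span.db.statement": "JOIN users"
    }
  }
}
```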
I wanted to note that span.db.statement may contain sensitive information.
@tobiasstadler we already store span.db.statement in the Elasticsearch document, it's just that we don't index it for fast searching. It's already possible to search the field using runtime fields by defining a runtime field in the search request, so I think the only difference would be the speed of the search. Is there something else about indexing it that increases security risk?
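A sketch of what that runtime-field search might look like (field name and script are illustrative, and this will be slow since the script reads _source for every document considered):

```json
GET apm-7.15.1-span-000001/_search
{
  "runtime_mappings": {
    "db_statement_runtime": {
      "type": "keyword",
      "script": {
        "source": "def s = params._source.span?.db?.statement; if (s != null) emit(s)"
      }
    }
  },
  "query": {
    "wildcard": {
      "db_statement_runtime": "*jvm*"
    }
  }
}
```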
No, I don't think so. Sorry, I didn't know/had forgotten that the value is stored already.
No worries! Thanks for raising your concern anyway.
I did the following in our test cluster:
PUT /apm-7.15.0-span-000001/_mapping
{
  "properties": {
    "span": {
      "properties": {
        "db": {
          "properties": {
            "statement": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}
and
POST /apm-7.15.0-span-000001/_update_by_query
Now
POST /apm-7.15.0-span-000001/_disk_usage?run_expensive_tasks=true
results in
...
"span.db.statement" : {
  "total" : "49.9mb",
  "total_in_bytes" : 52365481,
  "inverted_index" : {
    "total" : "35mb",
    "total_in_bytes" : 36720077
  },
  "stored_fields" : "0b",
  "stored_fields_in_bytes" : 0,
  "doc_values" : "0b",
  "doc_values_in_bytes" : 0,
  "points" : "0b",
  "points_in_bytes" : 0,
  "norms" : "14.9mb",
  "norms_in_bytes" : 15645404,
  "term_vectors" : "0b",
  "term_vectors_in_bytes" : 0
},
...
apm-7.15.0-span-000001 has 13610386 docs with span.db.statement.
I hope this helps.
Thanks @tobiasstadler! Those numbers suggest we have nothing to be concerned about, storage wise. It would be helpful to know what the indexing overhead is (I expect it's also fine).
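To put those numbers in perspective, a quick back-of-the-envelope calculation using the figures from the _disk_usage output and the doc count above:

```python
# Figures copied from the _disk_usage output and doc count reported above.
total_bytes = 52_365_481          # "span.db.statement" total_in_bytes
inverted_index_bytes = 36_720_077
norms_bytes = 15_645_404
docs_with_statement = 13_610_386

# The inverted index and norms account for the entire total.
assert inverted_index_bytes + norms_bytes == total_bytes

# Index overhead per document that carries the field.
bytes_per_doc = total_bytes / docs_with_statement
print(f"{bytes_per_doc:.2f} bytes/doc")  # roughly 3.85 bytes per doc
```

So indexing the field costs on the order of a few bytes per span, which supports the conclusion that storage isn't a concern here.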
Is there a way to measure the overhead?
@tobiasstadler using the Elasticsearch _stats API like in https://github.com/elastic/apm-server/issues/6586#issuecomment-971247963 (comparing before & after). However, I just saw that in Mat's results in https://github.com/elastic/apm-server/issues/6586#issuecomment-971285980 the indexing stats are all zero, so I'm not exactly sure what steps are necessary here.
Sorry, but I do not have any "before" data.
No worries, we'll run some experiments. Thanks all the same :)
This will kinda be handled as a side-effect of https://github.com/elastic/apm-server/issues/11528, and the general move to dynamic mapping. I say "kinda" because the dynamic mapping rules will map this field as a keyword field, and not text. Still, it will be searchable (albeit more slowly), and folks will still be able to override this with a custom component template.
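For instance, overriding the field back to text via a custom component template might look something like this (a sketch only; the component template name follows the usual `@custom` convention for the traces data stream, but check the template names your deployment actually uses):

```json
PUT _component_template/traces-apm@custom
{
  "template": {
    "mappings": {
      "properties": {
        "span.db.statement": {
          "type": "text"
        }
      }
    }
  }
}
```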
See also https://github.com/elastic/apm-server/pull/12098