apm-server icon indicating copy to clipboard operation
apm-server copied to clipboard

Index span.db.statement as text

Open matschaffer opened this issue 4 years ago • 24 comments

The db statement is really useful, but it would be even more amazingly useful if we could search on it.

So by adding this mapping:

Screen Shot 2021-11-11 at 13 59 51

We should be able to search for spans which have a certain db query structure.

Screen Shot 2021-11-11 at 14 05 04

One gotcha I seem to be hitting in this test is that I can't search for jvm or heap . Seems like maybe I'm catching an "ignore_above" but I didn't think they were supposed to kick in on text fields.

Either way, I imagine it'd be quite useful to do things like highlight traces that involved specific query structures (indices, tables, etc).

matschaffer avatar Nov 11 '21 05:11 matschaffer

@matschaffer thanks for opening.

Originally this field was intentionally not indexed, to avoid incurring high storage cost. I would like to revisit this decision, ideally with some data to show the cost is worthwhile.

Would match_only_text be reasonable for your needs?

axw avatar Nov 11 '21 07:11 axw

For the purpose I had above, definitely.

Though I could imagine cases where ordering of word in the statement is important. For example if it were a sql query, you might be interested in if the table appeared before or after the join.

Just conjecture on that though.

matschaffer avatar Nov 15 '21 02:11 matschaffer

As for the cost, have we gotten hard numbers on with/without? Seems like maybe we could add the mapping to a large index and check before/after. I'm sure ESS internal clusters have lots of large indices we could borrow for such a purpose.

matschaffer avatar Nov 15 '21 02:11 matschaffer

As for the cost, have we gotten hard numbers on with/without?

No, that's really the only thing holding us back at the moment.

axw avatar Nov 15 '21 02:11 axw

Cool. I was able to simply add the mapping to get the functionality. I didn't check on-disk size before/after, but I could. Is there a particular disk number you think would be most useful? Is the store size on https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html sufficient you think?

matschaffer avatar Nov 15 '21 02:11 matschaffer

I think indexing and store are most interesting. It would also be helpful to know the proportion of docs that have a span.db.statement field.

axw avatar Nov 15 '21 02:11 axw

I did a little poking at this today. I think we can probably get the ratio with:

GET apm-7.15.1-span-000001/_search
{
  "size": 0,
  "aggs": {
    "wrapper": {
      "terms": {
        "field": "_index",
        "size": 10
      },
      "aggs": {
        "with_db_statement": {
          "filter": {
            "exists": {
              "field": "span.db.statement"
            }
          }
        },
        "total": {
          "filter": {
            "match": {
              "_index": "*"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "numerator": "with_db_statement>_count",
              "denominator": "total>_count"
            },
            "script": "params.numerator / params.denominator"
          }
        }
      }
    }
  }
}

And the store sizing info with:

GET apm-7.15.1-span-000001/_stats/indexing,store

matschaffer avatar Nov 17 '21 06:11 matschaffer

Here's one from us-east-1 staging metrics cluster

Before:

GET apm-7.15.0-span-000048/_stats/indexing,store
{
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "store" : {
        "size_in_bytes" : 3665198005,
        "total_data_set_size_in_bytes" : 3665198005,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    },
    "total" : {
      "store" : {
        "size_in_bytes" : 7330877145,
        "total_data_set_size_in_bytes" : 7330877145,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    }
  },
  "indices" : {
    "apm-7.15.0-span-000048" : {
      "uuid" : "KMHvD2_mQ9OYg2Eli4cq9w",
      "primaries" : {
        "store" : {
          "size_in_bytes" : 3665198005,
          "total_data_set_size_in_bytes" : 3665198005,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      },
      "total" : {
        "store" : {
          "size_in_bytes" : 7330877145,
          "total_data_set_size_in_bytes" : 7330877145,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      }
    }
  }
}

Mapping:

PUT apm-7.15.0-span-000048/_mapping
{
  "properties": {
    "span": {
      "properties": {
        "db": {
          "properties": {
            "statement": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}

After:

GET apm-7.15.0-span-000048/_stats/indexing,store
{
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "store" : {
        "size_in_bytes" : 3665198005,
        "total_data_set_size_in_bytes" : 3665198005,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    },
    "total" : {
      "store" : {
        "size_in_bytes" : 7330877145,
        "total_data_set_size_in_bytes" : 7330877145,
        "reserved_in_bytes" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "index_failed" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      }
    }
  },
  "indices" : {
    "apm-7.15.0-span-000048" : {
      "uuid" : "KMHvD2_mQ9OYg2Eli4cq9w",
      "primaries" : {
        "store" : {
          "size_in_bytes" : 3665198005,
          "total_data_set_size_in_bytes" : 3665198005,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      },
      "total" : {
        "store" : {
          "size_in_bytes" : 7330877145,
          "total_data_set_size_in_bytes" : 7330877145,
          "reserved_in_bytes" : 0
        },
        "indexing" : {
          "index_total" : 0,
          "index_time_in_millis" : 0,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        }
      }
    }
  }
}

Ratio:

GET apm-7.15.0-span-000048/_search

Request:

{
  "size": 0,
  "aggs": {
    "wrapper": {
      "terms": {
        "field": "_index",
        "size": 10
      },
      "aggs": {
        "with_db_statement": {
          "filter": {
            "exists": {
              "field": "span.db.statement"
            }
          }
        },
        "total": {
          "filter": {
            "match": {
              "_index": "*"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "numerator": "with_db_statement>_count",
              "denominator": "total>_count"
            },
            "script": "params.numerator / params.denominator"
          }
        }
      }
    }
  }
}

Response:

{
  "size": 0,
  "aggs": {
    "wrapper": {
      "terms": {
        "field": "_index",
        "size": 10
      },
      "aggs": {
        "with_db_statement": {
          "filter": {
            "exists": {
              "field": "span.db.statement"
            }
          }
        },
        "total": {
          "filter": {
            "match": {
              "_index": "*"
            }
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "numerator": "with_db_statement>_count",
              "denominator": "total>_count"
            },
            "script": "params.numerator / params.denominator"
          }
        }
      }
    }
  }
}

Sadly it looks like this index had no db statement spans 😆

I'll see if I can dig around for a sample apm span index with more db statements around.

matschaffer avatar Nov 17 '21 07:11 matschaffer

Wonder if it might make sense to use @dgieselaar 's apm synthtrace to just generate a bunch of random db statements as part of the span data. Not as nice as "real" but looks like our biggest "real" data source doesn't record db statements at all today.

matschaffer avatar Nov 17 '21 07:11 matschaffer

Sadly it looks like this index had no db statement spans laughing

@matschaffer oh, bummer. Thanks for digging into it anyway.

Wonder if it might make sense to use @dgieselaar 's apm synthtrace to just generate a bunch of random db statements as part of the span data. Not as nice as "real" but looks like our biggest "real" data source doesn't record db statements at all today.

++

I'd like to use synthtrace (some kind of generator) to generate APM events, pass them through apm-server, and then use the resulting docs in Rally (https://github.com/elastic/apm-server/issues/6115). Then we can measure the impact of mapping changes like, while keeping the results in sync with the changes in how apm-server structures docs.

axw avatar Nov 17 '21 07:11 axw

Btw. from 7.15 you can get the actual disk usage per field, which is really useful to test mapping changes. https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-disk-usage.html

henrikno avatar Nov 18 '21 01:11 henrikno

Also wanted to note, match_only_text does support phrase queries, it'll just be slower since it needs to look at _source to verify ordering.

henrikno avatar Nov 18 '21 02:11 henrikno

I wanted to note that span.db.statement may contain sensitive information.

tobiasstadler avatar Nov 18 '21 09:11 tobiasstadler

@tobiasstadler we already store span.db.statement in the Elasticsearch document, it's just that we don't index it for fast searching. It's already possible to search the field using runtime fields by defining a runtime field in the search request, so I think the only difference would be the speed of the search. Is there something else about indexing it that increases security risk?

axw avatar Nov 18 '21 09:11 axw

No, I don't thin so. Sorry I didn't know/forgot that the value is stored already.

tobiasstadler avatar Nov 18 '21 09:11 tobiasstadler

No worries! Thanks for raising your concern anyway.

axw avatar Nov 18 '21 09:11 axw

I did the following in our test cluster:

PUT /apm-7.15.0-span-000001/_mapping
{
  "properties": {
    "span": {
      "properties": {
        "db": {
          "properties": {
            "statement": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}

and

 POST /apm-7.15.0-span-000001/_update_by_query

Now

POST /apm-7.15.0-span-000001/_disk_usage?run_expensive_tasks=true

results in

...

"span.db.statement" : {
        "total" : "49.9mb",
        "total_in_bytes" : 52365481,
        "inverted_index" : {
          "total" : "35mb",
          "total_in_bytes" : 36720077
        },
        "stored_fields" : "0b",
        "stored_fields_in_bytes" : 0,
        "doc_values" : "0b",
        "doc_values_in_bytes" : 0,
        "points" : "0b",
        "points_in_bytes" : 0,
        "norms" : "14.9mb",
        "norms_in_bytes" : 15645404,
        "term_vectors" : "0b",
        "term_vectors_in_bytes" : 0
      },
...

apm-7.15.0-span-000001 has 13610386 docs with span.db.statement.

I hope this helps.

tobiasstadler avatar Nov 18 '21 14:11 tobiasstadler

Thanks @tobiasstadler! Those numbers suggest we have nothing to be concerned about, storage wise. It would be helpful to know what the indexing overhead is (I expect it's also fine).

axw avatar Nov 18 '21 23:11 axw

Is there a way to measure to overhead?

tobiasstadler avatar Nov 19 '21 06:11 tobiasstadler

@tobiasstadler using the Elasticsearch _stats API like in https://github.com/elastic/apm-server/issues/6586#issuecomment-971247963 (comparing before & after). However, I just saw that Mat's results in https://github.com/elastic/apm-server/issues/6586#issuecomment-971285980 and the indexing stats are all zero, so I'm not exactly sure what steps are necessary here.

axw avatar Nov 19 '21 07:11 axw

Sorry, but I do not have any "before" data.

tobiasstadler avatar Nov 19 '21 07:11 tobiasstadler

No worries, we'll run some experiments. Thanks all the same :)

axw avatar Nov 19 '21 07:11 axw

This will kinda be handled as a side-effect of https://github.com/elastic/apm-server/issues/11528, and the general move to dynamic mapping. I say "kinda" because the dynamic mapping rules will map this field as a keyword field, and not text. Still, it will be searchable (albeit more slowly), and folks will still be able to override this with a custom component template.

axw avatar Nov 23 '23 07:11 axw

See also https://github.com/elastic/apm-server/pull/12098

axw avatar Nov 23 '23 07:11 axw