Tag indexing whitelist
After switching to Span V2, Zipkin stores all {,binary} annotations as tags and saves them for indexing if the value is less than 256 bytes long. This is sometimes problematic, because some tag values are highly variable (bad for the indexer) and are recorded only to give more context when debugging a span (e.g. HTTP URI or message ID).
It would be good if Zipkin had a configurable whitelist of tags to be indexed, or if the spec had something like binary annotations.
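To make the request concrete, here is a minimal sketch of what such a whitelist could do at write time. `TagIndexWhitelist`, `indexableTags`, and the tag keys used in the usage note are hypothetical names for illustration only, not existing Zipkin APIs; the 256-byte limit is the one described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of Zipkin: picks the tags that would be sent to the index. */
final class TagIndexWhitelist {
  // From the description above: values of 256 bytes or more are not indexed.
  static final int MAX_INDEXED_VALUE_BYTES = 256;

  /** Returns only the tags whose key is whitelisted and whose value is short enough. */
  static Map<String, String> indexableTags(Map<String, String> tags, Set<String> whitelist) {
    Map<String, String> result = new LinkedHashMap<>();
    for (Map.Entry<String, String> tag : tags.entrySet()) {
      boolean allowed = whitelist.contains(tag.getKey());
      boolean shortEnough =
          tag.getValue().getBytes(StandardCharsets.UTF_8).length < MAX_INDEXED_VALUE_BYTES;
      if (allowed && shortEnough) {
        result.put(tag.getKey(), tag.getValue());
      }
    }
    return result;
  }
}
```

With a whitelist containing only `http.method`, a span that also carries an `http.uri` tag would still store both tags, but only `http.method` would reach the indexer.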
Thanks for raising this; you aren't the first. I think the implementation would work on the key as used in annotationQuery in the HTTP API, which can be either a tag key or a timestamped annotation. I think a list of these will be fine (for implementations that do manual indexing, like ES and Cassandra).
cc @llinder @michaelsembwever @anuraaga @devinsba
I'm starting to wonder if an indexconfig type wouldn't be helpful, because right now, we have the following indexes:
- serviceName
- spanName
- remoteServiceName
- annotations (implicitly non-core, i.e. not "cs", "sr", etc.)
- tags (including auto-complete controlled by key name)
In all cases, these are keywords, so in Java a keyword-based predicate could be used.
Ex.
- KeywordFilter.NONE (disable annotation indexing)
- KeywordFilter.SHORT (<256 characters, which is the current constraint)
- KeywordFilter.in(list) (would match auto-complete tags)
While I mention this in Java, the above could map to properties or a JSON struct. Either way, a structure could allow disabling certain things like spanName (as requested by @mrajah-twttr) or controlling remoteServiceName (as requested by @drolando).
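As a rough sketch of the idea (nothing here exists in Zipkin today), the predicate and a per-index config could look like the following. `KeywordFilter` and its three variants come from the list above; the `IndexConfig` class and its field names are assumptions made up for illustration.

```java
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch: decides whether a keyword (tag key, annotation, name) is indexed. */
interface KeywordFilter extends Predicate<String> {
  /** Indexes nothing, e.g. to disable annotation indexing. */
  KeywordFilter NONE = keyword -> false;

  /** Indexes keywords shorter than 256 characters, which is the current constraint. */
  KeywordFilter SHORT = keyword -> keyword.length() < 256;

  /** Indexes only the listed keywords, e.g. the auto-complete tag keys. */
  static KeywordFilter in(Set<String> allowed) {
    Set<String> copy = Set.copyOf(allowed); // defensive, immutable copy
    return copy::contains;
  }
}

/** Hypothetical holder with one filter per index; field names are illustrative only. */
final class IndexConfig {
  final KeywordFilter spanName;
  final KeywordFilter remoteServiceName;
  final KeywordFilter annotations;
  final KeywordFilter tags;

  IndexConfig(KeywordFilter spanName, KeywordFilter remoteServiceName,
      KeywordFilter annotations, KeywordFilter tags) {
    this.spanName = spanName;
    this.remoteServiceName = remoteServiceName;
    this.annotations = annotations;
    this.tags = tags;
  }
}
```

For example, `new IndexConfig(KeywordFilter.SHORT, KeywordFilter.SHORT, KeywordFilter.NONE, KeywordFilter.in(Set.of("error")))` would keep name indexing as it is, turn annotation indexing off, and index only the `error` tag.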
This would also give us the chance to disable annotation indexing, which has probably proven more of a burden than a help. We've never had a means to disable keyword searching for annotations, and indexing them could easily add useless load when tools like OpenTelemetry start adding more annotations than we'd use or expect.
For example, OpenTelemetry events literally embed JSON in the annotation name, yet can still be under the 255-character limit, adding load to Cassandra and Elasticsearch. This is not unlike problems we had with OpenTracing.
Concretely, this literal example of OpenTelemetry usage would end up indexed in ES or Cassandra:
"my-event-name": { "key1" : "value1", "key2": "value2" }
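A small sketch of why this slips through, assuming the only check is the length constraint mentioned earlier (under 256 characters); `ShortAnnotationExample` and `wouldBeIndexed` are hypothetical names, not Zipkin code:

```java
/** Hypothetical illustration of the current length-only constraint. */
final class ShortAnnotationExample {
  // Assumption taken from this thread: annotations under 256 characters are indexed.
  static final int MAX_INDEXED_LENGTH = 256;

  static boolean wouldBeIndexed(String annotationValue) {
    return annotationValue.length() < MAX_INDEXED_LENGTH;
  }

  public static void main(String[] args) {
    // The OpenTelemetry event above, as a single annotation value.
    String event = "\"my-event-name\": { \"key1\" : \"value1\", \"key2\": \"value2\" }";
    System.out.println("indexed: " + wouldBeIndexed(event));
  }
}
```

The event is well under the limit, so a length check alone cannot tell it apart from a useful keyword.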
So in short, it might make sense to create a keyword-based struct that at least allows disabling annotation indexing; otherwise we should expect much more low-value load caused by OpenTelemetry.
cc @openzipkin/core