
Zipkin API - For each service in Zipkin, returning total number of traces where the service is the root

Open msmsimondean opened this issue 3 years ago • 3 comments

Feature

A new Zipkin API feature for retrieving, for each service, the number of traces and/or spans where that service is the root.

Rationale

When the amount of Elasticsearch diskspace used has grown over time (e.g. only X GB of diskspace now remains), this would help with working out which services (via their traffic and sampling rate) are "using" the most space in Zipkin's Elasticsearch indices.

At the moment, I have to get direct access to Elasticsearch and run manual queries to achieve this. It would be great if this could be done via the Zipkin API instead; that way it could be used both manually and in an automated fashion. For automation, custom checks/alerts and dashboards could use the data returned by the Zipkin API.
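To make the shape of the feature concrete, here is a sketch of what such an endpoint could return. Everything below is hypothetical - the path, query parameter, service names and payload fields are illustrative only, not an existing Zipkin API:

```
GET https://some-zipkin-host/api/v2/traceCountsByRootService?lookback=86400000

[
  { "serviceName": "checkout", "rootTraceCount": 12345 },
  { "serviceName": "inventory", "rootTraceCount": 678 }
]
```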

Example Scenario

See Rationale above.

Prior Art

I will reply on this issue with an example Elasticsearch query showing how I currently find out this data.

msmsimondean avatar Mar 12 '21 13:03 msmsimondean

Here are the Elasticsearch search and node.js script I use at the moment to get this data (since it can't be obtained through Zipkin itself):

Elasticsearch Search

POST https://some-elasticsearch-host/zipkin-span-*/_search
 
{
  "size": 0,
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "parentId"
        }
      }
    }
  },
  "aggs": {
    "filter": {
      "terms": {
        "field": "_index",
        "size": 1000
      },
      "aggs": {
        "localEndpoint.serviceName": {
          "terms": {
            "field": "localEndpoint.serviceName",
            "size": 1000
          }
        }
      }
    }
  }
}

node.js script for converting Elasticsearch search response into CSV

'use strict';
 
const fs = require('fs')
 
const results = JSON.parse(fs.readFileSync('zipkin_elasticsearch_service_trace_counts.json'))
const services = {}
const dateTextSet = new Set()
 
// One bucket per daily zipkin-span index; pull the date out of the index name
results.aggregations.filter.buckets.forEach(indexBucket => {
    const dateText = /^zipkin[-a-z0-9]*-span-(\d{4}-\d{2}-\d{2})$/.exec(indexBucket.key)[1]
    dateTextSet.add(dateText)
    indexBucket['localEndpoint.serviceName'].buckets.forEach(serviceBucket => {
        const serviceName = serviceBucket.key
        let service = services[serviceName]
 
        if (!service) {
            service = {}
            services[serviceName] = service
        }
 
        service[dateText] = serviceBucket.doc_count
    })
})
 
const dateTexts = Array.from(dateTextSet).sort()
const csv = []
csv.push('service,' + dateTexts.join()) // header row: service name, then one column per index date
 
Object.keys(services).sort().forEach(serviceName => {
    let line = serviceName
    const service = services[serviceName]
    dateTexts.forEach(dateText => {
        const count = service[dateText]
        line += `,${count ? count : 0}`
    })
    csv.push(line)
})
 
fs.writeFileSync('zipkin_elasticsearch_service_trace_counts.csv', csv.join('\n'))
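To sanity-check the conversion logic above without a live cluster, here is a self-contained version of the same aggregation-to-CSV transformation run against a small fabricated response (the index names, service names and counts below are made up):

```javascript
'use strict';

// Fabricated Elasticsearch aggregation response (made-up indices, services, counts)
const results = {
  aggregations: {
    filter: {
      buckets: [
        {
          key: 'zipkin-span-2021-03-01',
          'localEndpoint.serviceName': {
            buckets: [
              { key: 'svc-a', doc_count: 5 },
              { key: 'svc-b', doc_count: 2 }
            ]
          }
        },
        {
          key: 'zipkin-span-2021-03-02',
          'localEndpoint.serviceName': {
            buckets: [{ key: 'svc-a', doc_count: 7 }]
          }
        }
      ]
    }
  }
};

// Same pivot logic as the script above: service -> date -> root-span count
const services = {};
const dateTextSet = new Set();

results.aggregations.filter.buckets.forEach(indexBucket => {
  const dateText = /^zipkin[-a-z0-9]*-span-(\d{4}-\d{2}-\d{2})$/.exec(indexBucket.key)[1];
  dateTextSet.add(dateText);
  indexBucket['localEndpoint.serviceName'].buckets.forEach(serviceBucket => {
    const service = services[serviceBucket.key] || (services[serviceBucket.key] = {});
    service[dateText] = serviceBucket.doc_count;
  });
});

const dateTexts = Array.from(dateTextSet).sort();
const csv = ['service,' + dateTexts.join()];

Object.keys(services).sort().forEach(serviceName => {
  const counts = dateTexts.map(d => services[serviceName][d] || 0);
  csv.push([serviceName, ...counts].join());
});

console.log(csv.join('\n'));
// service,2021-03-01,2021-03-02
// svc-a,5,7
// svc-b,2,0
```

Note that services absent from an index's buckets (svc-b on 2021-03-02 here) get an explicit 0 in the CSV, which keeps the columns aligned across dates.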

msmsimondean avatar Mar 12 '21 16:03 msmsimondean

Hi, thanks for the suggestion. Is this feature more related to being able to do housekeeping on the ES storage?

jorgheymans avatar Apr 01 '21 06:04 jorgheymans

Hi @jorgheymans. Not really housekeeping of ES storage as such; it wouldn't be for finding out what can be deleted from ES. It's more for confirming there's no unexpected/stray tracing by services/components, e.g. finding services/components that are producing more traces than you would expect, which could be due to things like:

  1. A service/component has its sample rate/probability set too high
  2. A bug that is causing a service/component to generate too many traces/spans - e.g. recording the same thing multiple times, or using spans that are too fine-grained

Having written the above, it would probably be useful for the Zipkin API to return both i) the number of traces instigated by each service/component (i.e. where that service/component is at the root of the trace) and ii) the number of spans recorded by each service/component.
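For ii), a sketch of how the span-count half could be computed today against the same zipkin-span-* indices: it's the earlier search with the must_not clause on parentId removed, so every span (not just root spans) is counted per service. The host is a placeholder:

```
POST https://some-elasticsearch-host/zipkin-span-*/_search

{
  "size": 0,
  "aggs": {
    "localEndpoint.serviceName": {
      "terms": {
        "field": "localEndpoint.serviceName",
        "size": 1000
      }
    }
  }
}
```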

Based on what the API showed, outcomes would be things like:

  1. Increasing ES storage capacity as we need to store more genuine traces/spans
  2. Decreasing sample rates/probabilities
  3. Fixing technical issues causing errant traces/spans

msmsimondean avatar Apr 06 '21 19:04 msmsimondean