sybil icon indicating copy to clipboard operation
sybil copied to clipboard

Make distributed queries a reality

Open okayzed opened this issue 7 years ago • 1 comments

Distributed sybil

This is an umbrella task for working towards distributed queries with sybil as an aggregator.

Steps to get there

Deployment

  • setting up docker / k8s for deploying sybil + a small network wrapper for ingesting and querying

Ingestion

  • scribe (or kafka) pipes + ingestors

Aggregation

  • aggregation of results from leaf nodes
  • sharding + recovery of downed nodes
  • client for querying from master aggregator

Pruning

  • distributed pruning of data - should be a histogram query on the time + size fields followed by a trim command: "delete all data before X time", where X is deduced from the histogram query.

Current Problem Areas

  • TBD

Fixed Areas

  • assumes same min/max values for int columns, so histogram bins are even sized on one leaf, but may be different across leaves. need to switch to more mergeable histograms when sending result from leaf -> aggregator (like t-digest)

  • large cardinality + count distinct: currently implemented as len(Results). obviously not going to work for high cardinality

  • large cardinality time series: if we make time series for top 10 on a leaf, we still need to send a buffer of extra series just in case to the aggregator.

okayzed avatar Oct 10 '17 17:10 okayzed

Steps that we've taken to get to distributed queries

  • lower mem usage during aggregation
  • switch to using loglogbeta for count distinct queries (definitely saves RAM)
  • add encoder for writing querySpec to stdout as binary
  • add aggregator for reading query specs from a dir and aggregating them

okayzed avatar Nov 15 '17 17:11 okayzed