Make distributed queries a reality
Distributed sybil
This is an umbrella task for working towards distributed queries with sybil as an aggregator.
Steps to get there
Deployment
- setting up docker / k8s for deploying sybil, plus a small network wrapper for ingesting and querying (a sketch of the wrapper follows)
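A rough sketch of what the network wrapper could be: a tiny HTTP shim that shells out to the sybil binary, one endpoint per direction. The endpoints, port, and exact flags here are illustrative, not a settled interface.

```go
package main

import (
	"net/http"
	"os/exec"
)

func main() {
	// POST /ingest?table=foo with newline-delimited JSON records in the body;
	// the body is piped straight into `sybil ingest`.
	http.HandleFunc("/ingest", func(w http.ResponseWriter, r *http.Request) {
		cmd := exec.Command("sybil", "ingest", "-table", r.URL.Query().Get("table"))
		cmd.Stdin = r.Body
		if out, err := cmd.CombinedOutput(); err != nil {
			http.Error(w, string(out), http.StatusInternalServerError)
		}
	})

	// GET /query?table=foo runs a (canned) query and streams the JSON back.
	http.HandleFunc("/query", func(w http.ResponseWriter, r *http.Request) {
		cmd := exec.Command("sybil", "query", "-table", r.URL.Query().Get("table"), "-json")
		cmd.Stdout = w
		if err := cmd.Run(); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
		}
	})

	http.ListenAndServe(":8000", nil)
}
```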
Ingestion
- scribe (or kafka) pipes + ingestors (see the consumer sketch below)
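For the kafka side, the ingestor could be a thin consumer loop that batches messages and pipes them into `sybil ingest`. The sketch below uses the segmentio/kafka-go client; the broker, topic, table name, and batch threshold are all placeholders.

```go
package main

import (
	"bytes"
	"context"
	"log"
	"os/exec"

	"github.com/segmentio/kafka-go"
)

func main() {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "sybil-ingest",
		Topic:   "events",
	})
	defer r.Close()

	batch := &bytes.Buffer{}
	for {
		msg, err := r.ReadMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		// Assumes each message value is one JSON record.
		batch.Write(msg.Value)
		batch.WriteByte('\n')

		// Flush to sybil once the batch is big enough. A real version needs
		// commit-after-flush to get at-least-once semantics.
		if batch.Len() > 1<<20 {
			cmd := exec.Command("sybil", "ingest", "-table", "events")
			cmd.Stdin = batch
			if out, err := cmd.CombinedOutput(); err != nil {
				log.Printf("ingest failed: %s", out)
			}
			batch.Reset()
		}
	}
}
```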
Aggregation
- aggregation of results from leaf nodes (see the fan-out sketch after this list)
- sharding + recovery of downed nodes
- client for querying from the master aggregator
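A minimal sketch of the fan-out half of this: the master queries every leaf (via the network wrapper above) in parallel and merges per-group counts. The leaf URLs and the result row shape are assumptions, and real merging also has to handle hists, averages, and the other op types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
)

// LeafResult is a hypothetical per-group row returned by a leaf's /query endpoint.
type LeafResult struct {
	Group string `json:"group"`
	Count int64  `json:"count"`
}

func main() {
	leaves := []string{"http://leaf1:8000", "http://leaf2:8000"}
	merged := map[string]int64{}

	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, leaf := range leaves {
		wg.Add(1)
		go func(leaf string) {
			defer wg.Done()
			resp, err := http.Get(leaf + "/query?table=events")
			if err != nil {
				return // downed leaf: this is where retry / shard re-routing goes
			}
			defer resp.Body.Close()

			var rows []LeafResult
			if json.NewDecoder(resp.Body).Decode(&rows) == nil {
				mu.Lock()
				for _, row := range rows {
					merged[row.Group] += row.Count
				}
				mu.Unlock()
			}
		}(leaf)
	}
	wg.Wait()
	fmt.Println(merged)
}
```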
Pruning
- distributed pruning of data - should be a histogram query on the time + size fields, followed by a trim command ("delete all data before X time"), where X is deduced from the histogram query (sketched below)
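A per-leaf sketch of that flow, assuming a histogram result keyed by time bucket with a per-bucket size; the JSON shape and the trim flags here are assumptions, not sybil's actual interface.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// Bucket is a hypothetical histogram row: a time bucket and its byte size.
type Bucket struct {
	Time  int64 `json:"time"`
	Bytes int64 `json:"bytes"`
}

func main() {
	const budget = 10 << 30 // keep ~10GB per leaf

	out, err := exec.Command("sybil", "query", "-table", "events",
		"-int", "time", "-op", "hist", "-json").Output()
	if err != nil {
		panic(err)
	}

	var buckets []Bucket // assumed sorted newest -> oldest
	if err := json.Unmarshal(out, &buckets); err != nil {
		panic(err)
	}

	// Walk from newest to oldest until the budget is spent; everything
	// older than that bucket is the "delete all data before X" cutoff.
	var used, cutoff int64
	for _, b := range buckets {
		used += b.Bytes
		cutoff = b.Time
		if used > budget {
			break
		}
	}
	if used <= budget {
		return // everything fits, nothing to prune
	}

	fmt.Printf("trimming data before %d\n", cutoff)
	// Hypothetical trim invocation; the real trim flags may differ.
	exec.Command("sybil", "trim", "-table", "events",
		"-before", fmt.Sprint(cutoff)).Run()
}
```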
Current Problem Areas
- TBD
Fixed Areas
- assumes same min/max values for int columns, so histogram bins are evenly sized on one leaf but may differ across leaves; need to switch to more mergeable histograms (like t-digest) when sending results from leaf -> aggregator (see the centroid sketch below)
- large cardinality + count distinct: currently implemented as len(Results), which is obviously not going to work for high cardinality (see the count-distinct sketch below)
- large cardinality time series: if we make time series for the top 10 on a leaf, we still need to send a buffer of extra series to the aggregator just in case (see the top-k sketch below)
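On the histogram fix: the core idea of a t-digest-style mergeable histogram is that leaves ship centroids (mean, weight) instead of fixed bins, and the aggregator concatenates, sorts, and re-compresses. A stripped-down sketch of just that merge step (a real implementation should use a proper t-digest library):

```go
package main

import (
	"fmt"
	"sort"
)

type Centroid struct {
	Mean   float64
	Weight float64
}

// merge combines centroid sets from several leaves, then compresses the
// result down to maxCentroids by folding the closest adjacent pair together.
func merge(leaves [][]Centroid, maxCentroids int) []Centroid {
	var all []Centroid
	for _, l := range leaves {
		all = append(all, l...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Mean < all[j].Mean })

	for len(all) > maxCentroids {
		best, bestGap := 0, all[1].Mean-all[0].Mean
		for i := 1; i < len(all)-1; i++ {
			if gap := all[i+1].Mean - all[i].Mean; gap < bestGap {
				best, bestGap = i, gap
			}
		}
		// Fold the pair into a weighted average, preserving total weight.
		a, b := all[best], all[best+1]
		w := a.Weight + b.Weight
		all[best] = Centroid{(a.Mean*a.Weight + b.Mean*b.Weight) / w, w}
		all = append(all[:best+1], all[best+2:]...)
	}
	return all
}

func main() {
	leaf1 := []Centroid{{1, 10}, {5, 20}, {9, 5}}
	leaf2 := []Centroid{{2, 8}, {7, 30}} // different bin edges than leaf1
	fmt.Println(merge([][]Centroid{leaf1, leaf2}, 3))
}
```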
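On count distinct: the reason a sketch fixes this is that fixed-size register arrays merge exactly across leaves (elementwise max), whereas len(Results) can't be combined without shipping every distinct value to the aggregator. Below is a stripped-down HLL-style toy showing the merge property; loglogbeta is this idea plus better bias correction.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/bits"
)

const p = 12     // 2^12 = 4096 registers, ~1.6% standard error
const m = 1 << p

type Sketch [m]uint8

func (s *Sketch) Add(value string) {
	h := fnv.New64a()
	h.Write([]byte(value))
	x := h.Sum64()
	idx := x >> (64 - p)                           // first p bits pick a register
	rank := uint8(bits.LeadingZeros64(x<<p|1)) + 1 // leading zeros of the rest
	if rank > s[idx] {
		s[idx] = rank
	}
}

// Merge is exact: the union's registers are the elementwise max, so leaves
// can each build a sketch locally and the aggregator just merges them.
func (s *Sketch) Merge(other *Sketch) {
	for i, v := range other {
		if v > s[i] {
			s[i] = v
		}
	}
}

// Estimate is the raw harmonic-mean estimator, with none of the small/large
// range corrections that loglogbeta adds.
func (s *Sketch) Estimate() float64 {
	sum := 0.0
	for _, v := range s {
		sum += math.Pow(2, -float64(v))
	}
	return 0.7213 / (1 + 1.079/m) * m * m / sum
}

func main() {
	var leaf1, leaf2 Sketch
	for i := 0; i < 5000; i++ {
		leaf1.Add(fmt.Sprint("user", i))
	}
	for i := 2500; i < 7500; i++ {
		leaf2.Add(fmt.Sprint("user", i))
	}
	leaf1.Merge(&leaf2)
	fmt.Printf("true 7500, estimated %.0f\n", leaf1.Estimate())
}
```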
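On top-k time series: a key that's globally top 10 can rank below 10 on every individual leaf, so each leaf has to over-fetch. A toy sketch of the over-fetch + re-rank step; the overfetch factor is a heuristic, not an exactness guarantee.

```go
package main

import (
	"fmt"
	"sort"
)

// topK returns the k keys with the largest totals.
func topK(totals map[string]int64, k int) []string {
	keys := make([]string, 0, len(totals))
	for key := range totals {
		keys = append(keys, key)
	}
	sort.Slice(keys, func(i, j int) bool { return totals[keys[i]] > totals[keys[j]] })
	if len(keys) > k {
		keys = keys[:k]
	}
	return keys
}

func main() {
	const k, overfetch = 2, 3

	// Per-leaf series totals; "d" is only rank 4 on leaf1 but wins globally.
	leaf1 := map[string]int64{"a": 100, "b": 90, "c": 85, "d": 84, "e": 10}
	leaf2 := map[string]int64{"d": 95, "c": 20, "a": 6, "b": 1, "e": 1}

	merged := map[string]int64{}
	for _, leaf := range []map[string]int64{leaf1, leaf2} {
		// Each leaf only ships its local top k*overfetch series.
		for _, key := range topK(leaf, k*overfetch) {
			merged[key] += leaf[key]
		}
	}
	fmt.Println(topK(merged, k)) // [d a]
}
```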
Steps we've taken towards distributed queries
- lower mem usage during aggregation
- switch to using loglogbeta for count distinct queries (definitely saves RAM)
- add encoder for writing querySpec to stdout as binary
- add aggregator for reading query specs from a dir and aggregating them (both sketched below)
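A sketch of what that encoder/aggregator pair looks like, using encoding/gob as the binary format; the QuerySpec shape and the *.spec file layout here are stand-ins, not sybil's real ones.

```go
package main

import (
	"encoding/gob"
	"fmt"
	"os"
	"path/filepath"
)

// QuerySpec is a simplified stand-in for the per-query result state.
type QuerySpec struct {
	Results map[string]int64
}

// writeSpec is the leaf side: binary-encode the spec to stdout, so whatever
// is driving the leaf (network wrapper, ssh pipe) can capture it to a file.
func writeSpec(spec *QuerySpec) error {
	return gob.NewEncoder(os.Stdout).Encode(spec)
}

// aggregateDir is the master side: decode every spec file in dir and merge
// the results into one.
func aggregateDir(dir string) (*QuerySpec, error) {
	merged := &QuerySpec{Results: map[string]int64{}}
	files, err := filepath.Glob(filepath.Join(dir, "*.spec"))
	if err != nil {
		return nil, err
	}
	for _, name := range files {
		f, err := os.Open(name)
		if err != nil {
			return nil, err
		}
		var spec QuerySpec
		err = gob.NewDecoder(f).Decode(&spec)
		f.Close()
		if err != nil {
			return nil, err
		}
		for k, v := range spec.Results {
			merged.Results[k] += v
		}
	}
	return merged, nil
}

func main() {
	merged, err := aggregateDir("./specs")
	if err != nil {
		panic(err)
	}
	fmt.Println(merged.Results)
}
```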