tempo Parquet Blocks Production Readiness

Open annanay25 opened this issue 2 years ago • 0 comments

Meta issue to track production readiness of the Apache Parquet block format.

[ ] Fix FindTraceByID [@mdisibio]
- [x] Fix panic when scanning rowgroups with multiple pages
- [x] Current bug with find by ID logic? (https://github.com/grafana/tempo/pull/1531)
[x] Serverless build
- [x] Make a decision on Google Cloud Run vs. Cloud Functions
- [x] Swap serverless querying to vParquet
- [x] Rework our CI pipeline
[ ] Caching [@annanay25]
- [ ] Put metadata/footer in real cache.
[ ] Fine tuning [@joe-elliott]
- [ ] Row group size (100MB? 10000 traces?)
- [ ] Read buffer size/count (1MB x 16 currently)
- [ ] Search shard target bytes per job (30MB, 100MB?)
- [ ] Sort inner trace content: sort batches by span name. Does this significantly improve the common case of searching for a span by name?
[ ] Compaction [@mdisibio]
- [x] Pre-fetch before combining?
- [ ] Fix combiner metrics -- move combiner into versioned encodings
[ ] Performance/Stability testing in highly multitenant environments
[ ] Upstream issues: [@mdisibio]
- [x] https://github.com/segmentio/parquet-go/issues/250
- [x] https://github.com/segmentio/parquet-go/issues/254

[ ] CLI (prints schema & column sizes already) [@mdisibio]
- [x] iterate through a column and debug print all values https://github.com/grafana/tempo/pull/1531
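
To make the fine-tuning candidates above easier to compare, here is a rough sketch of the arithmetic behind two of them: the memory held by read buffers (1MB x 16) and the number of search jobs a block splits into at a given shard target (30MB vs. 100MB). The `tuning` type, its field names, and the 1000MB block size are assumptions for illustration only; they are not Tempo configuration options or real measurements.

```go
package main

import "fmt"

// tuning holds hypothetical knobs; these names are illustrative
// and do not correspond to actual Tempo config fields.
type tuning struct {
	readBufferBytes   int64 // size of one read buffer
	readBufferCount   int64 // number of read buffers per reader
	targetBytesPerJob int64 // search shard target bytes per job
}

// readBufferMemory returns the bytes held by read buffers for one reader.
func (t tuning) readBufferMemory() int64 {
	return t.readBufferBytes * t.readBufferCount
}

// searchJobs returns how many jobs a block of blockBytes splits into,
// using ceiling division so a partial shard still gets its own job.
func (t tuning) searchJobs(blockBytes int64) int64 {
	if t.targetBytesPerJob <= 0 {
		return 0
	}
	return (blockBytes + t.targetBytesPerJob - 1) / t.targetBytesPerJob
}

func main() {
	const mb = int64(1 << 20)
	blockBytes := 1000 * mb // assumed block size, for illustration only

	for _, target := range []int64{30 * mb, 100 * mb} {
		t := tuning{readBufferBytes: mb, readBufferCount: 16, targetBytesPerJob: target}
		fmt.Printf("target=%dMB jobs=%d readBufferMemory=%dMB\n",
			target/mb, t.searchJobs(blockBytes), t.readBufferMemory()/mb)
	}
}
```

A smaller shard target yields more, shorter jobs (more parallelism, more per-job overhead), while a larger one yields fewer, longer jobs. Which side wins is exactly what the tuning items are meant to measure.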

The following items are out of scope for production readiness of Parquet blocks; rather, they form a rough roadmap for the Parquet block format.

[ ] Cache specific resource attribute columns (e.g. cluster, namespace, service.name)?
[ ] Switch WAL over to vParquet
[ ] Dynamic column support
[ ] Implement TraceQL over Parquet blocks.

Jun 09 '22 07:06 annanay25