feat: phlaredb parallel flushing
In our high-traffic Pyroscope ingester, flushing a single head takes 20–30 seconds, and sequentially flushing all heads can take minutes. During this period, new heads arrive, causing memory usage to increase, which keeps the GC busy and may even OOM.
Therefore, implementing parallel flushing is essential.
Changes:
- Enable parallel flushing.
- Fix the unused multierror.
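For context, here is a minimal sketch of the idea behind the change, not the exact diff: flush heads concurrently with bounded parallelism and join all errors, so one failing head does not hide the others. The `flusher` interface, `flushHeads`, and `maxConcurrency` are illustrative names rather than the real PhlareDB symbols.

```go
package phlaredb

import (
	"context"
	"errors"
	"sync"
)

// flusher abstracts the per-head flush call; in PhlareDB this would be
// the head's own flush method (assumed signature).
type flusher interface {
	Flush(ctx context.Context) error
}

// flushHeads flushes all heads concurrently with a bounded number of
// in-flight flushes and reports every failure, not just the first one.
func flushHeads(ctx context.Context, heads []flusher, maxConcurrency int) error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
		sem  = make(chan struct{}, maxConcurrency) // simple concurrency limiter
	)
	for _, h := range heads {
		wg.Add(1)
		go func(h flusher) {
			defer wg.Done()
			sem <- struct{}{} // acquire a slot
			defer func() { <-sem }()
			if err := h.Flush(ctx); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(h)
	}
	wg.Wait()
	return errors.Join(errs...) // nil when every flush succeeded
}
```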
Thank you for the detailed explanation. @kolesnikovae
We have already scaled ingesters horizontally.
Lastly, at any given time, there should be very few heads being flushed (within a single PhlareDB instance/tenant) – typically around three: the current block time window, the next one, and the previous one. However, this depends on the configuration, mainly the ingestion window duration and the maximum block duration. If there are many heads to be flushed for a tenant, it's likely a bug or misconfiguration.
In our scenario, the number of heads being flushed is 7–8.
ts=2025-06-14T22:25:15.68113863+08:00 caller=phlaredb.go:230 level=debug tenant=anonymous msg="flushing heads to disk" reason=max-block-bytes max_size="3.2 GB" current_size="3.3 GB" num_heads=7
ts=2025-06-14T22:31:05.68110972+08:00 caller=phlaredb.go:230 level=debug tenant=anonymous msg="flushing heads to disk" reason=max-block-bytes max_size="3.2 GB" current_size="3.3 GB" num_heads=8
The configuration is as follows:
pyroscopedb:
  max_block_duration: 1m

In addition, defaultParquetConfig.MaxBlockBytes is hard-coded to 3.2 GB.
I think the issue stems from a mismatch between staleHeadTicker := time.NewTimer(util.DurationWithJitter(10*time.Minute, 0.5)) and max_block_duration = 1m, which results in more than three flushing heads (roughly one new head per minute).
I plan to test adjusting max_block_duration to 10 minutes to address this.
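A rough back-of-envelope for that mismatch, assuming util.DurationWithJitter returns a uniform duration within ±50% of the base (so 5–15 minutes here): a new one-minute head is created before the older ones are cleaned up, so roughly 5–15 heads can pile up, which matches the 7–8 seen in the logs above. A toy sketch of the arithmetic (durationWithJitter below is a stand-in, not the real util helper):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// durationWithJitter mimics the assumed semantics of util.DurationWithJitter:
// a uniform duration in [d - d*variance, d + d*variance].
func durationWithJitter(d time.Duration, variance float64) time.Duration {
	delta := time.Duration(float64(d) * variance)
	return d - delta + time.Duration(rand.Int63n(int64(2*delta)+1))
}

func main() {
	maxBlockDuration := time.Minute
	staleCheck := durationWithJitter(10*time.Minute, 0.5) // somewhere in 5m–15m
	fmt.Printf("stale-head check fires after %s; roughly %d one-minute heads can pile up\n",
		staleCheck, int(staleCheck/maxBlockDuration))
}
```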
Hello @liaol,
Indeed, the configuration you shared is quite different from what we had in mind when designing Pyroscope :) Our current solution is built around a few key principles – primarily, keeping larger blocks in memory for longer periods of time, which helps reduce both write and read amplification.
By default, max_block_duration is set to 1h, and many components rely on that assumption. From our experience, things tend to stop working as intended when it's reduced to around 15m, and even a heavily tuned configuration may struggle to compensate for that.
Could you share more details about your deployment? Understanding the challenges you're facing at your scale would help us improve Pyroscope.
ts=2025-06-14T22:25:15.68113863+08:00 caller=phlaredb.go:230 level=debug tenant=anonymous msg="flushing heads to disk" reason=max-block-bytes max_size="3.2 GB" current_size="3.3 GB" num_heads=7
ts=2025-06-14T22:31:05.68110972+08:00 caller=phlaredb.go:230 level=debug tenant=anonymous msg="flushing heads to disk" reason=max-block-bytes max_size="3.2 GB" current_size="3.3 GB" num_heads=8
You may benefit from reverting max_block_duration to something more substantial – e.g., 1h. With that, blocks will still flush based on size, but you'll avoid the overhead of managing too many concurrent heads. In practice, this depends on your ingestion window duration and out-of-order ingestion. It's very expensive when data arrives after the corresponding block (based on the profile's timestamp) has already been flushed.
Also, ingesters scale very well – many of our clusters have well over 100 nodes – and this is the preferred way to operate them in large-scale deployments.
We're also actively working on a new storage backend: a disk-less mode that writes data directly to object storage. This shifts the responsibility for durability to the object store provider and eliminates the need for expensive local disks (assuming a suitable replication factor). I assume your max_block_duration = 1m setting may be trying to achieve something similar.
You can find more here: https://github.com/grafana/pyroscope/tree/main/pkg/experiment. We've already tested it under heavy load (0.5-1GB/s, 10–20k RPS) and are happy with both performance and cost. We're planning to release it later this year.
Please let me know if you're interested – I'd be happy to answer any questions or help you evaluate it in your infrastructure.
Thanks, @kolesnikovae
Setting max_block_duration to 1h works like a charm. It also reduces S3 storage usage.
I assume your max_block_duration = 1m setting may be trying to achieve something similar.
No, it's just a mistake.
Here are more details about our scenario:
- We use opentelemetry-ebpf-profiler to collect profiles from all processes on our 256-core machines. The ebpf agent sends more than 10000 samples (approximately 4 MB after gzip compression) to Pyroscope every 5 seconds. Profiles are aggregated by service.name and routed to ingesters using consistent hashing (sketched after this list).
- Due to performance limitations with the distributor, we send profiles directly to the ingester’s ingester.v1.IngesterService/Push endpoint. See https://github.com/grafana/pyroscope/issues/4133 for context.
- In load tests, a single 16-core, 64GB ingester pod can handle profiles from approximately 500 machines, equating to 100 QPS.
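As a side note for readers, here is a toy illustration of the consistent-hashing routing mentioned in the first bullet. This is not Pyroscope's actual ring implementation, just the general idea of hashing service.name onto a token ring to pick an ingester:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	tokens    []uint32
	ingesters map[uint32]string
}

func newRing(ingesters []string) *ring {
	r := &ring{ingesters: map[uint32]string{}}
	for _, ing := range ingesters {
		t := hash32(ing)
		r.tokens = append(r.tokens, t)
		r.ingesters[t] = ing
	}
	sort.Slice(r.tokens, func(i, j int) bool { return r.tokens[i] < r.tokens[j] })
	return r
}

// pick returns the ingester owning the first token at or after the key's hash.
func (r *ring) pick(serviceName string) string {
	h := hash32(serviceName)
	i := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i] >= h })
	if i == len(r.tokens) {
		i = 0 // wrap around the ring
	}
	return r.ingesters[r.tokens[i]]
}

func hash32(s string) uint32 {
	f := fnv.New32a()
	f.Write([]byte(s))
	return f.Sum32()
}

func main() {
	r := newRing([]string{"ingester-0", "ingester-1", "ingester-2"})
	fmt.Println("my-service ->", r.pick("my-service"))
}
```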
Additionally, I’ve added a compression option for Parquet in Pyroscope, which reduces disk or S3 storage usage by 50%. I’d be happy to submit a PR if contributions are welcome.
Thanks again!
Hi @liaol!
We use opentelemetry-ebpf-profiler to collect profiles from all processes on our 256-core machines. The ebpf agent sends more than 10000 samples (approximately 4 MB after gzip compression) to Pyroscope every 5 seconds. Profiles are aggregated by service.name and routed to ingesters using consistent hashing.
A word of caution: sending profiles too frequently creates unnecessary load. It's better to send 15–30s profiles: many stack traces will get aggregated, and the overall overhead will be significantly lower.
Due to performance limitations with the distributor, we send profiles directly to the ingester’s ingester.v1.IngesterService/Push endpoint. See https://github.com/grafana/pyroscope/issues/4133 for context.
Please note that without the normalization and sanitization performed in distributors, it's very likely that the stored profiles are malformed. This can cause compaction failures (if you use compaction) as well as query failures.
I'd recommend deploying more, smaller distributor instances: e.g., 32 instances with 2 CPUs and 2 GB RAM each. This is how we run Pyroscope ourselves.
Additionally, I’ve added a compression option for Parquet in Pyroscope, which reduces disk or S3 storage usage by 50%. I’d be happy to submit a PR if contributions are welcome.
This is super cool! Yes, PRs are always welcome! We haven't enabled compression because of the CPU costs (both in read and write paths). I'm wondering what algo you're using and how bad the CPU cost is. I think something like snappy, lz4, or zstd might be affordable.
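In case it helps the discussion, here is a minimal sketch of what per-column zstd compression can look like with parquet-go (the Parquet library phlaredb builds on). The Row type below is invented for illustration and is not the PhlareDB schema, and the actual PR may wire compression in differently:

```go
// Sketch only: enabling per-column zstd compression via struct tags.
package main

import (
	"log"
	"os"

	"github.com/parquet-go/parquet-go"
)

type Row struct {
	SeriesID uint64 `parquet:"series_id"`
	Value    int64  `parquet:"value,zstd"` // zstd-compress this column
}

func main() {
	f, err := os.Create("example.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	w := parquet.NewGenericWriter[Row](f)
	if _, err := w.Write([]Row{{SeriesID: 1, Value: 42}}); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil { // Close flushes the file footer
		log.Fatal(err)
	}
}
```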