Allow updating segment partition metadata in consuming segments
We were testing partitioned data in REALTIME tables and found that consuming segments were never queried when the query applied the filter `where <partition_column> = X`. We eventually traced it down to this:
- we have our own stream ingestion plugin that ingests data from multiple Kafka sources
- to do so, it uses an algorithm we wrote to convert the upstream config into a Pinot partition id (an int)
- therefore, `murmur(partition_column) % num_partitions` will never actually equal the partition id
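
To make the mismatch concrete, here is a minimal sketch of partition-based pruning as we understand it. It is not Pinot code: `Segment`, `is_pruned`, `partition_of`, the key, and the partition count are all hypothetical names and values, and CRC32 stands in for the Murmur hash only to keep the example dependency-free.

```python
import zlib

NUM_PARTITIONS = 8  # hypothetical partition count for the table


def partition_of(value, num_partitions=NUM_PARTITIONS):
    # Stand-in for murmur(partition_column) % num_partitions.
    # CRC32 is NOT what Pinot uses; it just keeps the sketch dependency-free.
    return zlib.crc32(value.encode("utf-8")) % num_partitions


class Segment:
    """Toy segment that tracks only its partition metadata."""

    def __init__(self, stream_partition_id, track_data_partitions):
        # Metadata starts with the id assigned by the (custom) stream plugin.
        self.partitions = {stream_partition_id}
        self.track_data_partitions = track_data_partitions

    def index(self, value):
        # Requested behavior: record the partition actually seen in the data.
        if self.track_data_partitions:
            self.partitions.add(partition_of(value))


def is_pruned(segment, value):
    # A filter `partition_column = value` skips segments whose metadata
    # does not contain the partition computed from the value.
    return partition_of(value) not in segment.partitions


key = "customer-42"
# Pick a custom id guaranteed to differ from the hash-based partition.
custom_id = (partition_of(key) + 1) % NUM_PARTITIONS

today = Segment(custom_id, track_data_partitions=False)
today.index(key)
proposed = Segment(custom_id, track_data_partitions=True)
proposed.index(key)

print(is_pruned(today, key))     # True: segment holds the row but is skipped
print(is_pruned(proposed, key))  # False: segment is queried
```

The `today` segment models what we observe on consuming segments; the `proposed` segment models the metadata we see after a segment is sealed, applied while it is still consuming.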
However, we saw that once the segment is sealed, Pinot does update the partition metadata to include both our made-up id and the actual partition it saw in the data. Having the same behavior on consuming segments would be really useful for us.
For redundancy, we actually want to repartition our upstream data into 4 separate sources, not just 1, so each `<partition_column>` value should exist on 4 Pinot partitions. Since Pinot already knows how to update the segment metadata on seal, can it do the same in real time for consuming segments? There should only be 1 thread touching a consuming segment at a time anyway.
The other proposal is to offer a way to disable partitioning on the consuming segment. But this really defeats the purpose of what we're trying to do, as it greatly increases the number of segments we query.