High memory consumption when processing highly-partitioned column-oriented tables
Hi, guys. Processing each column needs a buffer that is three times the block size. Together with other allocations (assuming the default block size), we need at least 98360 bytes to process one column. Additionally, we need at least 16384 bytes to process each partition. So, for example, to process a DynamicSeqScan node over a table with 1000 partitions and 30 columns, we have to allocate and keep at once
1000 * 16384 + 1000 * 30 * 98360 = 2967184000 bytes, or roughly 2967 MB,
of memory.
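
For concreteness, here is a minimal sketch of the estimate above. The constants come from this issue (per-column buffer of ~98360 bytes, i.e. three 32768-byte blocks plus smaller allocations, and ~16384 bytes of per-partition overhead); they are assumptions for illustration, not values pulled from the gpdb source:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed constants from the issue text, not from gpdb code. */
    const long long per_partition = 16384;  /* per-partition overhead, bytes */
    const long long per_column    = 98360;  /* per-column buffers, bytes (~3 * 32768) */
    const long long partitions    = 1000;
    const long long columns       = 30;

    /* Memory held at once while scanning all partitions. */
    long long total = partitions * per_partition
                    + partitions * columns * per_column;

    printf("%lld bytes (~%lld MB)\n", total, total / (1000 * 1000));
    return 0;
}
```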
Simply destroying the datumstreams after processing each partition is not a solution, because of ReScans. And this line points us to this commit, which explains the new behavior.
My question is: was this high memory consumption taken into account? The mentioned patch has a comment about init/clean for each SeqScan as an alternative solution. If we can't do that, how can we decrease memory consumption, and should we?