datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Spilling aggregation consumes more memory than specified memory limit

Open 2010YOUY01 opened this issue 1 year ago • 0 comments

Describe the bug

Actual memory consumption remains high even if the memory limit is specified

To Reproduce

Compile and run datafusion-cli with:

# /usr/bin/time to measure actual memory usage
/usr/bin/time -l cargo run -- --mem-pool-type fair -m 1G -f 'aggr.sql'

And it showed max memory resident set size is ~400M

Next, set the memory limit to 200M and run it again, max RSS is still ~400M Use explain analyze can show that the spill actually happened And the query has ~1M groups, if changing it to 4 groups (see aggr.sql), max RSS is around 80M, the major memory consumer should be aggregation, so it's likely there is some issue with external aggregation implementation

aggr.sql

Change file path to parquet lineitem file generated by https://github.com/apache/datafusion/tree/main/benchmarks

-- aggr.sql
CREATE EXTERNAL TABLE IF NOT EXISTS lineitem (
        l_orderkey BIGINT,
        l_partkey BIGINT,
        l_suppkey BIGINT,
        l_linenumber INTEGER,
        l_quantity DECIMAL(15, 2),
        l_extendedprice DECIMAL(15, 2),
        l_discount DECIMAL(15, 2),
        l_tax DECIMAL(15, 2),
        l_returnflag VARCHAR,
        l_linestatus VARCHAR,
        l_shipdate DATE,
        l_commitdate DATE,
        l_receiptdate DATE,
        l_shipinstruct VARCHAR,
        l_shipmode VARCHAR,
        l_comment VARCHAR,
) STORED AS parquet
LOCATION '/Users/yongting/Desktop/code/my_datafusion/arrow-datafusion/benchmarks/data/tpch_sf1/lineitem';

SELECT
count(*)
FROM lineitem
GROUP BY l_suppkey, l_partkey, l_returnflag, l_linestatus; -- cardinality is ~36% of lineitem
-- GROUP BY l_returnflag, l_linestatus -- 4 groups

Expected behavior

No response

Additional context

No response

2010YOUY01 avatar Oct 15 '24 10:10 2010YOUY01