
[Bug] Memory Leak in BE (2.1.1)

Open IanMeta opened this issue 10 months ago • 3 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

Version

2.1.1

What's Wrong?

The memory usage in BE nodes is not going down despite having no queries, eventually causing OOM errors.

Process Memory Summary:
    os physical memory 62.48 GB. process memory used 54.59 GB, limit 56.23 GB, soft limit 50.61 GB. sys available memory 7.94 GB, low water mark 1.60 GB, warning water mark 3.20 GB. Refresh interval memory growth 0 B
Memory Tracker Summary:
    Type=experimental, Used=0(0 B), Peak=0(0 B)
    Type=clone, Used=0(0 B), Peak=0(0 B)
    Type=schema_change, Used=0(0 B), Peak=0(0 B)
    Type=compaction, Used=0(0 B), Peak=266.05 MB(278973334 B)
    Type=load, Used=1.02 MB(1068375 B), Peak=796.93 MB(835637784 B)
    Type=query, Used=1.67 GB(1788834344 B), Peak=9.56 GB(10261733312 B)
    Type=global, Used=10.91 GB(11719054937 B), Peak=11.12 GB(11941355158 B)
    Type=tc/jemalloc cache, Used=2.95 GB(3166464064 B), Peak=-1.00 B(-1 B)
    Type=sum of all trackers, Used=15.53 GB(16675421720 B), Peak=-1.00 B(-1 B)
    Type=process resident memory, Used=54.59 GB(58613862400 B), Peak=59.54 GB(63934939136 B)
    Type=process virtual memory, Used=125.11 GB(134335639552 B), Peak=125.26 GB(134498238464 B)
    MemTrackerLimiter Label=Orphan, Type=global, Limit=-1.00 B(-1 B), Used=-26.73 MB(-28027133 B), Peak=303.96 MB(318723809 B)
    MemTracker Label=PageNoCache, Parent Label=Orphan, Used=0(0 B), Peak=5.54 MB(5812293 B)
    MemTracker Label=IOBufBlockMemory, Parent Label=Orphan, Used=80.47 MB(84377600 B), Peak=379.45 MB(397885440 B)
    MemTracker Label=OlapTablePartitionParam, Parent Label=Orphan, Used=37.81 KB(38717 B), Peak=37.81 KB(38717 B)
    MemTracker Label=OlapTablePartitionParam, Parent Label=Orphan, Used=12.46 KB(12758 B), Peak=12.46 KB(12758 B)
    MemTrackerLimiter Label=DataPageCache[size], Type=global, Limit=-1.00 B(-1 B), Used=10.08 GB(10820010657 B), Peak=10.09 GB(10831068881 B)
    MemTrackerLimiter Label=IndexPageCache[size], Type=global, Limit=-1.00 B(-1 B), Used=875.23 MB(917741807 B), Peak=1.05 GB(1125460240 B)
    MemTrackerLimiter Label=PKIndexPageCache[size], Type=global, Limit=-1.00 B(-1 B), Used=8.90 MB(9329606 B), Peak=8.90 MB(9329881 B)
    MemTrackerLimiter Label=PointQueryRowCache[size], Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
    MemTrackerLimiter Label=SegmentCache[number], Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
    MemTrackerLimiter Label=SchemaCache[number], Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
    MemTrackerLimiter Label=CommonObjLRUCache[number], Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
    MemTrackerLimiter Label=PointQueryLookupConnectionCache[size], Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
    MemTrackerLimiter Label=InvertedIndexSearcherCache[size], Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
    MemTrackerLimiter Label=InvertedIndexQueryCache[size], Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
    MemTrackerLimiter Label=LastSuccessChannelCache[size], Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
    MemTrackerLimiter Label=TabletSchemaCache[number], Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
    MemTrackerLimiter Label=MowTabletVersionCache[number], Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
    MemTrackerLimiter Label=CreateTabletRRIdxCache[number], Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
    MemTrackerLimiter Label=MowDeleteBitmapAggCache[size], Type=global, Limit=-1.00 B(-1 B), Used=0(0 B), Peak=0(0 B)
    MemTrackerLimiter Label=Query#Id=ac081d7797d48cd-a56f3704e889aa06, Type=query, Limit=4.00 GB(4294967296 B), Used=1.07 GB(1147333896 B), Peak=1.07 GB(1147333896 B)
    MemTrackerLimiter Label=Query#Id=6c82355087b344f8-95ca79062bd3f9ab, Type=query, Limit=4.00 GB(4294967296 B), Used=415.44 MB(435619088 B), Peak=542.06 MB(568386384 B)
    MemTrackerLimiter Label=Query#Id=e034ec09164448d6-ab0162806f12a6ce, Type=query, Limit=4.00 GB(4294967296 B), Used=151.81 MB(159184496 B), Peak=196.45 MB(205996712 B)
    MemTrackerLimiter Label=Query#Id=e999d27ee8564b55-88fcf8a5ce674a4c, Type=query, Limit=4.00 GB(4294967296 B), Used=9.78 MB(10258400 B), Peak=387.29 MB(406098440 B)
    MemTrackerLimiter Label=Query#Id=f0a61bf0794a431c-b240dc126b70f3bd, Type=query, Limit=4.00 GB(4294967296 B), Used=2.86 MB(2998880 B), Peak=2.86 MB(2998880 B)
    MemTrackerLimiter Label=Query#Id=f8ed6a05972d46ce-befc122aaeec737a, Type=query, Limit=4.00 GB(4294967296 B), Used=2.66 MB(2785984 B), Peak=2.66 MB(2785984 B)
    MemTrackerLimiter Label=Load#Id=dba1f487c9984fc7-a3a2ca1b00858c1b, Type=load, Limit=2.00 GB(2147483648 B), Used=1.02 MB(1067040 B), Peak=1.02 MB(1067040 B)
    MemTrackerLimiter Label=Query#Id=c6ce6322eb8c4972-ba7493587bd4e4fd, Type=query, Limit=4.00 GB(4294967296 B), Used=1.02 MB(1066464 B), Peak=1.02 MB(1066464 B)
    MemTrackerLimiter Label=Query#Id=3d2c80eadedd4064-90742fed62dac36c, Type=query, Limit=4.00 GB(4294967296 B), Used=568.16 KB(581792 B), Peak=36.08 MB(37831168 B)
    MemTrackerLimiter Label=Query#Id=dc785dcc9e53482b-848ba83a6db1d758, Type=query, Limit=4.00 GB(4294967296 B), Used=52.28 KB(53536 B), Peak=52.28 KB(53536 B)
    MemTrackerLimiter Label=Query#Id=34f95a4627fe4b1d-bbede7c6fd4592a1, Type=query, Limit=4.00 GB(4294967296 B), Used=52.28 KB(53536 B), Peak=52.28 KB(53536 B)
    MemTrackerLimiter Label=Query#Id=5d59571ce9344bc9-ab1fcc143afb6274, Type=query, Limit=4.00 GB(4294967296 B), Used=52.28 KB(53536 B), Peak=52.28 KB(53536 B)
    MemTrackerLimiter Label=Query#Id=1f76a85adba642a0-95ee7b617ccdda48, Type=query, Limit=4.00 GB(4294967296 B), Used=52.28 KB(53536 B), Peak=52.28 KB(53536 B)
    MemTrackerLimiter Label=Query#Id=d57769bc809e447a-93d128cf73c5c4ed, Type=query, Limit=4.00 GB(4294967296 B), Used=52.28 KB(53536 B), Peak=52.28 KB(53536 B)
    MemTrackerLimiter Label=Query#Id=527029cc34fb416d-a1526defe8f5ac7f, Type=query, Limit=4.00 GB(4294967296 B), Used=52.28 KB(53536 B), Peak=52.28 KB(53536 B)

According to https://doris.apache.org/docs/admin-manual/maint-monitor/memory-management/memory-tracker, process memory should equal the sum of the other memory types. However, mem_tracker shows 20+ GB of memory unaccounted for when comparing process resident memory with the sum of all trackers. Can anyone explain how to interpret this?

image
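To make the comparison concrete, here is a minimal Python sketch that parses the per-type lines of the Memory Tracker Summary and computes the gap between process resident memory and the sum of all trackers. The byte counts are copied from the dump above; the regular expression and the helper name `parse_used_bytes` are illustrative assumptions, not part of Doris.

```python
import re

# Excerpt of the "Memory Tracker Summary" lines quoted above (byte counts copied verbatim).
SUMMARY = """\
Type=compaction, Used=0(0 B), Peak=266.05 MB(278973334 B)
Type=load, Used=1.02 MB(1068375 B), Peak=796.93 MB(835637784 B)
Type=query, Used=1.67 GB(1788834344 B), Peak=9.56 GB(10261733312 B)
Type=global, Used=10.91 GB(11719054937 B), Peak=11.12 GB(11941355158 B)
Type=tc/jemalloc cache, Used=2.95 GB(3166464064 B), Peak=-1.00 B(-1 B)
Type=sum of all trackers, Used=15.53 GB(16675421720 B), Peak=-1.00 B(-1 B)
Type=process resident memory, Used=54.59 GB(58613862400 B), Peak=59.54 GB(63934939136 B)
"""

# Captures the tracker type and the exact byte count inside "Used=<human>(<bytes> B)".
LINE_RE = re.compile(r"Type=(?P<type>[^,]+), Used=[^(]*\((?P<used>-?\d+) B\)")


def parse_used_bytes(text):
    """Return {tracker type: used bytes} for every 'Type=...' line in text."""
    return {m["type"]: int(m["used"]) for m in LINE_RE.finditer(text)}


used = parse_used_bytes(SUMMARY)
resident = used.pop("process resident memory")
tracked = used.pop("sum of all trackers")
gap = resident - tracked  # memory the trackers do not account for

GIB = 1024 ** 3  # the dump's "GB" figures are binary units
print(f"sum of component types: {sum(used.values()) / GIB:.2f} GiB")
print(f"reported tracker sum:   {tracked / GIB:.2f} GiB")
print(f"process resident:       {resident / GIB:.2f} GiB")
print(f"untracked gap:          {gap / GIB:.2f} GiB")
```

On the figures in this dump, the component types add up exactly to the reported "sum of all trackers" (15.53 GB), while resident memory is 54.59 GB, so the difference is roughly 39 GB in this particular snapshot.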

From the BE statistics in our Grafana monitoring, we can see that some memory is released after a burst of queries, but RAM usage settles at a level slightly higher than before. Over time it trends upward until the node runs out of memory.

image

What You Expected?

Memory levels should return to normal when there are no queries.

How to Reproduce?

We currently do not have the server resources to reproduce the exact setup, but I believe the issue can be reproduced by running stress tests in a production-like environment on Doris 2.1.1.

Anything Else?

We also tried setting memory_mode=compact in be.conf, but the problem persists. Currently, we restart the BE nodes periodically to clear previously unreleased cache.

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

IanMeta avatar Apr 26 '24 07:04 IanMeta

Please try 2.1.2 because there is a known memory leak in 2.1.1 during aggregation.

yiguolei avatar Apr 26 '24 07:04 yiguolei

Please try 2.1.2 because there is a known memory leak in 2.1.1 during aggregation.

We just upgraded to Doris 2.1.2 and still observed this trend (see image below). image

Previously, we used version 2.0.0 and memory usage was never above 60% at idle. Is the newer version intended to use more RAM, or is this still a memory leak? Are there any configuration options we can try?

IanMeta avatar Apr 29 '24 08:04 IanMeta

@yiguolei I have the same issue. version: 2.1.2

Power098 avatar Apr 30 '24 10:04 Power098

Pasting my BE config java options here, just in case it's relevant.

JAVA_OPTS="-Xmx4096m -XX:-UseGCOverheadLimit -DlogPath=$DORIS_HOME/log/jni.log -Xloggc:$DORIS_HOME/log/be.gc.log.$CUR_DATE -Djavax.security.auth.useSubjectCredsOnly=false -Dsun.security.krb5.debug=true -Dsun.java.command=DorisBE -XX:-CriticalJNINatives -DJDBC_MIN_POOL=1 -DJDBC_MAX_POOL=100 -DJDBC_MAX_IDLE_TIME=300000 -DJDBC_MAX_WAIT_TIME=5000"

IanMeta avatar May 03 '24 10:05 IanMeta

@yiguolei I have the same issue. version: 2.1.2

me too

kingsylin avatar Jun 02 '24 15:06 kingsylin

Me too, on versions 2.1.3 and 2.1.4.

INNOCENT-BOY avatar Jun 21 '24 10:06 INNOCENT-BOY

This bug should be fixed in 2.1.4. If the memory leak still exists in 2.1.4, please provide more information.

Gabriel39 avatar Jun 27 '24 01:06 Gabriel39

Hi @Gabriel39, my case is different from this issue. I created a PR to resolve it; please help review it: https://github.com/apache/doris/pull/36966

INNOCENT-BOY avatar Jun 28 '24 02:06 INNOCENT-BOY

Hi @kingsylin, have you upgraded to 2.1.4? Has the problem improved?

PeatBoy avatar Jun 29 '24 15:06 PeatBoy

Doris version 2.1.5 has been released. Has anyone tested the new version to see if the problem is resolved?

patrickdung2022 avatar Jul 26 '24 02:07 patrickdung2022

Has the problem improved

Yes, I have upgraded to 2.1.4, but the problem is still there.

kingsylin avatar Aug 08 '24 08:08 kingsylin