gohangout gohangout 内存占用异常

版本 1.7.8

有时候会去到 7G 10G，docker stats 后面会降下来，这样内存会很不好限制，如果时从kafka 取了过多数据，能不能反压呢，不然通过cgroup 限制内存容器忙时会挂掉后起不来，不限制又影响宿主资源分配，pprof 默认没开，dump 了一个3g 使用内存的出来

input 这里有另外两个topic ，dump 后我删掉了，这两个本身topic 已经没有任何数据了

inputs:
    - Kafka:
        topic:
            saimaan_g18_logtail: 1
        codec: 'json'
        consumer_settings:
            bootstrap.servers: "saimaan.brokers.canal.netease.com:9092"
            group.id: 'logstash_g18_logtail'
            #auto.commit: 'false'
filters:
    - Convert:
        fields:
            batch_offset:
                to: int
            batch_ts:
                to: int
outputs:
    - Elasticsearch:
        sniff:
            refresh_interval: 60
            match: 'EQ($.attributes.zone,"hot")'
        hosts:
            - 'http://7.32.146.90:9200'
            - 'http://7.32.146.117:9200'
        index: 'g18_logtail_write'
        index_type: '_doc'
        concurrent: 4
        bulk_actions: 36000
        bulk_size: 30
        flush_interval: 60

May 11 '22 10:05 zcola

3.zip

May 11 '22 10:05 zcola

@zcola --worker 给的多少？ topic 有几个 partition？

May 12 '22 07:05 childe

看了一下 Dump 出来的数据，如下，内存使用1G。可以看到最大内存使用还是 ES Output 里面要Bulk的数据，共 500M，占50%左右。 500/30/(4+1) = 3.3 , 估计是你开了 4 个 worker ?

另外一个是 JsonDecode，也占了近50%，这里是使用的另外一个开源库 github.com/json-iterator/go，这一块还不太清楚内存的具体占用为什么这么多。pprof 里面使用 web 命令可以看到是 decode kafka input。

go tool pprof 3
File: gohangout
Type: inuse_space
Time: May 11, 2022 at 4:31pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1071.93MB, 97.24% of 1102.41MB total
Dropped 35 nodes (cum <= 5.51MB)
Showing top 10 nodes out of 52
      flat  flat%   sum%        cum   cum%
     496MB 44.99% 44.99%      496MB 44.99%  github.com/childe/gohangout/output.newElasticsearchOutput.func2
  203.72MB 18.48% 63.47%   203.72MB 18.48%  reflect.mapassign
   95.53MB  8.67% 72.14%   116.03MB 10.52%  github.com/json-iterator/go.(*Iterator).Read.func2
   85.50MB  7.76% 79.89%   136.58MB 12.39%  github.com/json-iterator/go.(*Iterator).ReadString
      52MB  4.72% 84.61%   250.60MB 22.73%  github.com/json-iterator/go.(*Iterator).Read
   51.08MB  4.63% 89.24%    51.08MB  4.63%  github.com/json-iterator/go.(*Iterator).readStringSlowPath
   26.60MB  2.41% 91.66%    26.60MB  2.41%  github.com/childe/healer.(*Broker).requestStreamingly
   25.51MB  2.31% 93.97%    25.51MB  2.31%  bytes.makeSlice
      24MB  2.18% 96.15%   286.21MB 25.96%  github.com/childe/gohangout/output.(*ElasticsearchOutput).Emit
      12MB  1.09% 97.24%   510.83MB 46.34%  github.com/childe/gohangout/codec.(*JsonDecoder).Decode
(pprof)

May 12 '22 08:05 childe

至于为什么显示占用了3G内存，应该是GO回收之后的内存没有马上还给操作系统，而是留着准备以后用。我还没有查到什么情况下会还给操作系统。使用 Cgroup给相应的限制，可能会触发还给内存的行为（不确定，还不太明白）？但Cgroup要设置好，否则可能就OOM了。看了一些资料，比如 https://github.com/golang/go/issues/33376

May 12 '22 08:05 childe

--worker

64 的kafka partition ，hangout 4 个worker

May 13 '22 03:05 zcola

现在就是cgroup 内存我不敢设置啊，假设我设置个4g，忙时用到 7G 多，oom 掉后进程一起来就oom

May 13 '22 03:05 zcola

而且我发现从4个worker 改为1个（去掉 --worker 4 用默认），内存占用一样会很高

May 20 '22 08:05 zcola

4.zip

May 20 '22 08:05 zcola

 go tool pprof http://127.0.0.1:8899/debug/pprof/heap
Fetching profile over HTTP from http://127.0.0.1:8899/debug/pprof/heap
Saved profile in /root/pprof/pprof.gohangout.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
File: gohangout
Type: inuse_space
Time: May 20, 2022 at 8:40am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1106.17MB, 97.33% of 1136.49MB total
Dropped 18 nodes (cum <= 5.68MB)
Showing top 10 nodes out of 47
      flat  flat%   sum%        cum   cum%
  361.89MB 31.84% 31.84%   361.89MB 31.84%  reflect.mapassign
     279MB 24.55% 56.39%      279MB 24.55%  github.com/childe/gohangout/output.newElasticsearchOutput.func2
  156.01MB 13.73% 70.12%   195.62MB 17.21%  github.com/json-iterator/go.(*Iterator).ReadString
      92MB  8.10% 78.21%   324.15MB 28.52%  github.com/json-iterator/go.(*Iterator).Read
   81.52MB  7.17% 85.39%   117.52MB 10.34%  github.com/json-iterator/go.(*Iterator).Read.func2
   39.62MB  3.49% 88.87%    39.62MB  3.49%  github.com/json-iterator/go.(*Iterator).readStringSlowPath
      34MB  2.99% 91.87%   325.31MB 28.62%  github.com/childe/gohangout/output.(*ElasticsearchOutput).Emit
   23.91MB  2.10% 93.97%    23.91MB  2.10%  bytes.makeSlice
   20.21MB  1.78% 95.75%    20.21MB  1.78%  github.com/childe/healer.(*Broker).requestStreamingly
      18MB  1.58% 97.33%   760.04MB 66.88%  github.com/childe/gohangout/codec.(*JsonDecoder).Decode

May 20 '22 08:05 zcola

同一个消费组，用了5g 内存的

 go tool pprof 127.0.0.1:8899/debug/pprof/heap
Fetching profile over HTTP from http://127.0.0.1:8899/debug/pprof/heap
Saved profile in /root/pprof/pprof.gohangout.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
File: gohangout
Type: inuse_space
Time: May 20, 2022 at 9:37am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 2.10GB, 96.34% of 2.18GB total
Dropped 21 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 52
      flat  flat%   sum%        cum   cum%
    0.65GB 29.89% 29.89%     0.65GB 29.89%  reflect.mapassign
    0.54GB 25.02% 54.91%     0.54GB 25.02%  github.com/childe/gohangout/output.newElasticsearchOutput.func2
    0.32GB 14.73% 69.64%     0.38GB 17.49%  github.com/json-iterator/go.(*Iterator).ReadString
    0.18GB  8.18% 77.82%     0.26GB 12.04%  github.com/json-iterator/go.(*Iterator).Read.func2
    0.17GB  7.60% 85.42%     0.65GB 29.67%  github.com/json-iterator/go.(*Iterator).Read
    0.08GB  3.50% 88.92%     0.67GB 30.79%  github.com/childe/gohangout/output.(*ElasticsearchOutput).Emit
    0.06GB  2.77% 91.68%     0.06GB  2.77%  github.com/json-iterator/go.(*Iterator).readStringSlowPath
    0.06GB  2.62% 94.31%     0.06GB  2.62%  bytes.makeSlice
    0.03GB  1.43% 95.74%     1.44GB 66.02%  github.com/childe/gohangout/codec.(*JsonDecoder).Decode
    0.01GB   0.6% 96.34%     0.01GB   0.6%  github.com/childe/healer.(*Broker).requestStreamingly

May 20 '22 09:05 zcola

同一个消费组，用了5g 内存的

 go tool pprof 127.0.0.1:8899/debug/pprof/heap
Fetching profile over HTTP from http://127.0.0.1:8899/debug/pprof/heap
Saved profile in /root/pprof/pprof.gohangout.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
File: gohangout
Type: inuse_space
Time: May 20, 2022 at 9:37am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 2.10GB, 96.34% of 2.18GB total
Dropped 21 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 52
      flat  flat%   sum%        cum   cum%
    0.65GB 29.89% 29.89%     0.65GB 29.89%  reflect.mapassign
    0.54GB 25.02% 54.91%     0.54GB 25.02%  github.com/childe/gohangout/output.newElasticsearchOutput.func2
    0.32GB 14.73% 69.64%     0.38GB 17.49%  github.com/json-iterator/go.(*Iterator).ReadString
    0.18GB  8.18% 77.82%     0.26GB 12.04%  github.com/json-iterator/go.(*Iterator).Read.func2
    0.17GB  7.60% 85.42%     0.65GB 29.67%  github.com/json-iterator/go.(*Iterator).Read
    0.08GB  3.50% 88.92%     0.67GB 30.79%  github.com/childe/gohangout/output.(*ElasticsearchOutput).Emit
    0.06GB  2.77% 91.68%     0.06GB  2.77%  github.com/json-iterator/go.(*Iterator).readStringSlowPath
    0.06GB  2.62% 94.31%     0.06GB  2.62%  bytes.makeSlice
    0.03GB  1.43% 95.74%     1.44GB 66.02%  github.com/childe/gohangout/codec.(*JsonDecoder).Decode
    0.01GB   0.6% 96.34%     0.01GB   0.6%  github.com/childe/healer.(*Broker).requestStreamingly

看起来重启都没啥用，立刻就5g 了

May 20 '22 09:05 zcola

我查了一些文档，早期的Golang版本是感知不到Cgroup的内存限制的。你是下载的我编译好的二进制？还是自己编译的？你的Golang 版本是多少？

May 21 '22 07:05 childe

http://127.0.0.1:8899/debug/pprof/heap?debug=1

这个也看一下，页面最下面的 runtime.MemStats 数据是多少？发来看一下吧。

May 21 '22 07:05 childe

CONTAINER ID   NAME                                              CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O    PIDS
2b8338b93279   hangout-g18_logtail.1.gn4i9jgw8xz83jwizop5a62pu   342.64%   918.7MiB / 125.5GiB   0.72%     12.6TB / 2.65TB   274kB / 0B   73
ps aux|grep g18_logtail
root     35970  0.0  0.0  13544  2676 pts/0    S+   14:44   0:00 grep g18_logtail
root     38846  236  4.7 7201488 6269212 ?     Ssl  5月11 40360:41 gohangout -pprof=true --worker 4 --config /mnt/hangout/g18_logtail.yml -logtostderr

# runtime.MemStats
# Alloc = 499093216
# TotalAlloc = 458898950038728
# Sys = 6713593424
# Lookups = 0
# Mallocs = 2001941848405
# Frees = 2001937800037
# HeapAlloc = 499093216
# HeapSys = 6369116160
# HeapIdle = 5596094464
# HeapInuse = 773021696
# HeapReleased = 4326498304
# HeapObjects = 4048368
# Stack = 6225920 / 6225920
# MSpan = 9607448 / 73842688
# MCache = 83328 / 98304
# BuckHashSys = 2286772
# GCSys = 247975824
# OtherSys = 14047756
# NextGC = 887594016
# LastGC = 1653288332185470574

May 23 '22 06:05 zcola

我查了一些文档，早期的Golang版本是感知不到Cgroup的内存限制的。你是下载的我编译好的二进制？还是自己编译的？你的Golang 版本是多少？

自己编译的（go 1.15.7），你这么说有点说的通，go 没有还内存给linux 。即使用不上了

May 23 '22 06:05 zcola

同一个消费组，用了5g 内存的

 go tool pprof 127.0.0.1:8899/debug/pprof/heap
Fetching profile over HTTP from http://127.0.0.1:8899/debug/pprof/heap
Saved profile in /root/pprof/pprof.gohangout.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
File: gohangout
Type: inuse_space
Time: May 20, 2022 at 9:37am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 2.10GB, 96.34% of 2.18GB total
Dropped 21 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 52
      flat  flat%   sum%        cum   cum%
    0.65GB 29.89% 29.89%     0.65GB 29.89%  reflect.mapassign
    0.54GB 25.02% 54.91%     0.54GB 25.02%  github.com/childe/gohangout/output.newElasticsearchOutput.func2
    0.32GB 14.73% 69.64%     0.38GB 17.49%  github.com/json-iterator/go.(*Iterator).ReadString
    0.18GB  8.18% 77.82%     0.26GB 12.04%  github.com/json-iterator/go.(*Iterator).Read.func2
    0.17GB  7.60% 85.42%     0.65GB 29.67%  github.com/json-iterator/go.(*Iterator).Read
    0.08GB  3.50% 88.92%     0.67GB 30.79%  github.com/childe/gohangout/output.(*ElasticsearchOutput).Emit
    0.06GB  2.77% 91.68%     0.06GB  2.77%  github.com/json-iterator/go.(*Iterator).readStringSlowPath
    0.06GB  2.62% 94.31%     0.06GB  2.62%  bytes.makeSlice
    0.03GB  1.43% 95.74%     1.44GB 66.02%  github.com/childe/gohangout/codec.(*JsonDecoder).Decode
    0.01GB   0.6% 96.34%     0.01GB   0.6%  github.com/childe/healer.(*Broker).requestStreamingly

看起来重启都没啥用，立刻就5g 了

不过感觉这样分成了两个问题

高压力期间，消耗内存是比较多，如果通过cgroup 限制内存，gohangout 进程会一直都被oom 掉
低压力期间，内存没有回收，影响宿主资源分配调度

也有可能解决了问题二就把问题一解决了，logstash 没有这个问题也在于堆内存是固定的，超了就oom ，实践起来没出现过因为ES 压力过大导致 oom，只有input 上游传了一些特别长的日志，filter 效率低下导致oom

May 23 '22 06:05 zcola

你是什么版本的Golang？换比较新的版本再编译一下看？

我印象里面，我们这边从来没有遇到OOM的问题。

我自己现在不搞这块了，我去测一下 1.17 Golang 编译出来的版本，给一个Cgroup 限制会是啥现象。

May 26 '22 09:05 childe

看了一下我们线上的实例，也有使用很多内存的，没有开 pprof，不确定是用在哪里了，和你那边的是不是一样。

看了一下我们这边没有给Limit。

May 26 '22 09:05 childe

Golang 可以交还内存给操作系统。https://github.com/golang/go/issues/42354 一起学习下~

May 30 '22 06:05 childe

Golang 可以交还内存给操作系统。golang/go#42354 一起学习下~

就是1.17 肯定没有这个问题呗，如果设置了cgroup memory

May 31 '22 02:05 zcola

Golang 可以交还内存给操作系统。golang/go#42354 一起学习下~

就是1.17 肯定没有这个问题呗，如果设置了cgroup memory

1.12 就可以了，但使用的 MADV_FREE 。这个会导致 rss 看起来占用多。你可以使用 GODEBUG=madvdontneed=1 ，这样golang 会使用 MADV_DONTNEED 归还内存，这个会马上释放掉 rss 。

但什么时候会释放，还不知道~

To revert to the Go 1.11 behavior (MADV_DONTNEED), set the environment variable GODEBUG=madvdontneed=1.

May 31 '22 02:05 childe

但是，你的Golang现在heapuse哪怕只有1G，但RSS占用了5G内存，说明至少曾经一定要使用这么多，如果给 4G limit 的话，还是会OOM吧。

Jun 02 '22 02:06 childe

但是，你的Golang现在heapuse哪怕只有1G，但RSS占用了5G内存，说明至少曾经一定要使用这么多，如果给 4G limit 的话，还是会OOM吧。

就是我之前说的两个问题。推理下可能还是无法解决，类推到java ，java 会gc ，内存不够就gc 频繁些，运行效率变低，也有可能来不及导致oom，只能测试下了

高压力期间，消耗内存是比较多，如果通过cgroup 限制内存，gohangout 进程会一直都被oom 掉
低压力期间，内存没有回收，影响宿主资源分配调度

Jun 06 '22 07:06 zcola

但是，你的Golang现在heapuse哪怕只有1G，但RSS占用了5G内存，说明至少曾经一定要使用这么多，如果给 4G limit 的话，还是会OOM吧。

就是我之前说的两个问题。推理下可能还是无法解决，类推到java ，java 会gc ，内存不够就gc 频繁些，运行效率变低，也有可能来不及导致oom，只能测试下了

高压力期间，消耗内存是比较多，如果通过cgroup 限制内存，gohangout 进程会一直都被oom 掉

低压力期间，内存没有回收，影响宿主资源分配调度

第二个问题，会回收。Golang Runtime 会使用 MADV_FREE 将内存还给操作系统。

第一个问题，如果是不能被回收的内存使用较多，的确会OOM。不过看上面 pprof 的结果，可以大概推算出使用的内存量是多少。 worker * (1+output_concurrt) * bulk_size * 4

你可以再乘一个 1.5 给个冗余量。

Jun 06 '22 07:06 childe

另外就是，json-iterator 这个库使用了大量的内存，不知道最新版本是不是有优化，可以升级一下看。如果还是不行，可以看一下原生的Json库（速度会慢一些）

Jun 06 '22 07:06 childe

第二个问题，会回收。Golang Runtime 会使用 MADV_FREE 将内存还给操作系统。实测会，如果设置的太低的确会oom。比如docker stats 输出 1.7g 你限制到 2g 就oom了，实际上我们可以给多一点内存

你可以再乘一个 1.5 给个冗余量。

那就是 3600m，可以接受，我想表达的是如果是es 有瓶颈了，能反压不从kakfa 取新数据不？不知道现在有没有这个机制，反正到最后闲下来内存都是可以回收的，通过docker stats 看。

Jun 06 '22 08:06 zcola

第二个问题，会回收。Golang Runtime 会使用 MADV_FREE 将内存还给操作系统。实测会，如果设置的太低的确会oom。比如docker stats 输出 1.7g 你限制到 2g 就oom了，实际上我们可以给多一点内存

你可以再乘一个 1.5 给个冗余量。

那就是 3600m，可以接受，我想表达的是如果是es 有瓶颈了，能反压不从kakfa 取新数据不？不知道现在有没有这个机制，反正到最后闲下来内存都是可以回收的，通过docker stats 看。

如果ES卡住了，是不会继续消费 Kafka 的。这中间有一个队列长度，就是 bulk_actions * worker * concurrent . 这里的 bulk_actions 也可能是 bulk_size，看谁先到达。

Jun 06 '22 12:06 childe

比如docker stats 输出 1.7g 你限制到 2g 就oom了

关于这个问题，可能是 docker stats 和 cgroup memory controller 的统计口径不一样？比如Page Cache， Share Memory 是不是统计在内，等。

Jun 06 '22 12:06 childe

和我遇到的情况类似，我做了以下几个操作

降低work数，目前是1
降低bulk_size 我从15M->5M
调整es参数 "index.merge.policy.max_merge_at_once": "5", "index.merge.policy.segments_per_tier": "50"
如果使用的client节点，看看是否client节点性能不够了

Jun 07 '22 02:06 MrLiukang

比如docker stats 输出 1.7g 你限制到 2g 就oom了

关于这个问题，可能是 docker stats 和 cgroup memory controller 的统计口径不一样？比如Page Cache， Share Memory 是不是统计在内，等。

这个已知， containers/81c8f70aebad/stats 通过这个api 获得是 7个g 和宿主一致

Jun 08 '22 09:06 zcola

gohangout gohangout copied to clipboard

gohangout 内存占用异常

gohangout
gohangout copied to clipboard