Xray-core
Memory leak
There is / has been a memory leak issue since at least version 1.8.9. I have tested the latest release, and it leaks at a rate of about 1.7 GB for every 100 GB of proxied traffic, over a time frame of 24 hours and with 50 active users. I have not run experiments to correlate these numbers with each other, but it seems like the amount of leaked memory is proportional to the served traffic. There is also no ceiling to this; it just grows until systemd-oomd kills the process. I have tried using vless only, vless+vmess, and vless+trojan, and avoided the newly added transports, but to no avail. I would be more than happy to experiment more and provide the team with more information. I am running xray behind haproxy, and that behind the Cloudflare CDN.
OS: Ubuntu 22.04, all packages up-to-date.
Architecture: Armv8, Neoverse-N1 at Hetzner. Also generic x86 platforms.
Memory: 4GB
Kernel: Generic Ubuntu 5.15.0-xxx
Logging: Disabled
heap.zip
Here is the result of heap profiling using pprof. Memory usage was about 800 MB when I took this profile.
@er888kh what kind of transport do you use? I know that gRPC has a memory leak in 1.8.0 and above, but WS is rather fine in my experience.
same problem , in my experience , version 1.8.4 is fine , and after that version , xray will leak
Please test the commits between v1.8.4 and v1.8.6 one by one to see which commit introduced this problem.
@er888kh Is it the same on your end?
heap.zip Here is the result of heap profiling using pprof. Memory usage was about 800MB when I took this profile.
From this pprof, the biggest consumer is the readv reader, at 220MB, although I'm not sure if this is indeed a leak. Can you profile a bigger usage? (Or reduce the buffer size to isolate the issue.)
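For anyone trying the "reduce the buffer size" suggestion: the per-connection buffer can be lowered through the policy section of the server config. A minimal sketch (per the Xray policy docs, `bufferSize` is in kB, and level `0` is assumed to be the level your users are assigned to):

```json
{
  "policy": {
    "levels": {
      "0": {
        "bufferSize": 4
      }
    }
  }
}
```

This fragment would be merged into your existing config; if memory growth changes noticeably with a small buffer, that helps isolate whether the readv reader allocation is the leak or just normal buffering.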
Same issue on MT7981A (ARM64, Cortex-A53) with 512MB RAM. I have tried 1.8.4 and 1.8.10; both consume all RAM on the device after the first speedtest run and then hang. I switched to sing-box 1.8.10 without changing anything on the server side (xray-core 1.8.10) and have no issues with memory consumption (~180 MB RAM after 10 and more tests).
P.S. buffer-size is set to 4
Same problem when running speedtest with chain-proxy config and using Xray-Core
Same problem
My VM faced an OS crash yesterday because of OOM.
Here is the screenshot of the console after the crash.
I never did profiling for xray, but could this be due to a memory leak?
protocol: vless-grpc
number of users: around 30/40
number of concurrent users : 10/15
OS : Debian 11 x64
physical memory: 1G
swap : off
xray : Xray 1.8.7 (Xray, Penetrates Everything.) 3f0bc13 (go1.21.5 linux/amd64)
total-vm: 2558796KB
anon-rss: 561376KB
file-rss: 0KB
which protocol has the least memory leak?
I did not have full console access, so I was not able to recover the OS; I had to rebuild it, and there are no further logs.
I updated to 1.8.11, and the leak seems to have been mitigated, while it was easy to reproduce when I used 1.8.10.
same problem on the latest version, 1.8.13.
I'm using version 1.8.16 and the problem is solved, except for shadowsocks2022, which still leaks.
There are similar reports in Marzban: https://github.com/Gozargah/Marzban/issues/1062 https://github.com/Gozargah/Marzban/issues/992 https://github.com/Gozargah/Marzban/issues/814
@M03ED just in case, what I was facing may not have been an OOM issue; I couldn't find any reason for it. But with another configuration (WS+VMESS) I have about 5k sockstat TCP metrics and there are no issues with connections.
I think this issue has become about too many things at once (socket/file-descriptor leak versus memory leak), and each of the reports is not very specific. I suggest attempting these things:
- reduce the number of inbounds/outbounds per xray process to pin the leak down to a specific transport/protocol. For example, if you have a node with 6 inbounds, try splitting up the node into two with 3 each; then, whichever produces a memory leak, split it up again. Try to arrive at a server JSON config that reproduces the issue reliably and is free of unnecessary things.
- in case of OOM, once you have a node with a minimal configuration, use pprof (like the OP) to produce a heap profile. You can follow https://xtls.github.io/en/config/metrics.html#pprof to configure it, although I don't know if it can be done easily in panels.
- rprx and yuhan already asked specific questions (please test all commits in the range; please profile a bigger usage), but I don't see a follow-up to them.
- if anything about the core developers' responses is unclear, please ask; don't just ignore it and post "same here".
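For reference, the pprof setup from the linked metrics docs boils down to a dokodemo-door inbound routed to a metrics outbound. A sketch to merge into your server config (the tags and port 11111 here are arbitrary; adjust as needed):

```json
{
  "metrics": {
    "tag": "metrics_out"
  },
  "inbounds": [
    {
      "tag": "metrics_in",
      "listen": "127.0.0.1",
      "port": 11111,
      "protocol": "dokodemo-door",
      "settings": {
        "address": "127.0.0.1"
      }
    }
  ],
  "routing": {
    "rules": [
      {
        "type": "field",
        "inboundTag": ["metrics_in"],
        "outboundTag": "metrics_out"
      }
    ]
  }
}
```

A heap profile can then be fetched with `go tool pprof http://127.0.0.1:11111/debug/pprof/heap`, or downloaded with curl and attached here like the OP's heap.zip.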
I checked the original v2ray repository and found the same bug: https://github.com/v2fly/v2ray-core/issues/3086
I used the patch from that bug, and it seems to have helped with the big leak; at least the process doesn't eat memory as aggressively. Tested on the 1.8.23 release code.
place to patch https://github.com/XTLS/Xray-core/blob/41d03d1856a5b16521610792601a20fb1195418e/transport/pipe/impl.go#L92
Applied patch:

func (p *pipe) ReadMultiBuffer() (buf.MultiBuffer, error) {
	for {
		data, err := p.readMultiBufferInternal()
		if data != nil || err != nil {
			p.writeSignal.Signal()
			return data, err
		}

		timer := time.NewTimer(15 * time.Minute)
		select {
		case <-p.readSignal.Wait():
		case <-p.done.Wait():
		case <-timer.C: // new add
			return nil, buf.ErrReadTimeout // new add
		case err = <-p.errChan:
			return nil, err
		}
		timer.Stop() // new add
	}
}
@boris768 Then it seems the real issue might be that some pipe is not correctly closed somewhere, or generally resources are not cleaned up in some inbound/outbound or transport. Unfortunately this patch has the ability to work around a variety of different issues, it doesn't really explain what's going on IMO
@mmmray, I tested this patch; it can be used as a hacky way to fix the memory leak. I am using a simple XTLS-Reality setup, and without the fix, 8-10 clients drive xray memory usage up to 700-800 MB in one day (OOM kills the service). With the fix, peak memory usage reaches 150 MB. At least it gives information about where the leak is; I hope it will help to fix it.
Is the problem solved?
Not completely, but the issue is missing too much information (for example, in which version was it introduced?), nobody has investigated it much, and it's not clear whether this is one issue or N separate issues. As a developer I also wouldn't know what to do with it right now. I guess let's reopen when somebody has managed to make another heap profile or some other discovery. I think the patch cannot help like this, to be honest.
