Xray-core
Memory leak
There is / has been a memory leak issue since at least version 1.8.9. I have tested the latest release, and it leaks at a rate of about 1.7 GB for every 100 GB of proxied traffic, over a time frame of 24 hours and with 50 active users. I have not run experiments to correlate these numbers with each other, but it seems like the amount of leaked memory is proportional to the served traffic. There is also no ceiling to this; it just grows until systemd-oomd kills the process. I have tried using vless only, vless+vmess, and vless+trojan, and avoided the newly added transports, but to no avail. I would be more than happy to experiment more and provide the team with more information. I am running xray behind haproxy, and that behind the Cloudflare CDN.
OS: Ubuntu 22.04, all packages up-to-date.
Architecture: Armv8, Neoverse-N1 at Hetzner. Also generic x86 platforms.
Memory: 4GB
Kernel: Generic Ubuntu 5.15.0-xxx
Logging: Disabled
heap.zip
Here is the result of heap profiling using pprof. Memory usage was about 800 MB when I took this profile.
@er888kh what kind of transport do you use? I know that gRPC has a memory leak in 1.8.0 and above, but WS is rather fine in my experience.
same problem , in my experience , version 1.8.4 is fine , and after that version , xray will leak
Please test the commits between v1.8.4 and v1.8.6 one by one to see which commit introduced this problem.
@er888kh Is it the same on your end?
heap.zip Here is the result of heap profiling using pprof. Memory usage was about 800MB when I took this profile.
From this pprof, the biggest consumer is the readv reader, at 220MB, although I'm not sure if this is indeed a leak. Can you profile a bigger usage? (Or reduce the buffer size to isolate the issue.)
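For anyone trying the "reduce the buffer size" suggestion: the per-connection buffer can be lowered through the policy section of the server config. A minimal sketch (per the Xray policy docs, `bufferSize` is in kB, and level `0` is assumed to be the level your users are assigned to):

```json
{
  "policy": {
    "levels": {
      "0": {
        "bufferSize": 4
      }
    }
  }
}
```

This fragment would be merged into your existing config; if memory growth changes noticeably with a small buffer, that helps isolate whether the readv reader allocation is the leak or just normal buffering.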
Same issue on MT7981A (ARM64, Cortex-A53) with 512MB RAM. I have tried 1.8.4 and 1.8.10; both consume all RAM on the device after the first speedtest run and then hang. I switched to sing-box 1.8.10 without changing anything on the server side (xray-core 1.8.10) and have no issues with memory consumption (~180 MB RAM after 10 and more tests).
P.S. buffer-size is set to 4
Same problem when running speedtest with chain-proxy config and using Xray-Core
Same problem
My VM faced an OS crash yesterday because of OOM.
Here is the screenshot of the console after the crash.
I never did profiling for xray, but could this be due to a memory leak?
protocol: vless-grpc
number of users: around 30/40
number of concurrent users : 10/15
OS : Debian 11 x64
physical memory: 1G
swap : off
xray : Xray 1.8.7 (Xray, Penetrates Everything.) 3f0bc13 (go1.21.5 linux/amd64)
total-vm: 2558796KB
anon-rss: 561376KB
file-rss: 0KB
which protocol has the least memory leak?
I did not have full console access, so I was not able to recover the OS; I had to rebuild it, and there are no further logs.
I updated to 1.8.11, and the leak seems to have been mitigated, while it was easy to reproduce when I used 1.8.10.
same problem on the latest version, 1.8.13.
I'm using version 1.8.16 and the problem is solved, except for shadowsocks2022, which still leaks.
There are similar reports in Marzban: https://github.com/Gozargah/Marzban/issues/1062 https://github.com/Gozargah/Marzban/issues/992 https://github.com/Gozargah/Marzban/issues/814
@M03ED just in case, what I was facing may not have been an OOM issue; I couldn't find any reason for it. But with another configuration (WS+VMESS) I have about 5k sockstat TCP metrics and there are no issues with connections.
I think this issue has become about too many things at once (socket/file-descriptor leak versus memory leak), and each of the reports is not very specific. I suggest attempting these things:
- reduce the number of inbounds/outbounds per xray process to pin the leak down to a specific transport/protocol. For example, if you have a node with 6 inbounds, try splitting up the node into two with 3 each; then, whichever produces a memory leak, split it up again. Try to arrive at a server JSON config that reproduces the issue reliably and is free of unnecessary things.
- in case of OOM, once you have a node with a minimal configuration, use pprof (like the OP) to produce a heap profile. You can follow https://xtls.github.io/en/config/metrics.html#pprof to configure it, although I don't know if it can be done easily in panels.
- rprx and yuhan already asked specific questions (please test all commits in the range; please profile a bigger usage), but I don't see a follow-up to them.
- if anything about the core developers' responses is unclear, please ask; don't just ignore it and post "same here".
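For reference, the pprof setup from the linked metrics docs boils down to a dokodemo-door inbound routed to a metrics outbound. A sketch to merge into your server config (the tags and port 11111 here are arbitrary; adjust as needed):

```json
{
  "metrics": {
    "tag": "metrics_out"
  },
  "inbounds": [
    {
      "tag": "metrics_in",
      "listen": "127.0.0.1",
      "port": 11111,
      "protocol": "dokodemo-door",
      "settings": {
        "address": "127.0.0.1"
      }
    }
  ],
  "routing": {
    "rules": [
      {
        "type": "field",
        "inboundTag": ["metrics_in"],
        "outboundTag": "metrics_out"
      }
    ]
  }
}
```

A heap profile can then be fetched with `go tool pprof http://127.0.0.1:11111/debug/pprof/heap`, or downloaded with curl and attached here like the OP's heap.zip.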
I checked the original v2ray repository and found the same bug: https://github.com/v2fly/v2ray-core/issues/3086
I used the patch from that bug, and it seems to have helped with the big leak; at least the process doesn't eat memory as aggressively. Tested on the 1.8.23 release code.
place to patch https://github.com/XTLS/Xray-core/blob/41d03d1856a5b16521610792601a20fb1195418e/transport/pipe/impl.go#L92
Applied patch:

func (p *pipe) ReadMultiBuffer() (buf.MultiBuffer, error) {
	for {
		data, err := p.readMultiBufferInternal()
		if data != nil || err != nil {
			p.writeSignal.Signal()
			return data, err
		}

		timer := time.NewTimer(15 * time.Minute)
		select {
		case <-p.readSignal.Wait():
		case <-p.done.Wait():
		case <-timer.C: // new add
			return nil, buf.ErrReadTimeout // new add
		case err = <-p.errChan:
			return nil, err
		}
		timer.Stop() // new add
	}
}
@boris768 Then it seems the real issue might be that some pipe is not correctly closed somewhere, or generally resources are not cleaned up in some inbound/outbound or transport. Unfortunately this patch has the ability to work around a variety of different issues, it doesn't really explain what's going on IMO
@mmmray, I tested this patch; it can be used as a hacky way to fix the memory leak. I am using a simple XTLS-Reality setup, and without the fix, 8-10 clients drive xray memory usage up to 700-800 MB in one day (OOM kills the service). With the fix, peak memory usage reaches 150 MB. At least it gives information about where the leak is; I hope it will help to fix it.
Is the problem solved?
Not completely, but the issue is missing too much information (for example, in which version was it introduced?), nobody has investigated it much, and it's not clear whether this is one issue or N separate issues. As a developer I also wouldn't know what to do with it right now. I guess let's reopen when somebody has managed to make another heap profile or some other discovery. I think the patch cannot help like this, to be honest.
