Duplicate of #812. Memory increase after upgrade to 3.4.9
I have the same issue. 3.4.9 memory consumption grows to about 12GB over 6-7 days and then crashes.
It is on my server: Linux 5.13.0-1025-azure on x86_64, Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz, 4 cores, Virtualmin version 7.1-1, Real Memory 15GB
java -version
openjdk version "11.0.15" 2022-04-19
OpenJDK Runtime Environment (build 11.0.15+10-Ubuntu-0ubuntu0.20.04.1)
OpenJDK 64-Bit Server VM (build 11.0.15+10-Ubuntu-0ubuntu0.20.04.1, mixed mode, sharing)
I have about 20 charge points running, mostly Alfen with one Phihong
After downgrading to an earlier version of SteVe (3.4.6), it is now working properly (average memory 650MB).
#812
Hi @goekay, I appreciate this is a duplicate, but I didn't think #812 had been resolved. Since it is closed, I assumed the comment I added there wouldn't be responded to, so I started a new ticket.
it's all good. i just referenced it for the sake of completeness.
in terms of the problem at hand, i don't have anything to add. i cannot reproduce it (as i illustrated in my load tests). unfortunately, i don't know how to help.
ok, thanks for clarifying.
I'll see if we can find anything our side and report back.
@rdc-Green I was wondering if you happened to have made any changes to your database instance just prior to seeing the reported behaviour? Particularly with session timeouts?
Hi @shadsnz we haven't made any changes to the database instance.
Hi, I think I have found the issue.
I started to see this issue after the update to 3.5.0 from 3.4.3. It was crashing under the same conditions, within a few minutes of starting up. It's an AWS instance with 0.5 vCPU and 1GB of RAM.
Initially I thought it was because of the database (the MeterValues table is around 78GB), but it wasn't.
Then I started to look at the number of messages per second. In my setup, MeterValues are configured to be sent every second, and when there were around 20 transactions running at the same time, the instance would crash.
Also, the zig-zag pattern on the heap happens much more frequently on the new versions than on 3.4.3, even with only one message per second: the heap would grow by 50MB in a few seconds, and then the GC would release it. On 3.4.3 the same pattern happens every 5 minutes or so.
With this, I started to bisect the git history to find where the new behaviour was introduced, which turned out to be this commit: https://github.com/steve-community/steve/commit/7a41a3f6b2f270ec3ad58d908e11b053b2564e12 Essentially, the migration to Jetty 10 was the cause.
With a bit more digging, I noticed that the JVM buffer pools grew by 8MB every time a message was received. So it seems that the new Jetty version allocates 8MB of memory per received message, which is only released a few seconds later when the GC is triggered.
These two pictures show this behaviour in action.
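For anyone who wants to reproduce this observation without a full monitoring setup, here is a minimal sketch (not part of SteVe, just the standard JVM management API) that prints the buffer pool sizes once per second. The class name and the one-second interval are my own choices:

// Minimal sketch of how the buffer-pool growth described above can be
// observed from inside the JVM, using the standard management API.
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class BufferPoolWatcher {

    public static void main(String[] args) throws InterruptedException {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);

        // Print the "direct" and "mapped" pool sizes once per second. With the
        // 3.5.0 behaviour described above, the direct pool should jump by
        // roughly 8MB per received WebSocket message until the GC runs.
        while (true) {
            for (BufferPoolMXBean pool : pools) {
                System.out.printf("%-8s count=%d used=%d bytes capacity=%d bytes%n",
                        pool.getName(), pool.getCount(),
                        pool.getMemoryUsed(), pool.getTotalCapacity());
            }
            Thread.sleep(1_000);
        }
    }
}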
After some tries, I applied this patch on top of 3.5.0 (MAX_MSG_SIZE is 8MB):
diff --git a/src/main/java/de/rwth/idsg/steve/SteveAppContext.java b/src/main/java/de/rwth/idsg/steve/SteveAppContext.java
index 6c86c215..bedb9cf4 100644
--- a/src/main/java/de/rwth/idsg/steve/SteveAppContext.java
+++ b/src/main/java/de/rwth/idsg/steve/SteveAppContext.java
@@ -80,9 +80,9 @@ public class SteveAppContext {
*/
public void configureWebSocket() {
JettyWebSocketServerContainer container = JettyWebSocketServerContainer.getContainer(webAppContext.getServletContext());
- container.setInputBufferSize(MAX_MSG_SIZE);
- container.setOutputBufferSize(MAX_MSG_SIZE);
- container.setMaxTextMessageSize(MAX_MSG_SIZE);
+ //container.setInputBufferSize(MAX_MSG_SIZE);
+ //container.setOutputBufferSize(MAX_MSG_SIZE);
+ //container.setMaxTextMessageSize(MAX_MSG_SIZE);
container.setIdleTimeout(IDLE_TIMEOUT);
}
This seems to bring back the old behaviour. But I am not familiar with Jetty, and from the little reading I did today, I don't fully understand those settings or why the allocated memory isn't being reused. I will do a bit more reading on Monday, but if you have any ideas on this, let me know.
P.S. I am currently running a test with 120 simulated EVSEs, reporting MeterValues every second.
This will be running the whole weekend to see how it goes.
But so far, it seems good.
The same two graphs as before, for the last 15 minutes.
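For reference, the simulation is conceptually nothing more than many WebSocket clients, each sending a MeterValues call every second. The sketch below is an illustrative stand-in rather than the exact tool I used; the endpoint URL, the chargeBoxId naming and the payload are assumptions, and it only relies on the JDK 11 WebSocket client:

// Rough sketch of the load test described above: N simulated charge points,
// each sending an OCPP 1.6J MeterValues call every second.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MeterValuesLoadTest {

    public static void main(String[] args) {
        int stations = 120;
        HttpClient client = HttpClient.newHttpClient();
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);

        for (int i = 0; i < stations; i++) {
            String chargeBoxId = "SIM-" + i;   // hypothetical naming scheme
            // assumed local SteVe OCPP-J endpoint; adjust host/port/path as needed
            WebSocket ws = client.newWebSocketBuilder()
                    .subprotocols("ocpp1.6")
                    .buildAsync(URI.create("ws://localhost:8180/steve/websocket/CentralSystemService/" + chargeBoxId),
                            new WebSocket.Listener() {})
                    .join();

            // One MeterValues call per second, per simulated station.
            scheduler.scheduleAtFixedRate(() -> {
                String call = "[2,\"" + System.nanoTime() + "\",\"MeterValues\","
                        + "{\"connectorId\":1,\"meterValue\":[{\"timestamp\":\""
                        + java.time.Instant.now() + "\",\"sampledValue\":[{\"value\":\"42\"}]}]}]";
                ws.sendText(call, true);
            }, 0, 1, TimeUnit.SECONDS);
        }
    }
}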
The weekend test went well, and memory was stable for the whole weekend. It seems to me that it's behaving the same way as before.
After reading a bit more about this: the default value for MaxTextMessageSize is 65K, and if a message is bigger than that, it will be rejected. Off the top of my head, this doesn't seem to be a big issue, as OCPP messages tend to be quite small. I think MeterValues.req and DataTransfer are the only ones with arbitrary sizes.
Does anyone know of a situation where this 65K would be a limitation?
P.S. Keeping the current MaxTextMessageSize doesn't seem to affect the allocation, and the memory is still stable.
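To make the resulting configuration explicit, this is roughly what SteveAppContext.configureWebSocket() looks like with only the buffer-size calls removed and the larger text message limit kept (a sketch based on the diff above, not necessarily the final merged code):

public void configureWebSocket() {
    JettyWebSocketServerContainer container =
            JettyWebSocketServerContainer.getContainer(webAppContext.getServletContext());
    // input/output buffer sizes are left at Jetty's defaults; only the
    // message size limit and the idle timeout are configured explicitly.
    container.setMaxTextMessageSize(MAX_MSG_SIZE);
    container.setIdleTimeout(IDLE_TIMEOUT);
}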
hey @jpires this is amazing detective work! thanks for your diagnosis!
Does anyone know of a situation where this 65K would be a limitation?
i remember having issues with one of the messages you pointed out, with a particular station, which is why we increased the default.
--
it is weird that we had this config for a long time without it ever causing issues... only for it to become a problem after the jetty 10 migration. my changes in https://github.com/steve-community/steve/commit/7a41a3f6b2f270ec3ad58d908e11b053b2564e12 just use the new api; therefore, they are only syntactical in nature. the defaults back then were 65K as well. this alone should not have changed the behaviour.
i assume that during the refactoring of websocket handling (this is what happened with jetty 10) some behaviour changed or a regression was introduced. the regression must be flying under the radar with a small size (65K), while a greater value (8MB) accentuates it. there are a couple of jetty issues that hint at similar memory problems, for example: https://github.com/eclipse/jetty.project/issues/6696 and https://github.com/eclipse/jetty.project/issues/6328
i get a "rabbit hole" vibe from these jetty discussions and therefore want to be pragmatic: remove our special settings and fall back to the jetty defaults (as is the case in your git diff). i would like your valuable work to be part of this project as an official contribution under your name. can you pls make a PR? i will merge it ASAP.
I have created a pull request with the changes.
I have only removed the calls to setInputBufferSize and setOutputBufferSize.
This seems to solve the issues, but we are still able to receive messages bigger than 64k.
it is weird that we had this config for a long time without it ever causing issues... only for it to become a problem after the jetty 10 migration. my changes in https://github.com/steve-community/steve/commit/7a41a3f6b2f270ec3ad58d908e11b053b2564e12 just use the new api; therefore, they are only syntactical in nature. the defaults back then were 65K as well. this alone should not have changed the behaviour.
It seems to me that before that commit, setInputBufferSize and setOutputBufferSize were never called, and those two buffers stayed at the default 4K. And looking at the changes in the commit, they don't show up in the deleted lines.
I did some (very brief) digging in the Jetty code, and those two settings control the size of the buffer used to perform the read/write operations on the socket. So it allocates the buffer, reads from the socket, copies the read data to a different buffer and then deallocates the first one. Anyway, the size of this buffer doesn't really matter; the read operation will continue until there is no more data on the socket, meaning that if a message is bigger than 4K, the read operation will simply be repeated. I guess that increasing the size of these buffers would help if there were a lot of messages bigger than 4K, but I don't think that is the case with OCPP.
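In other words, the input buffer is just a staging area. A simplified illustration of that read pattern (my own Java sketch, not actual Jetty code):

// Simplified illustration of the read pattern described above: the network
// buffer of inputBufferSize bytes is only a staging area, and reading repeats
// until the whole message has arrived.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class ChunkedRead {

    static String readMessage(InputStream socket, int inputBufferSize) throws IOException {
        byte[] networkBuffer = new byte[inputBufferSize];      // e.g. 4K by default
        ByteArrayOutputStream message = new ByteArrayOutputStream();

        int n;
        while ((n = socket.read(networkBuffer)) != -1) {
            // Copy what was read into the growing message buffer; a message
            // larger than inputBufferSize just needs more iterations, so a
            // bigger network buffer mainly saves copies, not correctness.
            message.write(networkBuffer, 0, n);
        }
        return message.toString(StandardCharsets.UTF_8);
    }
}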
I would say that we can probably close this ticket after the pull request is merged. But it would be great to hear from the other people who experienced this issue.
fixed by #1058