node-red-contrib-loxone icon indicating copy to clipboard operation
node-red-contrib-loxone copied to clipboard

Cause 100% CPU consumption when mini server is at Config 15.3 Beta 3

Open hoschult opened this issue 1 year ago • 23 comments

module seems to be not working when running current beta (Config 15.3 Beta 3). The problem looks like: (1) node red process consumes 100% CPU even if there are no flows activated (2) structure file is recieved, however selector in nodes does not respect the correlation (a switch in room xyz) To isolate the issue I gave the mini server config node an ip address which is not reachable and redeployed the flow - CPU utilization of 100% was gone away; Switching back to the working ip address, high cpu load comes back. How to reproduce the problem: a) take a brand new node red instance b) install node-red-contrib-loxone c) connect to a mini server which is running at Config 15.3 Beta 3 d) observe looping node-red process with top etc... on os level (consumes about 100% cpu)

Loxone Config 15.3 Beta 2 was ok --> I'm not sure if issue is on Loxone side. However (and from my point of view), node-red-contrib-loxone should be robust enough to not go in nested loops (causing 100% cpu utilization) when receiving suspicious Loxone responses.

hoschult avatar Dec 01 '24 11:12 hoschult

Hey Holger,

you should be right. Could you get me the error message from node-red-log?

Cheers, Patrik

codmpm avatar Dec 02 '24 12:12 codmpm

Hi Patrik, there is no error message in node-red-log; In the cause of symptoms (of consuming 100% cpu load), I receive mini server connection timeouts resulting in new connection requests. As a user, I observe that the actions when I turn on a light (switch on Loxone app triggers a flow which creates a API requests towards shelly i.e.) is delayed by 5-10 seconds --> is logical because node-red process is totally loaded. Is there a trick to let the node-red-contrib-loxone do some debug so that I can find out where the nested loops are being done? Already tried to use some profiles.... strace is saying the following:

root@raspimain:~# strace -p 26573 -c
strace: Process 26573 attached
^Cstrace: Process 26573 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 66.36    0.057999          11      5078           clock_gettime64
 10.41    0.009095          57       159           writev
  7.69    0.006719          39       170         2 futex
  4.41    0.003857          43        88           sendmsg
  2.73    0.002390          88        27         3 epoll_ctl
  2.48    0.002171          22        98           write
  1.76    0.001534          23        66           read
  1.61    0.001409         128        11           epoll_wait
  1.00    0.000874         124         7         7 connect
  0.64    0.000559          69         8           close
  0.23    0.000201           6        31           setsockopt
  0.18    0.000156           4        36           getpid
  0.18    0.000155          51         3           shutdown
  0.14    0.000126          18         7           socket
  0.14    0.000120          30         4         2 accept4
  0.04    0.000031           4         7           getsockopt
  0.00    0.000000           0         8           mprotect
  0.00    0.000000           0         1           getpeername
  0.00    0.000000           0         3           cacheflush
------ ----------- ----------- --------- --------- ----------------
100.00    0.087396          15      5812        14 total

hoschult avatar Dec 02 '24 12:12 hoschult

Hmmm. That's strange. As typically there should be something to point in the direction where the problem occurs.

Will have to install beta 3 to check for myself.

codmpm avatar Dec 02 '24 12:12 codmpm

yes, same view. I have opened up a Loxone ticket; They told me that it's hard to do a downgrade of the firmware (just for your attention). Furthermore, there will be a Beta 4 arriving today or tomorrow; Will keep you updated.

hoschult avatar Dec 02 '24 13:12 hoschult

Oh, good to know. Could you test with beta 4?

If the problem persists I should change node-red-contrib-loxone. But of course this makes no sense for betas ;-)

codmpm avatar Dec 02 '24 13:12 codmpm

sure, will do. Searched around what clock_gettime64 from strace means (because this is the top scored call). Use case is a timing on, i.e.: Commonly used for high-resolution timing and performance measurements. Do you have any timing relevant code snippets in the module code?

hoschult avatar Dec 02 '24 13:12 hoschult

Do you have any timing relevant code snippets in the module code?

Not that I'm aware of.

codmpm avatar Dec 02 '24 14:12 codmpm

beta 4 was released and do have unfortunately the same load issue (100% CPU allocation on node-red process). Did some investigation with node.js profiler. Would like to share the following top top ticks which were accounted:

image

Steps to do the profiling: (1) start node red in profiling mode: NODE_ENV=production node --prof /usr/lib/node_modules/node-red/red.js (2) analyze the trace being created: node --prof-process isolate-0x5a1f408-28832-v8.log > processed.txt

hoschult avatar Dec 02 '24 19:12 hoschult

had a look at my LoxAPP3.json which comes from the miniserver; There are a lot of uuid resources defined; Maybe above shared profiles ticks comes from nested loops within node-lox-ws-api? The relevant code chapters are describing a couple of foreach like this:

this.connection.on('message_event_table_values', function(messages) {
   messages.forEach(function(evt){
       that.emit('update_event_value', evt.uuid.string, evt.value);
       that.emit('update_event_value_' + evt.uuid.string, evt.value);
   });
   that.emit('message_event_table_values', messages);

});

Only an assumption....

hoschult avatar Dec 02 '24 19:12 hoschult

I have to look into this but my time is currently limited. Feel free to open a pull request. As we already forked node-lox-ws-api there should be no problem to incorporate a fix there.

codmpm avatar Dec 02 '24 21:12 codmpm

some updates..... node.js inspection (via chrome dev tools profiler visualizer) says that process is busy all the time with running "processImmediate" (violett blocks in the screenshot of the flow bars):

image

Drilling down into the processing tree of re-occuring tasks (_updateEvent) points to the above mentioned code chapters:

image

which ends up at:

image

in loxone.js

this calls the expensive function in API.js:

image

Feel free to open a pull request.

try to support as best as possible, however I have doubts if my coder skills are strong enough :-(

hoschult avatar Dec 03 '24 08:12 hoschult

As 15.3 should be releases today, could you check it again with the final release? Of course I have to fix the bug then...

codmpm avatar Dec 03 '24 09:12 codmpm

now running on 15.3 (no beta any more), however same behavior :-( Some thoughts: Issue came to my attention between beta2 and beta3. In this step, they introduced MQTT client; Maybe they have changed the structure or how they operate on events/messages?

My IDE advised me to use the following code instead (at API.js beginning in line 127) to reduce emit calls and to stop foreach traveling when matching pair was found:

this.connection.on('message_event_table_values', function (messages) {
    const updateEvents = [];
    const specificUpdates = {};

    // Efficiently collect events
    for (const evt of messages) {
        updateEvents.push({ uuid: evt.uuid.string, value: evt.value });
        specificUpdates[`update_event_value_${evt.uuid.string}`] = evt.value;
    }

    // Batch emit operation
    that.emit('update_event_values', updateEvents);

    // Emit specific UUID updates
    for (const [eventKey, value] of Object.entries(specificUpdates)) {
        that.emit(eventKey, value);
    }

    // Emit the entire message
    that.emit('message_event_table_values', messages);
});

hoschult avatar Dec 03 '24 10:12 hoschult

see this pull request. CPU load has gone (good :-) ).... However delay in events stays; 5-10 seconds after loxone switch was triggered, message appeared in node red. Not acceptable user experience :-(

hoschult avatar Dec 03 '24 15:12 hoschult

while testing, seeing some events which are not being catched anymore: i.e. RGB light circuit in light module (lichtbaustein)

hoschult avatar Dec 03 '24 16:12 hoschult

Thank you so much. Will have to look into this....

codmpm avatar Dec 03 '24 17:12 codmpm

thx for the prework - so upgrading to 15.3 public isn't recommended at the moment, if you rely on nodered working with Loxone

bausi2k avatar Dec 04 '24 11:12 bausi2k

yes, at least in my humble opinion it's better to stay a moment on the older branch (especially because downgrade of miniserver is not that simple according to Loxone) .... Let's wait until Patrick can take a look into it.

hoschult avatar Dec 04 '24 12:12 hoschult

is it save to update? any news?

bausi2k avatar Dec 13 '24 11:12 bausi2k

Did not have time to look into it, yet. Sorry.

codmpm avatar Dec 13 '24 11:12 codmpm

just asking - thx for your time spent for the project.

bausi2k avatar Dec 13 '24 12:12 bausi2k

I'm on 15.3.12.13 with NR 4.0.7. No problem till now. So, the problem comes with an even newer version of Loxone config?

thmeger avatar Dec 19 '24 16:12 thmeger

Also no problems after updating also on my side.

bausi2k avatar Dec 30 '24 09:12 bausi2k