
MQTT trigger node executes more than one time

Open edlundin opened this issue 1 year ago • 51 comments

Hi!

After encountering significant problems with the MQTT trigger node, I decided to write this up, hoping it could help identify the origin of the issue.

Describe the bug
The MQTT trigger node behaviour is erratic.

I wrote two workflows, each containing an MQTT trigger node and a Function node (which only logs "Done!" to the console). Both MQTT nodes subscribe to the same topic.

It turns out that sometimes my workflow is executed multiple times, even when receiving only one message.

When activating only one of the two workflows, receiving a message on the subscribed topic causes a spike in resource consumption. Sometimes it consumes a lot of memory (from ~400MB to ~3.2GB for one message, in the worst case I witnessed), sometimes not so much, depending on how many times the workflow ran.

The real issue occurs when enabling both workflows. As above, the execution history shows multiple executions for one received message. Despite both of them being enabled (MQTT 1 & 2), only one is executed (MQTT 2).

[Screenshot: execution history]

The memory consumption explodes, as does the system load. Often it will fill the RAM and swap until the whole machine crashes. Even if you somehow survive this, the next message often triggers the same bug, even if one of the workflows has been disabled.

A screenshot of HTOP a bit before the VM dies:

[Screenshot: htop, 2022-08-03 15:45]

To Reproduce
Steps to reproduce the behaviour:

  1. Create a workflow with an MQTT trigger node and a function node
  2. Configure the MQTT trigger node
  3. Duplicate the workflow
  4. Enable both of the workflows
  5. Publish messages on the subscribed topic

Expected behaviour
Only one execution per workflow per MQTT message, and consistent memory consumption for this particular node.

Environment:

  • OS: Debian 11.4 / n8n docker image
  • n8n Version: 0.188.0
  • Node.js Version: node v18.7.0 / node v16.16
  • Database system: SQLite
  • Operation mode: Trigger
  • VM configuration: 4 cores - 8GB of RAM

Additional context
The MQTT message sent, always with QoS 0:

{
  "msg": "hello2"
}
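
For illustration, publishing such a test message could look like the sketch below, using mqtt.js directly (the library the thread later identifies as the one underlying n8n's MQTT nodes). The broker URL and topic are placeholders, not values from this report.

```typescript
import mqtt from 'mqtt';

// Placeholder broker URL and topic.
const client = mqtt.connect('mqtt://localhost:1883');

client.on('connect', () => {
  // Publish the test payload at QoS 0, as in the report, then disconnect.
  client.publish('some/topic', JSON.stringify({ msg: 'hello2' }), { qos: 0 }, () => {
    client.end();
  });
});
```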

edlundin avatar Aug 03 '22 14:08 edlundin

I am seeing this problem too: multiple executions for the same message. I have verified that it is actually the same payload triggering the same workflow multiple times.

I connected other software, such as MQTT Explorer, and it sees the message only once, yet in n8n multiple executions are triggered.

rvalle avatar Aug 25 '22 13:08 rvalle

Actually, this is my second MQTT workflow, but I looked back at the executions of the previous one, and I can also see tens of executions of the workflow within one second.

This may be a regression, because I have had my other workflow running for about a year, and back then, when I set it up, I did not notice any issue like this. I would have noticed this.

rvalle avatar Aug 25 '22 13:08 rvalle

cc @Joffcom MQTT seems to be completely broken to me, and it affects the stability of n8n. One incoming message triggers tens of executions.

I am reviewing the implementation of this trigger and looking into MQTT.js for clues about what could be going wrong.

Any idea of what could be happening?

I will check if I can reproduce this behavior from a vanilla node 16 test.

rvalle avatar Aug 25 '22 15:08 rvalle

Here is how dramatic the resource leak caused by the MQTT trigger is:

[Screenshot: Grafana memory graph for n8n]

You can see that the blue n8n container jumped to almost 5GB and crashed its VM.

rvalle avatar Aug 25 '22 18:08 rvalle

I did some additional testing with a brand new container: I got 2 listeners to work normally, and adding a third reproduced the problem. So @edlundin seems right, the number of listeners seems to play an important role.

rvalle avatar Aug 26 '22 07:08 rvalle

I think I found the problem.

@edlundin did you fill in the client ID in your MQTT listener settings?

I did, and that causes the trouble. The issue is that, according to the MQTT specification, the broker should not allow more than one connection with the same client ID (not to be confused with the username/password).

If you create several listeners with the same client ID it will crash, but if you leave it blank, n8n seems to generate a different client ID per listener.

The client ID should not be a connection setting but a listener or publisher option, and the recommendation should be to leave it blank for auto-generation, or to warn the user that uniqueness must be ensured (n8n could also enforce uniqueness).

During testing I also noticed that MQTT listeners whose connections are closed during n8n startup will crash the container and put it in a state that cannot be fixed. A safe-start mode would be nice (maybe one exists) so that you can disable "poisoned" workflows.

So far I have set up my MQTT listeners with the same credentials and a blank client ID, and everything seems to be working now.

I use Mosquitto as the broker.

@edlundin could you check if this also fixes your issue? cc @Joffcom

rvalle avatar Aug 26 '22 07:08 rvalle
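
To make the client ID collision described above concrete, here is a minimal sketch using mqtt.js directly, outside n8n. The broker URL, topic, and the shared 'n8n' client ID are placeholder values.

```typescript
import mqtt from 'mqtt';

// Two listeners sharing one client ID (placeholder values throughout).
const opts = { clientId: 'n8n', clean: true };

const first = mqtt.connect('mqtt://localhost:1883', opts);
first.on('connect', () => first.subscribe('sensors/#'));
first.on('close', () => console.log('first listener was kicked off by the broker'));

// Per the MQTT spec, the broker must drop the existing session when a second
// connection presents the same client ID, so the two clients keep kicking each other off.
const second = mqtt.connect('mqtt://localhost:1883', opts);
second.on('connect', () => second.subscribe('sensors/#'));
second.on('close', () => console.log('second listener was kicked off by the broker'));
```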

I checked my client ID, and it seems to be the root of the issue, since it was the same in both listeners. To avoid any confusion in the future, maybe the clientId field should be outside of the MQTT credentials.

I am using EMQX as a broker. I tried using an external MQTT client (MQTTX) with the same client ID as the listener in n8n. It seems that (in the case of EMQX at least) the last client initiating a connection with a conflicting client ID gets to keep it, while the server disconnects the already connected client.

This client will then issue reconnects, which are always rejected by the server. Could it be that n8n has a problem with the state of this last client?

edlundin avatar Aug 26 '22 08:08 edlundin

I think multiple connections with the same client ID are basically banned by the protocol. It is like two computers with the same IP on the same network: anything can go wrong from then on.

I think the bug is related to UX: the client ID should not be asked for when setting up credentials, because that is what creates the error. I did set "n8n" as the client ID.

The client ID should be generated automatically without asking the user, and could perhaps be an advanced option of the listener or publisher in case there is a particular use case that requires a specific client ID.

rvalle avatar Aug 26 '22 08:08 rvalle

That is some nice work there @rvalle. I am not sure if we had started to look into this one yet, but it sounds like a quick win could be to mark the field as advanced with a note about it.

Joffcom avatar Aug 26 '22 08:08 Joffcom

I agree with not asking for the client ID while setting up the credentials, since multiple listeners can share credentials, unlike the client ID. On the other hand, I would not suggest generating the client ID without the user's say. You might want to pre-fill the client ID field with something (e.g. "n8n_$TIMESTAMP_MS"), but ultimately the user has to be able to change it with ease (so without hiding it in an "advanced panel" or equivalent).

Finally, while the protocol prohibits multiple connections with the same client ID, I still think there is a deeper issue in how the MQTT node (and possibly n8n) deals with that error internally. Even if pre-filling the client ID mitigates the problem, n8n should not crash an entire system because a user forgot about or is unaware of the MQTT specification.

edlundin avatar Aug 26 '22 08:08 edlundin
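
A minimal sketch of the pre-fill idea suggested above; the helper name, the "n8n_" prefix, and the random suffix are illustrative assumptions, not current n8n behaviour.

```typescript
// Hypothetical helper: returns the user-supplied client ID if present,
// otherwise a readable, reasonably unique default the user can still edit.
function defaultClientId(userValue?: string): string {
  if (userValue && userValue.trim() !== '') {
    return userValue; // the user chose an ID; uniqueness is their responsibility
  }
  // e.g. "n8n_1661500800000_4f2a": timestamp in ms plus a short random suffix
  const suffix = Math.random().toString(16).slice(2, 6);
  return `n8n_${Date.now()}_${suffix}`;
}
```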

@edlundin I mostly agree. The client ID seems of little value other than for debugging. I have checked other software, and they all seem to do the same: ask for it, and auto-generate one if it is not provided. n8n generates mqttjs-<random id>, so I guess it comes from the underlying library.

Your proposal of generating one the user can see, e.g. n8n-<random id>, is better, because it aids debugging when you want to know which listener owns which connection. However, that would first require moving the client ID to a listener/publisher option.

I agree with @Joffcom that the ultra-quick fix is just to warn users not to set it unless they know the implications (advanced). Currently the implication is that one set of credentials + client ID needs to be created per MQTT node.

As a second step, I would move it out of the credentials and into the nodes, with your suggestion of pre-generating it and a warning that replacing it is an advanced action.

It would be amazing if n8n enforced uniqueness of client IDs and failed in the UI when setting up a duplicate client ID, but that is a stretch goal!

rvalle avatar Aug 26 '22 08:08 rvalle

@Joffcom I have now been following this for some weeks.

Even though the stability of the MQTT listeners improves massively after working around the client ID issue, problems remain.

In my case, listeners start delivering one extra copy of each message per week of uptime, i.e. after 2 weeks up and running, messages come in 3 times. Otherwise the system is stable: no massive resource leak, etc. A restart of the n8n container fixes the issue.

@edlundin are you seeing this behavior too?

rvalle avatar Sep 12 '22 09:09 rvalle

Hey @rvalle,

When you say the listeners are delivering an extra message, is that the trigger in n8n receiving an extra message, or something else sending a message?

I will have a look later this week to see if I can reproduce the duplicate items being read issue.

Joffcom avatar Sep 12 '22 09:09 Joffcom

@Joffcom I mean that n8n will start duplicating message deliveries.

I connect both n8n and MQTT Explorer to Mosquitto. n8n consistently reports duplicated messages while MQTT Explorer reports only one copy. Both clients connect with the same QoS. And as said, restarting n8n fixes the issue.

I believe the extra messages are being generated by the MQTT stack inside n8n. It will be difficult to reproduce, or will require some kind of stress testing, because I only see this condition after n8n has been running for about a week.

To give you an idea of the traffic/size: I have around 40 devices in my MQTT IoT network, 3 of which trigger workflows. So in n8n I have 1 set of MQTT credentials and 3 listeners. My actual workflow sends a message to Matrix IM after checking the MQTT payload for some condition.

Other projects using this MQTT stack must have experienced similar issues, I guess.

rvalle avatar Sep 13 '22 06:09 rvalle

Hey @rvalle,

I did see one person mention duplicate messages on the mqtt.js issue tracker, but that was it. In theory, if it happens after a week, I should be able to set up an MQTT server, fire some messages in and just watch the results to see what happens.

It looks like mqtt.js has a debug env option, so I can see if that works as well. While yours is running, is n8n or MQTT ever restarted?

Joffcom avatar Sep 13 '22 08:09 Joffcom

Hey @rvalle,

What do you have the QoS option set to? When you say you have 3 listeners, is that 3 trigger nodes looking at different topics or the same topic?

Joffcom avatar Sep 15 '22 08:09 Joffcom

@Joffcom they look at different topics. No special options set (JSON parsing only), no QoS set (defaults to 0).

rvalle avatar Sep 15 '22 08:09 rvalle

@rvalle perfect, and roughly how many messages is that per day?

Joffcom avatar Sep 15 '22 08:09 Joffcom

@Joffcom I think roughly 100 per minute, which would be around 150K per day, meaning the issue pops up after approximately 1M messages.

Of course, this is for all 40 devices in the MQTT network; perhaps only the targeted messages matter. Then it may be much less, as only 3 of the 40 devices are involved.

rvalle avatar Sep 15 '22 08:09 rvalle

@rvalle cheers. I realised my testing was a bit too simple, so I just gave the bombing script a go and killed my n8n instance, so I will need to bump up the memory a bit on that. I suspect this may come down to resources and something odd in the package we are using.

I don't think this is going to be a quick fix at the moment, but I have created N8N-4647, which is our internal reference for tracking this one.

Joffcom avatar Sep 15 '22 09:09 Joffcom

Hey @rvalle,

I was just chatting to someone who knows a lot more about MQTT than me, and they suggested setting the QoS to 2 to see if the same issue occurs.

Joffcom avatar Sep 15 '22 11:09 Joffcom
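
For reference, setting the QoS in a plain mqtt.js subscriber could look like the sketch below; the broker URL and topic are placeholders.

```typescript
import mqtt from 'mqtt';

// Placeholder broker URL and topic.
const client = mqtt.connect('mqtt://localhost:1883');

client.on('connect', () => {
  // QoS 2 ("exactly once") asks broker and client to deduplicate deliveries.
  client.subscribe('sensors/device1', { qos: 2 });
});

client.on('message', (topic, payload) => {
  console.log(topic, payload.toString());
});
```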

@Joffcom I can surely try it, but I doubt it will work, because when I connect MQTT Explorer in parallel with a similar QoS, it gets only one copy.

But you never know; it does not cost anything to try. I will set up one of the listeners with QoS 2 and see the impact.

On another note, I can confirm that it keeps happening after restarts. I was also thinking that I could enable remote debugging on my container and, once the bug is active, open a debug session to try to gain some more insights.

rvalle avatar Sep 29 '22 09:09 rvalle

I am seeing exactly the same behavior with the Redis trigger, so this might not be an MQTT-only issue.

kevinrademan avatar Oct 12 '22 08:10 kevinrademan

@kevinrademan That is interesting. So are you seeing the same message being pulled multiple times? I would be surprised if it is the same cause, but you never know.

Joffcom avatar Oct 12 '22 09:10 Joffcom

@Joffcom I wouldn't call it pulled multiple times. It's almost like the Redis trigger doesn't clean up some kind of queue, resulting in previous executions of the trigger running again. So over time the number of jobs that get executed keeps increasing.

Here's my use case. I'm doing some local development on a Shopify integration and needed the webhooks to reach my machine. We have a self-hosted version of n8n which receives the Shopify webhook and just publishes the raw data into a Redis queue.

Locally, I also have n8n running, which uses the Redis trigger to receive messages published to a specific channel and then uses the HTTP Request node to forward the data to my local dev environment.

When I check the execution logs for the publicly accessible n8n instance, I can see that 1 Shopify webhook request results in 1 entry in the execution history. (I can also see that it only publishes 1 event to Redis.)

Locally, however, that 1 execution causes up to 20 executions of my workflow. It also seems to get worse over time, which makes me think this is some queue or event listener that's not getting cleaned up. It only executes the 20 jobs when a Shopify webhook comes in, so it's 1 Shopify webhook trigger that causes the 20 jobs to execute. And for the record, it doesn't start at 20 jobs. Initially it works and executes just fine, but as time goes on it causes more and more executions.

I hope that explains the issue. I'd be more than happy to provide any debug information you might need.

kevinrademan avatar Oct 13 '22 11:10 kevinrademan
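
A minimal sketch of the relay pattern described above, using node-redis v4 directly rather than n8n's internal code; the channel name, URL, and payload are placeholders.

```typescript
import { createClient } from 'redis';

async function main() {
  // Subscriber side: roughly what the local Redis Trigger workflow does.
  const subscriber = createClient({ url: 'redis://localhost:6379' });
  await subscriber.connect();
  await subscriber.subscribe('shopify-webhooks', (message) => {
    // Expected: exactly one callback per published message.
    console.log('forwarding to the local dev environment:', message);
  });

  // Publisher side: roughly what the public instance does after the Shopify webhook.
  const publisher = createClient({ url: 'redis://localhost:6379' });
  await publisher.connect();
  await publisher.publish('shopify-webhooks', JSON.stringify({ order: 'example' }));
}

main().catch(console.error);
```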

Hey @kevinrademan,

What is the difference between the 2 instances? I would expect both the public and development instances to have the same problem, unless maybe one is Docker or npm and the other is Desktop.

Joffcom avatar Oct 13 '22 11:10 Joffcom

@Joffcom The public instance is hosted in a Kubernetes cluster that is publicly accessible. This instance has a Webhook trigger and a Redis node in the workflow (I've attached a screenshot). [Screenshot]

The other instance is hosted locally in Docker Desktop's Kubernetes and uses a Redis Trigger and an HTTP Request node. [Screenshot]

The public one works exactly as expected: 1 webhook results in 1 execution of the Redis node. Locally, however, when that Redis message comes in, the workflow is executed an arbitrary number of times that seems to increase over time.

(edit) It might be worth mentioning that I use this Helm chart to deploy n8n: https://github.com/8gears/n8n-helm-chart

kevinrademan avatar Oct 13 '22 11:10 kevinrademan

@Joffcom Here are some screenshots of the execution. I created this simple flow, which allows me to publish 1 message to Redis: [Screenshot]

And here is an example of what happens locally when I run that workflow (this was after a restart of the pod; more information on that below): [Screenshot]

Finally, restarting the pod running n8n temporarily "fixes" the issue. However, over time the number of executions starts to increase again. Looking at the code, I get the feeling there might be some edge cases resulting in the closeFunction not getting called. That would cause the behavior I'm seeing. https://github.com/n8n-io/n8n/blob/91d7e16c81675ec1573196271c79cb67503f2ab5/packages/nodes-base/nodes/Redis/RedisTrigger.node.ts#L127

The other thing I've noticed is that locally, when I pull the logs, I repeatedly see the "Loading configuration overwrites from" log lines being written. [Screenshot]

Perhaps some kind of soft restart isn't cleaning up the Redis client correctly? It's worth noting that the pod itself isn't restarting, so it would have to be something inside the Docker container.

kevinrademan avatar Oct 13 '22 12:10 kevinrademan
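
To illustrate the suspected leak, here is a rough sketch of the trigger/closeFunction contract being discussed; this is not the actual n8n RedisTrigger code, and the names and URL are placeholders. If closeFunction is never called (or fails before unsubscribing), the old subscription stays alive and every reactivation adds another listener, so one published message fans out into several executions.

```typescript
import { createClient } from 'redis';

// Sketch of a trigger that subscribes on activation and is expected to clean up
// via closeFunction on deactivation.
async function triggerSketch(channel: string, emit: (data: string) => void) {
  const subscriber = createClient({ url: 'redis://localhost:6379' });
  await subscriber.connect();
  await subscriber.subscribe(channel, (message) => emit(message));

  // If this is never invoked (or throws before unsubscribing), the subscription
  // above leaks and the next activation adds a second listener for the same channel.
  async function closeFunction() {
    await subscriber.unsubscribe(channel);
    await subscriber.quit();
  }

  return { closeFunction };
}
```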

I enabled debug logging, but there is no useful information there. I noticed that the "Loading configuration" logs happen when the sub-processes start up, so it's not a restart causing the leak. The only logs that appeared were the health check and this one, which doesn't seem related.

2022-10-13T12:57:19.443Z | debug    | Wait tracker querying database for waiting executions "{ file: 'WaitTracker.js', function: 'getWaitingExecutions' }

kevinrademan avatar Oct 13 '22 13:10 kevinrademan

Hey @kevinrademan,

Looking at this, I don't believe it is related to the MQTT trigger issue. I take it the public instance is not using a Redis trigger at all and is just posting the data so the local instance can pick it up?

How many n8n instances are running locally? I am wondering if maybe there are multiple workers picking up the same message.

Joffcom avatar Oct 13 '22 16:10 Joffcom