Executions running forever - Can't cancel
Describe the bug
Some workflow executions run forever and can't be canceled or stopped. The behavior seems random, as it can affect any workflow.
Most executions go well, but some keep running forever. All workflows have a timeout set in their settings.
It has been happening across different n8n versions, up to 1.15.2 (the last time it happened).
Attached screenshots are below.
To Reproduce
Steps to reproduce the behavior:
- Unknown. I haven't found any steps to reproduce it.
Expected behavior
- The execution should not keep running forever.
- The timeout settings should be respected.
- An error should be raised.
Environment (please complete the following information):
- OS: Ubuntu Linux 23.10
- n8n Version: 1.15.2
- Node.js Version: ?
- Database system: postgresql
- Operation mode: ? [e.g. own; operation modes are own, main and queue. Default is main]
Additional context
Screenshot: Running forever
Screenshot: Can't cancel
I've already checked #5561 and #7442, but this doesn't seem to be the same issue.
The only way I've found to stop them is restarting the Docker stack (using Portainer).
Hey @pablorq,
That error looks like the execution may already have been stopped in the background. Can you share more about your setup? Are you using a reverse proxy, or do you have Cloudflare or something similar involved?
Next time you see this issue, it would be interesting to know if you see the same thing when connecting using the direct IP of your n8n container. Are you also running in queue mode, or do you just have the one n8n instance running?
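Something along these lines should give you the container's direct IP to try (the container name here is just an example; adjust it to your stack):
# Find the n8n container's IP on the Docker network
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' n8n-main-1
# Then hit n8n directly on that IP, bypassing any proxy (or open it in the browser)
curl -v http://<container-ip>:5678/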
Hi @Joffcom,
I'm running n8n in a simple way, just a local stack with one instance.
When you said "stopped in the background", do you mean that the task was already stopped but not reported to the UI? Is there any check or debug that I can run inside the container?
This is my docker-compose-like file (I'm using Portainer):
# n8n with persistent volume Postgres
# https://raw.githubusercontent.com/n8n-io/n8n/master/docker/compose/withPostgres/docker-compose.yml
version: '3.8'
volumes:
db_data:
main_data:
services:
postgres:
image: postgres:11
restart: always
environment:
- POSTGRES_USER=${POSTGRES_USER}
- POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
- POSTGRES_DB=${POSTGRES_DB}
- POSTGRES_NON_ROOT_USER=${POSTGRES_NON_ROOT_USER}
- POSTGRES_NON_ROOT_PASSWORD=${POSTGRES_NON_ROOT_PASSWORD}
volumes:
- db_data:/var/lib/postgresql/data
- ${N8N_LOCAL_FILES}/n8n-config/init-data.sh:/docker-entrypoint-initdb.d/init-data.sh
healthcheck:
test: ['CMD-SHELL', 'pg_isready -h localhost -U ${POSTGRES_USER} -d ${POSTGRES_DB}']
interval: 5s
timeout: 5s
retries: 10
main:
image: docker.n8n.io/n8nio/n8n:${N8N_VERSION}
restart: always
environment:
- DB_TYPE=postgresdb
- DB_POSTGRESDB_HOST=postgres
- DB_POSTGRESDB_PORT=5432
- DB_POSTGRESDB_DATABASE=${POSTGRES_DB}
- DB_POSTGRESDB_USER=${POSTGRES_NON_ROOT_USER}
- DB_POSTGRESDB_PASSWORD=${POSTGRES_NON_ROOT_PASSWORD}
- GENERIC_TIMEZONE=Europe/Madrid
- TZ=Europe/Madrid
- NODE_FUNCTION_ALLOW_EXTERNAL=*
ports:
- 5678:5678
links:
- postgres
volumes:
- main_data:/home/node/.n8n
- ${N8N_LOCAL_FILES}:/files
depends_on:
postgres:
condition: service_healthy
user: root
entrypoint: sh -c
command:
- |
npm install -g @mozilla/readability jsdom
su node -c "tini -s -- /docker-entrypoint.sh &"
sleep infinity
extra_hosts:
# Access host localhost
- "local.host:host-gateway"
# environment
# N8N_VERSION=1.15.2
As you can see, I had to modify the entrypoint to install some Node.js modules for the Code node, but that shouldn't break anything.
By the way, last week I updated to v1.16.0. So far, everything is working OK.
Hey @pablorq,
That is good news, so this was fixed by updating?
I'm afraid not. It's happening again:
I won't restart the container, so I can run some tests.
Can you please tell me what to test?
Remember that all workflows have a timeout of 5 or 10 minutes, which doesn't seem to be respected.
@pablorq I have experienced a similar issue - do you have a node used by all of those workflows that might be the source?
It seems like if a node breaks or throws an exception in an unexpected way, it can fail to return execution control to the workflow. Instead, the node is stuck as 'executing' and the workflow waits forever for it to finish.
I'm working on a custom programmatic node, and ensuring that the node fails gracefully has been a 'gotcha'.
@jmaurer1994 yes, the "... Create script" workflow is called last in a chain.
But the point is that it (and all of them) has an Error Workflow set to report errors, and a timeout to avoid running forever.
The issue here is that the timeout mechanism isn't working properly.
That's why I've asked for instructions to run some tests.
(Note: this is not affecting my overall tasks; I've taken some countermeasures with this issue in mind.)
Hey @pablorq,
If you open one of the workflows and try to cancel it, what happens? Do you get the same message about it not being possible?
Can you also check your database to see if it is still showing the execution as running? Something like
select * from execution_entity where id = 6684;
would show one of them.
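To list everything still marked as unfinished in one go, something like this from the Postgres container should also work (container name and user are placeholders; the column names are the ones in the execution_entity table):
docker exec -it <postgres-container> psql -U <db-user> -d n8n \
  -c 'SELECT id, status, mode, "startedAt", "workflowId" FROM execution_entity WHERE finished = false AND "stoppedAt" IS NULL ORDER BY "startedAt";'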
This is the status of the 6 of them:
n8n=> select * from execution_entity where id = 6684;
id | finished | mode | retryOf | retrySuccessId | startedAt | stoppedAt | waitTill | status | workflowId | deletedAt
------+----------+------------+---------+----------------+---------------------------+-----------+----------+--------+------------------+-----------
6684 | f | integrated | | | 2023-11-29 23:49:05.63+00 | | | new | bjGeTJmgkCB2B45l |
(1 row)
n8n=> select * from execution_entity where id = 6683;
id | finished | mode | retryOf | retrySuccessId | startedAt | stoppedAt | waitTill | status | workflowId | deletedAt
------+----------+------------+---------+----------------+---------------------------+-----------+----------+--------+------------------+-----------
6683 | f | integrated | | | 2023-11-29 23:49:02.97+00 | | | new | kVViRcmElddbWnvh |
(1 row)
n8n=> select * from execution_entity where id = 6682;
id | finished | mode | retryOf | retrySuccessId | startedAt | stoppedAt | waitTill | status | workflowId | deletedAt
------+----------+---------+---------+----------------+----------------------------+-----------+----------+---------+------------------+-----------
6682 | f | trigger | | | 2023-11-29 23:49:02.942+00 | | | running | ikCLxLlGsCGWEwdA |
(1 row)
n8n=> select * from execution_entity where id = 6665;
id | finished | mode | retryOf | retrySuccessId | startedAt | stoppedAt | waitTill | status | workflowId | deletedAt
------+----------+------------+---------+----------------+----------------------------+-----------+----------+--------+------------------+-----------
6665 | f | integrated | | | 2023-11-29 20:40:07.133+00 | | | new | bjGeTJmgkCB2B45l |
(1 row)
n8n=> select * from execution_entity where id = 6664;
id | finished | mode | retryOf | retrySuccessId | startedAt | stoppedAt | waitTill | status | workflowId | deletedAt
------+----------+------------+---------+----------------+----------------------------+-----------+----------+--------+------------------+-----------
6664 | f | integrated | | | 2023-11-29 20:40:03.071+00 | | | new | kVViRcmElddbWnvh |
(1 row)
n8n=> select * from execution_entity where id = 6663;
id | finished | mode | retryOf | retrySuccessId | startedAt | stoppedAt | waitTill | status | workflowId | deletedAt
------+----------+---------+---------+----------------+----------------------------+-----------+----------+---------+------------------+-----------
6663 | f | trigger | | | 2023-11-29 20:40:03.045+00 | | | running | ikCLxLlGsCGWEwdA |
(1 row)
I made this video while trying to stop the workflows: Screencast from 2023-12-01 15-25-52.webm
In the video you can see the "Problem stopping execution" warning.
After trying to stop them, I've checked the DB and everything is still the same.
Any other test?
Hey @pablorq,
It looks like the job has not been started yet, which is very odd. I was sure we had fixed this recently as well 🤔 Can you go up to 1.18.1 and enable debug logging?
I've upgraded to v1.18.2. I'll wait and see whether this happens again.
Debug
I've set up debug logging following the n8n documentation, changing the log file location:
# Set the logging level to 'debug'
- N8N_LOG_LEVEL=debug
# Set log output to both console and a log file
- N8N_LOG_OUTPUT=console,file
# Set a save location for the log file
- N8N_LOG_FILE_LOCATION=/home/node/.n8n/logs/n8n.log
# Set a 50 MB maximum size for each log file
- N8N_LOG_FILE_MAXSIZE=50
# Set 60 as the maximum number of log files to be kept
- N8N_LOG_FILE_MAXCOUNT=60
Now I have logs in the console (the container log), but no log file.
After looking in detail, the "logs" directory was created with owner root:node. I don't know if it was created by the app, by Docker, or by me. Maybe that's why the node user couldn't create the log file there.
I changed it manually to node:node.
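Something like this, run from the host, should do the trick (the container name is just an example):
# Fix ownership of the logs directory so the node user can write the log file
docker exec -u root -it n8n-main-1 chown -R node:node /home/node/.n8n/logs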
Forever Running Executions
I have 2 running executions right now:
They have been running for more than 2800 minutes, which far exceeds the 10-minute timeout in each workflow's settings.
Of course, trying to stop those executions didn't work:
Logs
This time, we have the logs from the console:
Console log
2023-12-07T14:46:37.119Z [Rudder] debug: cancelling existing timer...
2023-12-07T14:46:37.119Z [Rudder] debug: cancelling existing flushTimer...
2023-12-07T14:46:37.119Z [Rudder] debug: batch size is 1
2023-12-07T14:46:53.006Z | debug | Polling trigger initiated for workflow "RecaLinux - Episode - RSS posts 9to5" "{\n workflowName: 'RecaLinux - Episode - RSS posts 9to5',\n workflowId: 'ikCLxLlGsCGWEwdA',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:46:57.312Z [Rudder] debug: in flush
2023-12-07T14:46:57.313Z [Rudder] debug: cancelling existing timer...
2023-12-07T14:46:57.313Z [Rudder] debug: queue is empty, nothing to flush
2023-12-07T14:47:14.920Z | debug | Wait tracker querying database for waiting executions "{ file: 'WaitTracker.js', function: 'getWaitingExecutions' }"
2023-12-07T14:47:23.010Z | debug | Polling trigger initiated for workflow "RecaLinux - Episode - RSS posts linuxgameconsortium" "{\n workflowName: 'RecaLinux - Episode - RSS posts linuxgameconsortium',\n workflowId: 'T0QLDQixV0i2RRZz',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:24.895Z | debug | Running node "Wait" finished successfully "{\n node: 'Wait',\n workflowId: 'kVViRcmElddbWnvh',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:24.897Z | debug | Start processing node "call_create_script" "{\n node: 'call_create_script',\n workflowId: 'kVViRcmElddbWnvh',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:24.898Z | debug | Running node "call_create_script" started "{\n node: 'call_create_script',\n workflowId: 'kVViRcmElddbWnvh',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:24.958Z | verbose | Workflow execution started "{\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.verbose'\n}"
2023-12-07T14:47:24.963Z | debug | Start processing node "When clicking "Execute Workflow"" "{\n node: 'When clicking \"Execute Workflow\"',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:24.964Z | debug | Running node "When clicking "Execute Workflow"" started "{\n node: 'When clicking \"Execute Workflow\"',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:24.964Z | debug | Running node "When clicking "Execute Workflow"" finished successfully "{\n node: 'When clicking \"Execute Workflow\"',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:24.965Z | debug | Start processing node "source" "{\n node: 'source',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:24.966Z | debug | Running node "source" started "{\n node: 'source',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:24.967Z | debug | Running node "source" finished successfully "{\n node: 'source',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:24.968Z | debug | Start processing node "HTTP Request" "{\n node: 'HTTP Request',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:24.969Z | debug | Running node "HTTP Request" started "{\n node: 'HTTP Request',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:24.975Z | debug | Proxying request to axios "{ file: 'LoggerProxy.js', function: 'exports.debug' }"
2023-12-07T14:47:25.198Z | debug | Running node "HTTP Request" finished successfully "{\n node: 'HTTP Request',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:25.201Z | debug | Start processing node "read_article" "{\n node: 'read_article',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:25.207Z | debug | Running node "read_article" started "{\n node: 'read_article',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:27.160Z | debug | Running node "read_article" finished successfully "{\n node: 'read_article',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:27.163Z | debug | Start processing node "tech_summary" "{\n node: 'tech_summary',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:27.164Z | debug | Running node "tech_summary" started "{\n node: 'tech_summary',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:34.008Z | debug | Polling trigger initiated for workflow "RecaLinux - Episode - RSS posts omgubuntu" "{\n workflowName: 'RecaLinux - Episode - RSS posts omgubuntu',\n workflowId: 'OyDUydZ7i8DPCcp2',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
2023-12-07T14:47:35.932Z | debug | Running node "tech_summary" finished successfully "{\n node: 'tech_summary',\n workflowId: 'bjGeTJmgkCB2B45l',\n file: 'LoggerProxy.js',\n function: 'exports.debug'\n}"
Deprecation warning: value provided is not in a recognized RFC2822 or ISO format. moment construction falls back to js Date(), which is not reliable across all browsers and versions. Non RFC2822/ISO date formats are discouraged. Please refer to http://momentjs.com/guides/#/warnings/js-date/ for more info.
Arguments:
[0] _isAMomentObject: true, _isUTC: false, _useUTC: false, _l: undefined, _i: Thu Dec 07 2023 16:00:00 GMT+0100 (Central European Standard Time), _f: undefined, _strict: undefined, _locale: [object Object]
Error:
at Function.createFromInputFallback (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:324:25)
at configFromString (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:2550:19)
at configFromInput (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:2993:13)
at prepareConfig (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:2976:13)
at createFromConfig (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:2943:44)
at createLocalOrUTC (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:3037:16)
at createLocal (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:3041:16)
at hooks (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:16:29)
at Object.execute (/usr/local/lib/node_modules/n8n/node_modules/n8n-nodes-base/dist/nodes/DateTime/V1/DateTimeV1.node.js:409:92)
at Workflow.runNode (/usr/local/lib/node_modules/n8n/node_modules/n8n-workflow/dist/Workflow.js:672:42)
2023-12-08T11:00:07.317Z [Rudder] debug: no existing flush timer, creating new one
2023-12-08T11:00:27.321Z [Rudder] debug: in flush
2023-12-08T11:00:27.321Z [Rudder] debug: cancelling existing flushTimer...
2023-12-08T11:00:27.321Z [Rudder] debug: batch size is 12
2023-12-08T11:00:47.740Z [Rudder] debug: in flush
2023-12-08T11:00:47.740Z [Rudder] debug: cancelling existing timer...
2023-12-08T11:00:47.740Z [Rudder] debug: queue is empty, nothing to flush
2023-12-08T17:00:07.315Z [Rudder] debug: no existing flush timer, creating new one
2023-12-08T17:00:27.316Z [Rudder] debug: in flush
2023-12-08T17:00:27.316Z [Rudder] debug: cancelling existing flushTimer...
2023-12-08T17:00:27.316Z [Rudder] debug: batch size is 6
2023-12-08T17:00:47.564Z [Rudder] debug: in flush
2023-12-08T17:00:47.564Z [Rudder] debug: cancelling existing timer...
2023-12-08T17:00:47.564Z [Rudder] debug: queue is empty, nothing to flush
2023-12-08T23:00:07.315Z [Rudder] debug: no existing flush timer, creating new one
2023-12-08T23:00:27.316Z [Rudder] debug: in flush
2023-12-08T23:00:27.316Z [Rudder] debug: cancelling existing flushTimer...
2023-12-08T23:00:27.317Z [Rudder] debug: batch size is 6
2023-12-08T23:00:47.560Z [Rudder] debug: in flush
2023-12-08T23:00:47.561Z [Rudder] debug: cancelling existing timer...
2023-12-08T23:00:47.561Z [Rudder] debug: queue is empty, nothing to flush
2023-12-09T05:00:07.314Z [Rudder] debug: no existing flush timer, creating new one
2023-12-09T05:00:27.317Z [Rudder] debug: in flush
2023-12-09T05:00:27.317Z [Rudder] debug: cancelling existing flushTimer...
2023-12-09T05:00:27.317Z [Rudder] debug: batch size is 6
2023-12-09T05:00:47.512Z [Rudder] debug: in flush
2023-12-09T05:00:47.512Z [Rudder] debug: cancelling existing timer...
2023-12-09T05:00:47.512Z [Rudder] debug: queue is empty, nothing to flush
2023-12-09T11:00:07.439Z [Rudder] debug: no existing flush timer, creating new one
2023-12-09T11:00:27.440Z [Rudder] debug: in flush
2023-12-09T11:00:27.441Z [Rudder] debug: cancelling existing flushTimer...
2023-12-09T11:00:27.441Z [Rudder] debug: batch size is 6
2023-12-09T11:00:47.701Z [Rudder] debug: in flush
2023-12-09T11:00:47.701Z [Rudder] debug: cancelling existing timer...
2023-12-09T11:00:47.702Z [Rudder] debug: queue is empty, nothing to flush
2023-12-09T17:00:07.441Z [Rudder] debug: no existing flush timer, creating new one
2023-12-09T17:00:27.442Z [Rudder] debug: in flush
2023-12-09T17:00:27.443Z [Rudder] debug: cancelling existing flushTimer...
2023-12-09T17:00:27.444Z [Rudder] debug: batch size is 6
2023-12-09T17:00:47.649Z [Rudder] debug: in flush
2023-12-09T17:00:47.649Z [Rudder] debug: cancelling existing timer...
2023-12-09T17:00:47.649Z [Rudder] debug: queue is empty, nothing to flush
2023-12-09T23:00:07.441Z [Rudder] debug: no existing flush timer, creating new one
2023-12-09T23:00:27.444Z [Rudder] debug: in flush
2023-12-09T23:00:27.444Z [Rudder] debug: cancelling existing flushTimer...
2023-12-09T23:00:27.444Z [Rudder] debug: batch size is 6
2023-12-09T23:00:47.706Z [Rudder] debug: in flush
2023-12-09T23:00:47.706Z [Rudder] debug: cancelling existing timer...
2023-12-09T23:00:47.707Z [Rudder] debug: queue is empty, nothing to flush
2023-12-10T05:00:07.441Z [Rudder] debug: no existing flush timer, creating new one
2023-12-10T05:00:27.444Z [Rudder] debug: in flush
2023-12-10T05:00:27.444Z [Rudder] debug: cancelling existing flushTimer...
2023-12-10T05:00:27.445Z [Rudder] debug: batch size is 10
2023-12-10T05:00:47.916Z [Rudder] debug: in flush
2023-12-10T05:00:47.916Z [Rudder] debug: cancelling existing timer...
2023-12-10T05:00:47.916Z [Rudder] debug: queue is empty, nothing to flush
I now see a deprecation warning, but I don't think it should matter.
Database info
This is the database info related to the running workflows:
psql (11.16 (Debian 11.16-1.pgdg90+1))
Type "help" for help.
n8n=> select * from execution_entity where id = 7617;
id | finished | mode | retryOf | retrySuccessId | startedAt | stoppedAt | waitTill | status | workflowId | deletedAt
------+----------+------------+---------+----------------+----------------------------+-----------+----------+--------+------------------+-----------
7617 | f | integrated | | | 2023-12-08 07:23:56.043+00 | | | new | kVViRcmElddbWnvh |
(1 row)
n8n=> select * from execution_entity where id = 7620;
id | finished | mode | retryOf | retrySuccessId | startedAt | stoppedAt | waitTill | status | workflowId | deletedAt
------+----------+------------+---------+----------------+----------------------------+-----------+----------+--------+------------------+-----------
7620 | f | integrated | | | 2023-12-08 07:24:58.313+00 | | | new | bjGeTJmgkCB2B45l |
(1 row)
Just to be sure, I've checked on a finished execution:
n8n=> select * from execution_entity where id = 7808;
id | finished | mode | retryOf | retrySuccessId | startedAt | stoppedAt | waitTill | status | workflowId | deletedAt
------+----------+------------+---------+----------------+---------------------------+----------------------------+----------+---------+------------------+-----------
7808 | t | integrated | | | 2023-12-10 08:00:02.89+00 | 2023-12-10 08:00:14.643+00 | | success | CN2UvfSE6TqqIgeI |
(1 row)
Current Status
For now, neither the timeout setting in each workflow nor the UI button to manually stop each execution seems to work. I saw in the n8n documentation an environment variable to set a max timeout for all workflows, but I'm not using it. Any other tests I should run?
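(For reference, and if I read the docs correctly, it would be something like this in the compose environment; I haven't set either of them:)
# Default timeout for all workflows, in seconds (-1 disables it)
- EXECUTIONS_TIMEOUT=600
# Maximum timeout a single workflow is allowed to set for itself, in seconds
- EXECUTIONS_TIMEOUT_MAX=3600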
Hey @pablorq,
Can you share the workflows that are causing the issues so we can test them? Our timeout is likely not kicking in because something is blocking it, possibly a Code node?
Hi @Joffcom !
I do use some Code nodes, most of them in Python.
How would you like me to share the workflows? Export and paste them, or some other way?
Hey @pablorq,
Export and paste would be ideal.
Detecting the issue origin
I've been doing some tests to narrow the problem down to a specific workflow. This workflow is "xxx - Create script" and it is called by the "xxx - Save list" workflow, which in turn is called by another one. That's why two (or more) workflows end up with an FRE (Forever Running Execution).
This workflow also uses 2 external Node libraries: @mozilla/readability and jsdom (along with other nodes).
Today I've split that workflow into two, further isolating the Code node that uses those libraries from the rest of the nodes, to try to pin down the cause of the problem.
Timeout issue
But beyond the root cause of this problem, there are other issues.
One of them is the timeout. OK, maybe there's something in the Code node that keeps a node running. But there's a specific timeout setting in the workflow precisely to avoid a node running forever.
And that timeout isn't working.
Can't cancel issue
Another problem is cancelling an FRE. There's a convenient "Cancel" button on the execution, but it doesn't work either.
In summary
In the end, there are 2 emergency measures to deal with the FRE issue, and neither of them works.
This is the last log file: n8n-jsdom-fre-231214-0835.log
I hope it helps to find the root cause of the FRE, and also to solve the timeout and cancel issues.
Hey @pablorq,
The workflow Timeout option only kicks in once a node has finished, so we won't interrupt a node that is currently running. This means that if you have a node that is running and gets stuck, you will run into this and the timeout option won't help. In this case, to fix the long-running executions we will need to work out which node is causing it.
Hi @Joffcom !
Timeout
OK. So the timeout is checked between nodes. I guess the timeout environment variable works the same way, doesn't it?
If it does, I understand that it is not a solution either.
Cancel Execution button
It doesn't work either. I would like to think that this cancel button interrupts the running workflow. Or does it not work that way?
If this button is also handled between nodes, like the timeout, I think it should be reviewed so it can kill the running workflow.
Without this option, the only way to "stop" an FRE is restarting the stack/server, which is not a good idea in production.
Root cause
I'm doing more tests. Right now I'm waiting for the next FRE to narrow the issue down to a few nodes.
By the way, did you find anything in the logs?
Root cause found!
...but I can't believe it!
It's an HTTP Request.
Screenshots:
This is the workflow: RecaLinux___Episode___Create_script___Get_Webpage.json
Could being a nested sub-workflow affect this? The parent workflow is RecaLinux - Episode - RSS posts 9to5, which calls RecaLinux - Episode - Save list, which calls RecaLinux - Episode - Create script. Normally the nesting would end there, but for the test I split this last workflow and created RecaLinux - Episode - Get Webpage, so this one is now the last one called.
Anyway, it failed with 3 levels of nested workflows, and also with 4.
As you can see, there's no special configuration in the HTTP Request node; it's just a GET to fetch the webpage.
Any ideas?
Yes, I've got the same issue on 1.20.0 in Docker queue mode now. Executions sometimes show up as running forever. Trying to stop them results in:
**Problem stopping execution**
Could not stop job because it is no longer in queue.
I was able to manually reproduce the issue for a few minutes.
It seems that the server didn't respond to the HTTP Request node, which then simply keeps waiting for a response.
Executions page:
HTTP node:
I installed curl inside the container and tested this:
curl -vvv -X GET https://9to5linux.com/firefox-122-enters-public-beta-testing-with-improved-built-in-translation-feature
Yes, the -X GET is not necessary since GET is curl's default, but it's there to "force" it to act like the HTTP node. The command works fine.
After a few minutes of testing, the HTTP node started to work again. This means the pending executions keep running, but new node executions do get responses from the server.
As curl is not installed by default, I guess n8n uses another way to send HTTP requests, and maybe that's the problem.
Anyway, having some kind of watchdog in case the server never responds could be a good idea.
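For comparison, curl can enforce a client-side timeout on the same request, which is roughly the watchdog behaviour I'd expect from the HTTP node:
# Give up after 10 s if the connection can't be established, and after 120 s in total
curl -vvv --connect-timeout 10 --max-time 120 https://9to5linux.com/firefox-122-enters-public-beta-testing-with-improved-built-in-translation-feature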
Oops!
I found the timeout option in the HTTP node. I've added a timeout of 2 minutes to see if it helps!
Well, I'm afraid the timeout option in the HTTP node doesn't work:
Hey @pablorq,
Can you share a minimal workflow that reproduces the issue with the HTTP Request node? I have just tested calling the 9to5linux URL and it is working with no issues.
You are correct in one of your earlier comments: we do not use curl, we use a Node.js package called Axios for all of our HTTP connections. I wonder if for some reason we have not implemented the timeout properly in one of the HTTP Request node versions 🤔
I know we did very recently (in the last couple of versions) upgrade Axios, but that was after 1.20.0 I think. This is the workflow I am running on 1.25.0, which seems to be working as expected.
With the cancel button it depends on which node the workflow is running, but typically any stop event happens during the checks between nodes.
Hi @Joffcom !
I've been meaning to report back on this issue.
A couple of weeks ago I updated to v1.21.1. Since then, I've had no more problems with FREs.
It could be the Axios package that you've updated; for now, it seems to be fixed.
Thank you!