
Executions running forever - Can't cancel

Open pablorq opened this issue 1 year ago • 26 comments

Describe the bug Some workflow executions are running forever and can't be canceled or stopped. The behavior seems to be random, because it affects any workflow.

Most of the executions go well, but some keep running forever. All workflows have a timeout in their settings.

It has been happening across different n8n versions, up to 1.15.2 (the last time it happened).

Attached screenshots are below.

To Reproduce Steps to reproduce the behavior:

  • Unknown. No steps found to reproduce it.

Expected behavior

  1. Do not keep running.
  2. Respect the timeout settings.
  3. Raise an error.

Environment (please complete the following information):

  • OS: Ubuntu Linux 23.10
  • n8n Version: 1.15.2
  • Node.js Version: ?
  • Database system: postgresql
  • Operation mode: ? [e.g. own; operation modes are own, main and queue. Default is main]

Additional context Screenshot: Running forever Screenshot 2023-11-17 at 15-33-43 n8n - Executions 02

Screenshot: Can't cancel Screenshot 2023-11-17 at 15-34-40 n8n - Executions 02

pablorq avatar Nov 17 '23 15:11 pablorq

Already checked #5561 and #7442, but this doesn't seem to be the same issue.

pablorq avatar Nov 17 '23 15:11 pablorq

The only way I've found to stop them is restarting the Docker stack (using Portainer).

pablorq avatar Nov 17 '23 15:11 pablorq

Hey @pablorq,

That error suggests the execution may already have been stopped in the background. Can you share more about your setup? Are you using a reverse proxy, or do you have Cloudflare or something similar involved?

Next time you see this issue, it would be interesting to connect using the direct IP of your n8n container and see if you get the same thing. Are you also running in queue mode, or do you just have the one n8n instance running?

Joffcom avatar Nov 17 '23 16:11 Joffcom

Hi @Joffcom,

I'm running n8n in a simple way, just a local stack with one instance.

When you said "stopped in the background", do you mean that the task was already stopped but not reported to the UI? Is there any check or debug that I can run inside the container?

This is my docker-compose-style file (I'm using Portainer):

# n8n with persistent volume Postgres
# https://raw.githubusercontent.com/n8n-io/n8n/master/docker/compose/withPostgres/docker-compose.yml

version: '3.8'

volumes:
  db_data:
  main_data:

services:
  postgres:
    image: postgres:11
    restart: always
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
      - POSTGRES_NON_ROOT_USER=${POSTGRES_NON_ROOT_USER}
      - POSTGRES_NON_ROOT_PASSWORD=${POSTGRES_NON_ROOT_PASSWORD}
    volumes:
      - db_data:/var/lib/postgresql/data
      - ${N8N_LOCAL_FILES}/n8n-config/init-data.sh:/docker-entrypoint-initdb.d/init-data.sh
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -h localhost -U ${POSTGRES_USER} -d ${POSTGRES_DB}']
      interval: 5s
      timeout: 5s
      retries: 10

  main:
    image: docker.n8n.io/n8nio/n8n:${N8N_VERSION}
    restart: always
    environment:
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_PORT=5432
      - DB_POSTGRESDB_DATABASE=${POSTGRES_DB}
      - DB_POSTGRESDB_USER=${POSTGRES_NON_ROOT_USER}
      - DB_POSTGRESDB_PASSWORD=${POSTGRES_NON_ROOT_PASSWORD}
      - GENERIC_TIMEZONE=Europe/Madrid
      - TZ=Europe/Madrid
      - NODE_FUNCTION_ALLOW_EXTERNAL=*
    ports:
      - 5678:5678
    links:
      - postgres
    volumes:
      - main_data:/home/node/.n8n
      - ${N8N_LOCAL_FILES}:/files
    depends_on:
      postgres:
        condition: service_healthy
    user: root
    entrypoint: sh -c
    command:
      - |
        npm install -g @mozilla/readability jsdom
        su node -c "tini -s -- /docker-entrypoint.sh &"
        sleep infinity

    extra_hosts:
    # Access host localhost
      - "local.host:host-gateway"


# environment
# N8N_VERSION=1.15.2

As you can see, I had to modify the entrypoint to install some Node.js modules for the Code node, but it shouldn't break anything.

pablorq avatar Nov 23 '23 11:11 pablorq

By the way, last week I updated to v1.16.0. So far, everything is working OK.

pablorq avatar Nov 23 '23 11:11 pablorq

Hey @pablorq,

That is good news, so this was fixed by updating?

Joffcom avatar Nov 27 '23 09:11 Joffcom

I'm afraid not. It's happening again:

Screenshot 2023-11-30 at 08-34-37 n8n - Executions

I won't restart the container, so that I can run some tests.

Can you please tell me what to test?

Remember that all workflows have a timeout of 5 or 10 minutes, which doesn't seem to be respected.

pablorq avatar Nov 30 '23 07:11 pablorq

@pablorq I have experienced a similar issue - do you have a node being used by all of those workflows that may be the source?

It seems like if a node breaks or throws in an unexpected way, it can fail to return execution control to the workflow. Instead, the node is stuck as 'executing' and the workflow waits forever for it to finish.

I'm working on a custom programmatic node, and ensuring that the node fails gracefully has been a 'gotcha'.
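
For illustration, here is a minimal sketch (hypothetical node, written against the public n8n-workflow helpers) of what "failing gracefully" can look like in a programmatic node's execute() method, so that an unexpected exception is reported instead of leaving the node stuck:

import { IDataObject, IExecuteFunctions, INodeExecutionData, NodeOperationError } from 'n8n-workflow';

// Hypothetical execute() body: each item is wrapped in try/catch so an
// unexpected exception becomes a proper node error (or an error item when
// "Continue On Fail" is enabled) instead of leaving the node 'executing'.
export async function execute(this: IExecuteFunctions): Promise<INodeExecutionData[][]> {
  const items = this.getInputData();
  const returnData: INodeExecutionData[] = [];

  for (let i = 0; i < items.length; i++) {
    try {
      const result = await doSomethingRisky(items[i].json); // hypothetical per-item work
      returnData.push({ json: result });
    } catch (error) {
      if (this.continueOnFail()) {
        // Report the failure as data so the rest of the workflow can keep going
        returnData.push({ json: { error: (error as Error).message } });
        continue;
      }
      // Re-throw as a node error so execution control returns to the workflow
      throw new NodeOperationError(this.getNode(), error as Error, { itemIndex: i });
    }
  }

  return [returnData];
}

// Placeholder for whatever the node actually does per item.
async function doSomethingRisky(data: IDataObject): Promise<IDataObject> {
  return data;
}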

jmaurer1994 avatar Nov 30 '23 18:11 jmaurer1994

@jmaurer1994 yes, the "... Create script" workflow is the last one called in a chain.

But the point is that it (and all of them) has an Error Workflow set to report errors, and a timeout to keep it from running forever.

What matters here is that the timeout mechanism isn't working well.

That's why I've asked for instructions to run some tests.

(Note: this is not affecting my overall tasks. I've taken some countermeasures with this issue in mind.)

pablorq avatar Nov 30 '23 21:11 pablorq

Hey @pablorq,

If you open one of the workflows and try to cancel it, what happens? Do you get the same message about it not being possible?

Can you also check your database to see if it is still showing the execution as running? Something like select * from execution_entity where id = 6684; would show one of them.

Joffcom avatar Dec 01 '23 12:12 Joffcom

This is the status of all 6 of them:


n8n=> select * from execution_entity where id = 6684;
  id  | finished |    mode    | retryOf | retrySuccessId |         startedAt         | stoppedAt | waitTill | status |    workflowId    | deletedAt 
------+----------+------------+---------+----------------+---------------------------+-----------+----------+--------+------------------+-----------
 6684 | f        | integrated |         |                | 2023-11-29 23:49:05.63+00 |           |          | new    | bjGeTJmgkCB2B45l | 
(1 row)

n8n=> select * from execution_entity where id = 6683;
  id  | finished |    mode    | retryOf | retrySuccessId |         startedAt         | stoppedAt | waitTill | status |    workflowId    | deletedAt 
------+----------+------------+---------+----------------+---------------------------+-----------+----------+--------+------------------+-----------
 6683 | f        | integrated |         |                | 2023-11-29 23:49:02.97+00 |           |          | new    | kVViRcmElddbWnvh | 
(1 row)

n8n=> select * from execution_entity where id = 6682;
  id  | finished |  mode   | retryOf | retrySuccessId |         startedAt          | stoppedAt | waitTill | status  |    workflowId    | deletedAt 
------+----------+---------+---------+----------------+----------------------------+-----------+----------+---------+------------------+-----------
 6682 | f        | trigger |         |                | 2023-11-29 23:49:02.942+00 |           |          | running | ikCLxLlGsCGWEwdA | 
(1 row)

n8n=> select * from execution_entity where id = 6665;
  id  | finished |    mode    | retryOf | retrySuccessId |         startedAt          | stoppedAt | waitTill | status |    workflowId    | deletedAt 
------+----------+------------+---------+----------------+----------------------------+-----------+----------+--------+------------------+-----------
 6665 | f        | integrated |         |                | 2023-11-29 20:40:07.133+00 |           |          | new    | bjGeTJmgkCB2B45l | 
(1 row)

n8n=> select * from execution_entity where id = 6664;
  id  | finished |    mode    | retryOf | retrySuccessId |         startedAt          | stoppedAt | waitTill | status |    workflowId    | deletedAt 
------+----------+------------+---------+----------------+----------------------------+-----------+----------+--------+------------------+-----------
 6664 | f        | integrated |         |                | 2023-11-29 20:40:03.071+00 |           |          | new    | kVViRcmElddbWnvh | 
(1 row)

n8n=> select * from execution_entity where id = 6663;
  id  | finished |  mode   | retryOf | retrySuccessId |         startedAt          | stoppedAt | waitTill | status  |    workflowId    | deletedAt 
------+----------+---------+---------+----------------+----------------------------+-----------+----------+---------+------------------+-----------
 6663 | f        | trigger |         |                | 2023-11-29 20:40:03.045+00 |           |          | running | ikCLxLlGsCGWEwdA | 
(1 row)

I recorded this video while trying to stop the workflows: Screencast from 2023-12-01 15-25-52.webm

In the video you can see the "Problem stopping execution" warning. vlcsnap-2023-12-01-15h30m37s095

After trying to stop them, I checked the DB and everything is still the same.

Any other test?

pablorq avatar Dec 01 '23 14:12 pablorq

Hey @pablorq,

It looks like the job has not started yet, which is very odd. I was sure we had fixed this recently as well 🤔 Can you go up to 1.18.1 and enable debug logging?

Joffcom avatar Dec 04 '23 13:12 Joffcom

Upgraded to v1.18.2. I'll wait to see whether this happens again.

pablorq avatar Dec 07 '23 10:12 pablorq

Debug

I've set up debug logging following the n8n documentation, changing the log file location:

      # Set the logging level to 'debug'
      - N8N_LOG_LEVEL=debug
      # Set log output to both console and a log file
      - N8N_LOG_OUTPUT=console,file
      # Set a save location for the log file
      - N8N_LOG_FILE_LOCATION=/home/node/.n8n/logs/n8n.log
      # Set a 50 MB maximum size for each log file
      - N8N_LOG_FILE_MAXSIZE=50
      # Set 60 as the maximum number of log files to be kept
      - N8N_LOG_FILE_MAXCOUNT=60

Now I have logs in the console (container log), but I don't have a log file.

After looking in detail, the "logs" directory was created with owner root:node. I don't know if it was created by the app, Docker, or by me. Maybe this prevented the node user from creating the log file. I changed it manually to node:node.

Forever Running Executions

I have 2 running executions right now: Screenshot 2023-12-10 at 08-44-10 n8n - Executions

They have been running for more than 2800 min, which exceeds the 10 min timeout in each workflow's settings.

Of course, trying to stop those executions didn't work: Screenshot 2023-12-10 at 09-12-56 n8n - Executions

Logs

This time, we have the logs from the console:

Console log
2023-12-07T14:46:37.119Z [Rudder] debug: cancelling existing timer...
2023-12-07T14:46:37.119Z [Rudder] debug: cancelling existing flushTimer...
2023-12-07T14:46:37.119Z [Rudder] debug: batch size is 1
2023-12-07T14:46:53.006Z | debug    | Polling trigger initiated for workflow "RecaLinux - Episode - RSS posts 9to5" "{\n  workflowName: 'RecaLinux - Episode - RSS posts 9to5',\n  workflowId: 'ikCLxLlGsCGWEwdA',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:46:57.312Z [Rudder] debug: in flush
2023-12-07T14:46:57.313Z [Rudder] debug: cancelling existing timer...
2023-12-07T14:46:57.313Z [Rudder] debug: queue is empty, nothing to flush
2023-12-07T14:47:14.920Z | debug    | Wait tracker querying database for waiting executions "{ file: 'WaitTracker.js', function: 'getWaitingExecutions' }"
2023-12-07T14:47:23.010Z | debug    | Polling trigger initiated for workflow "RecaLinux - Episode - RSS posts linuxgameconsortium" "{\n  workflowName: 'RecaLinux - Episode - RSS posts linuxgameconsortium',\n  workflowId: 'T0QLDQixV0i2RRZz',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:24.895Z | debug    | Running node "Wait" finished successfully "{\n  node: 'Wait',\n  workflowId: 'kVViRcmElddbWnvh',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:24.897Z | debug    | Start processing node "call_create_script" "{\n  node: 'call_create_script',\n  workflowId: 'kVViRcmElddbWnvh',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:24.898Z | debug    | Running node "call_create_script" started "{\n  node: 'call_create_script',\n  workflowId: 'kVViRcmElddbWnvh',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:24.958Z | verbose  | Workflow execution started "{\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.verbose'\n}"
2023-12-07T14:47:24.963Z | debug    | Start processing node "When clicking "Execute Workflow"" "{\n  node: 'When clicking \"Execute Workflow\"',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:24.964Z | debug    | Running node "When clicking "Execute Workflow"" started "{\n  node: 'When clicking \"Execute Workflow\"',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:24.964Z | debug    | Running node "When clicking "Execute Workflow"" finished successfully "{\n  node: 'When clicking \"Execute Workflow\"',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:24.965Z | debug    | Start processing node "source" "{\n  node: 'source',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:24.966Z | debug    | Running node "source" started "{\n  node: 'source',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:24.967Z | debug    | Running node "source" finished successfully "{\n  node: 'source',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:24.968Z | debug    | Start processing node "HTTP Request" "{\n  node: 'HTTP Request',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:24.969Z | debug    | Running node "HTTP Request" started "{\n  node: 'HTTP Request',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:24.975Z | debug    | Proxying request to axios "{ file: 'LoggerProxy.js', function: 'exports.debug' }"
2023-12-07T14:47:25.198Z | debug    | Running node "HTTP Request" finished successfully "{\n  node: 'HTTP Request',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:25.201Z | debug    | Start processing node "read_article" "{\n  node: 'read_article',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:25.207Z | debug    | Running node "read_article" started "{\n  node: 'read_article',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:27.160Z | debug    | Running node "read_article" finished successfully "{\n  node: 'read_article',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:27.163Z | debug    | Start processing node "tech_summary" "{\n  node: 'tech_summary',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:27.164Z | debug    | Running node "tech_summary" started "{\n  node: 'tech_summary',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:34.008Z | debug    | Polling trigger initiated for workflow "RecaLinux - Episode - RSS posts omgubuntu" "{\n  workflowName: 'RecaLinux - Episode - RSS posts omgubuntu',\n  workflowId: 'OyDUydZ7i8DPCcp2',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
2023-12-07T14:47:35.932Z | debug    | Running node "tech_summary" finished successfully "{\n  node: 'tech_summary',\n  workflowId: 'bjGeTJmgkCB2B45l',\n  file: 'LoggerProxy.js',\n  function: 'exports.debug'\n}"
Deprecation warning: value provided is not in a recognized RFC2822 or ISO format. moment construction falls back to js Date(), which is not reliable across all browsers and versions. Non RFC2822/ISO date formats are discouraged. Please refer to http://momentjs.com/guides/#/warnings/js-date/ for more info.
Arguments: 
[0] _isAMomentObject: true, _isUTC: false, _useUTC: false, _l: undefined, _i: Thu Dec 07 2023 16:00:00 GMT+0100 (Central European Standard Time), _f: undefined, _strict: undefined, _locale: [object Object]
Error: 
    at Function.createFromInputFallback (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:324:25)
    at configFromString (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:2550:19)
    at configFromInput (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:2993:13)
    at prepareConfig (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:2976:13)
    at createFromConfig (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:2943:44)
    at createLocalOrUTC (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:3037:16)
    at createLocal (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:3041:16)
    at hooks (/usr/local/lib/node_modules/n8n/node_modules/moment/moment.js:16:29)
    at Object.execute (/usr/local/lib/node_modules/n8n/node_modules/n8n-nodes-base/dist/nodes/DateTime/V1/DateTimeV1.node.js:409:92)
    at Workflow.runNode (/usr/local/lib/node_modules/n8n/node_modules/n8n-workflow/dist/Workflow.js:672:42)
2023-12-08T11:00:07.317Z [Rudder] debug: no existing flush timer, creating new one
2023-12-08T11:00:27.321Z [Rudder] debug: in flush
2023-12-08T11:00:27.321Z [Rudder] debug: cancelling existing flushTimer...
2023-12-08T11:00:27.321Z [Rudder] debug: batch size is 12
2023-12-08T11:00:47.740Z [Rudder] debug: in flush
2023-12-08T11:00:47.740Z [Rudder] debug: cancelling existing timer...
2023-12-08T11:00:47.740Z [Rudder] debug: queue is empty, nothing to flush
2023-12-08T17:00:07.315Z [Rudder] debug: no existing flush timer, creating new one
2023-12-08T17:00:27.316Z [Rudder] debug: in flush
2023-12-08T17:00:27.316Z [Rudder] debug: cancelling existing flushTimer...
2023-12-08T17:00:27.316Z [Rudder] debug: batch size is 6
2023-12-08T17:00:47.564Z [Rudder] debug: in flush
2023-12-08T17:00:47.564Z [Rudder] debug: cancelling existing timer...
2023-12-08T17:00:47.564Z [Rudder] debug: queue is empty, nothing to flush
2023-12-08T23:00:07.315Z [Rudder] debug: no existing flush timer, creating new one
2023-12-08T23:00:27.316Z [Rudder] debug: in flush
2023-12-08T23:00:27.316Z [Rudder] debug: cancelling existing flushTimer...
2023-12-08T23:00:27.317Z [Rudder] debug: batch size is 6
2023-12-08T23:00:47.560Z [Rudder] debug: in flush
2023-12-08T23:00:47.561Z [Rudder] debug: cancelling existing timer...
2023-12-08T23:00:47.561Z [Rudder] debug: queue is empty, nothing to flush
2023-12-09T05:00:07.314Z [Rudder] debug: no existing flush timer, creating new one
2023-12-09T05:00:27.317Z [Rudder] debug: in flush
2023-12-09T05:00:27.317Z [Rudder] debug: cancelling existing flushTimer...
2023-12-09T05:00:27.317Z [Rudder] debug: batch size is 6
2023-12-09T05:00:47.512Z [Rudder] debug: in flush
2023-12-09T05:00:47.512Z [Rudder] debug: cancelling existing timer...
2023-12-09T05:00:47.512Z [Rudder] debug: queue is empty, nothing to flush
2023-12-09T11:00:07.439Z [Rudder] debug: no existing flush timer, creating new one
2023-12-09T11:00:27.440Z [Rudder] debug: in flush
2023-12-09T11:00:27.441Z [Rudder] debug: cancelling existing flushTimer...
2023-12-09T11:00:27.441Z [Rudder] debug: batch size is 6
2023-12-09T11:00:47.701Z [Rudder] debug: in flush
2023-12-09T11:00:47.701Z [Rudder] debug: cancelling existing timer...
2023-12-09T11:00:47.702Z [Rudder] debug: queue is empty, nothing to flush
2023-12-09T17:00:07.441Z [Rudder] debug: no existing flush timer, creating new one
2023-12-09T17:00:27.442Z [Rudder] debug: in flush
2023-12-09T17:00:27.443Z [Rudder] debug: cancelling existing flushTimer...
2023-12-09T17:00:27.444Z [Rudder] debug: batch size is 6
2023-12-09T17:00:47.649Z [Rudder] debug: in flush
2023-12-09T17:00:47.649Z [Rudder] debug: cancelling existing timer...
2023-12-09T17:00:47.649Z [Rudder] debug: queue is empty, nothing to flush
2023-12-09T23:00:07.441Z [Rudder] debug: no existing flush timer, creating new one
2023-12-09T23:00:27.444Z [Rudder] debug: in flush
2023-12-09T23:00:27.444Z [Rudder] debug: cancelling existing flushTimer...
2023-12-09T23:00:27.444Z [Rudder] debug: batch size is 6
2023-12-09T23:00:47.706Z [Rudder] debug: in flush
2023-12-09T23:00:47.706Z [Rudder] debug: cancelling existing timer...
2023-12-09T23:00:47.707Z [Rudder] debug: queue is empty, nothing to flush
2023-12-10T05:00:07.441Z [Rudder] debug: no existing flush timer, creating new one
2023-12-10T05:00:27.444Z [Rudder] debug: in flush
2023-12-10T05:00:27.444Z [Rudder] debug: cancelling existing flushTimer...
2023-12-10T05:00:27.445Z [Rudder] debug: batch size is 10
2023-12-10T05:00:47.916Z [Rudder] debug: in flush
2023-12-10T05:00:47.916Z [Rudder] debug: cancelling existing timer...
2023-12-10T05:00:47.916Z [Rudder] debug: queue is empty, nothing to flush

I now see a deprecation warning, but I don't think it should matter here.

Database info

This is the database info related to the running workflows:

psql (11.16 (Debian 11.16-1.pgdg90+1))
Type "help" for help.

n8n=> select * from execution_entity where id = 7617;
  id  | finished |    mode    | retryOf | retrySuccessId |         startedAt          | stoppedAt | waitTill | status |    workflowId    | deletedAt 
------+----------+------------+---------+----------------+----------------------------+-----------+----------+--------+------------------+-----------
 7617 | f        | integrated |         |                | 2023-12-08 07:23:56.043+00 |           |          | new    | kVViRcmElddbWnvh | 
(1 row)

n8n=> select * from execution_entity where id = 7620;
  id  | finished |    mode    | retryOf | retrySuccessId |         startedAt          | stoppedAt | waitTill | status |    workflowId    | deletedAt 
------+----------+------------+---------+----------------+----------------------------+-----------+----------+--------+------------------+-----------
 7620 | f        | integrated |         |                | 2023-12-08 07:24:58.313+00 |           |          | new    | bjGeTJmgkCB2B45l | 
(1 row)

Just to be sure, I've also checked a finished execution:

n8n=> select * from execution_entity where id = 7808;
  id  | finished |    mode    | retryOf | retrySuccessId |         startedAt         |         stoppedAt          | waitTill | status  |    workflowId    | deletedAt 
------+----------+------------+---------+----------------+---------------------------+----------------------------+----------+---------+------------------+-----------
 7808 | t        | integrated |         |                | 2023-12-10 08:00:02.89+00 | 2023-12-10 08:00:14.643+00 |          | success | CN2UvfSE6TqqIgeI | 
(1 row)

Actual Status

For now, the timeout setting in each workflow doesn't seem to work, and neither does the UI button to manually stop each execution. I saw in the n8n documentation an environment variable to set a maximum timeout for all workflows, but I'm not using it. Any other test to do?

pablorq avatar Dec 10 '23 08:12 pablorq

Hey @pablorq,

Can you share the workflows that are causing the issues so we can test them? Our timeout is likely not kicking in because something is blocking it, possibly a Code node?

Joffcom avatar Dec 11 '23 08:12 Joffcom

Hi @Joffcom !

I do use some Code nodes, most of them in Python.

How do you want me to share the workflows? Export them and paste them here? Or some other way?

pablorq avatar Dec 11 '23 15:12 pablorq

Hey @pablorq,

Export and paste would be ideal.

Joffcom avatar Dec 12 '23 10:12 Joffcom

Detecting the issue origin

I've been doing some tests to narrow the problem down to a specific workflow. This workflow is "xxx - Create script", which is called by the "xxx - Save list" workflow, which is called by another one. That's why both (or more) workflows get a FRE (Forever Running Execution).

This workflow also uses 2 external Node.js libraries: @mozilla/readability and jsdom (along with other nodes).

Today I split that workflow into two, further isolating the Code node that uses those libraries from the rest of the nodes, to try to pin down the cause of the problem.
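
To give an idea, the isolated Code node does roughly this (a simplified, hypothetical sketch, not the exact node; the NODE_FUNCTION_ALLOW_EXTERNAL setting in the compose file above is what allows requiring the external packages):

// Hypothetical Code node (run once for all items): extract readable article
// text from HTML fetched by a previous HTTP Request node. The field names
// used here (item.json.data, item.json.url) are assumptions for illustration.
const { JSDOM } = require('jsdom');
const { Readability } = require('@mozilla/readability');

const results = [];
for (const item of $input.all()) {
  const dom = new JSDOM(item.json.data, { url: item.json.url });
  const article = new Readability(dom.window.document).parse();
  results.push({ json: { title: article?.title, text: article?.textContent } });
}
return results;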

Timeout issue

But beyond the root cause of this problem, there are other issues.

One of them is related to the timeout. OK, maybe there's something in the Code node that keeps a node running. But there's a specific timeout setting in the workflow precisely to stop a node from running forever.

And that timeout isn't working.

Can't cancel issue

Another problem is canceling a FRE. There's a convenient "Cancel" button on the execution, but it doesn't work either.

In summary

In the end, there are 2 emergency measures to mitigate the FRE issue, and neither of them works.

This is the last log file: n8n-jsdom-fre-231214-0835.log

I hope it helps find the root cause of the FRE, and also solve the timeout and cancel issues.

pablorq avatar Dec 14 '23 14:12 pablorq

Hey @pablorq,

The workflow Timeout option only kicks in once a node has finished, so we won't interrupt a node that is currently working. This means that if you have a node that is running and stuck, you will run into this and the timeout option won't help. In this case, to fix the long-running executions, we will need to work out which node is causing it.
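
To illustrate the idea (a rough conceptual sketch, not our actual code), the elapsed time can only be compared against the timeout between node runs, so a node that never returns is never interrupted:

// Conceptual sketch of a cooperative, between-node timeout check.
async function runWorkflow(nodes: Array<() => Promise<void>>, timeoutMs: number) {
  const startedAt = Date.now();
  for (const runNode of nodes) {
    // The timeout can only be evaluated here, between nodes...
    if (Date.now() - startedAt > timeoutMs) {
      throw new Error('Workflow execution timed out');
    }
    // ...so if this await never resolves (e.g. an HTTP request that never gets
    // a response and has no client-side timeout), the loop never reaches the
    // next check and the execution appears to run forever.
    await runNode();
  }
}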

Joffcom avatar Dec 14 '23 15:12 Joffcom

Hi @Joffcom !

Timeout

OK. So the timeout is checked between nodes. I guess the timeout environment variable works in the same way. Does it?

If it also works that way, I understand that it is not a solution either.

Cancel Execution button

It doesn't work either. I would like to think that this cancel button interrupts the running workflow. Or doesn't it work that way?

If this button also acts between nodes, like the timeout, I think it should be reworked to kill the running workflow.

Without this option, the only way to "stop" a FRE is restarting the stack/server, which is not a good idea in production.

Root cause

I'm doing more tests. Right now I'm waiting for the next FRE to narrow the issue down to a few nodes.

By the way, did you find anything in the logs?

pablorq avatar Dec 15 '23 15:12 pablorq

Root cause Found!

... But I can't believe it!

It's an HTTP Request.

Screenshots: Screenshot 2023-12-19 at 18-40-28 n8n - Executions Screenshot 2023-12-19 at 18-41-47 n8n - ▶️ RecaLinux - Episode - Create script - Get Webpage

This is the workflow: RecaLinux___Episode___Create_script___Get_Webpage.json

Could the nesting of sub-workflows have an effect? The parent workflow is RecaLinux - Episode - RSS posts 9to5, which calls RecaLinux - Episode - Save list, which calls RecaLinux - Episode - Create script. Normally the nesting would end there, but for the test I split this last workflow and created RecaLinux - Episode - Get Webpage, so that one is now the last one called.

Anyway, it failed with 3 levels of nested workflows, and also with 4 levels of nesting.

As you can see, there's no special configuration in the HTTP Request; it's just a GET to fetch the webpage.

Any ideas?

pablorq avatar Dec 19 '23 18:12 pablorq

Yes, I'm getting the same issue on 1.20.0 in Docker queue mode now. Executions sometimes show up as running forever. Trying to stop them results in:

**Problem stopping execution**
Could not stop job because it is no longer in queue.

prononext avatar Dec 21 '23 11:12 prononext

I was able to manually reproduce the issue for a few minutes.

It seems that the server didn't respond to the HTTP Request node, which then simply keeps waiting for a response.

Executions page: Screenshot 2023-12-27 at 07-54-45 n8n - Executions

HTTP node: Screenshot 2023-12-27 at 07-32-10 n8n - 🔄 Test rss trigger

I installed curl inside the container and tested this:

curl -vvv -X GET https://9to5linux.com/firefox-122-enters-public-beta-testing-with-improved-built-in-translation-feature

Yes, the -X GET is not necessary since GET is curl's default, but it's there to "force" it to act like the HTTP node. And the command works fine.

After a few minutes of testing, the HTTP node started to work again. This means the pending executions keep running, but new node executions do get results from the server.

As the curl command is not installed by default, I guess n8n uses another way to send HTTP requests, and maybe that's the problem.

Anyway, having some kind of watchdog in case the server never responds could be a good idea.
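
For example, something like this per-request watchdog is what I mean (a sketch using Node's built-in fetch and AbortController, just to illustrate the idea, not how n8n actually sends requests):

// Sketch: abort a request that gets no response within a time limit,
// instead of waiting forever. Uses the global fetch available in Node 18+.
async function fetchWithWatchdog(url: string, timeoutMs = 120_000): Promise<string> {
  const controller = new AbortController();
  const watchdog = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, { signal: controller.signal });
    return await response.text();
  } finally {
    clearTimeout(watchdog);
  }
}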

pablorq avatar Dec 27 '23 07:12 pablorq

Oops!

I found the timeout option in the HTTP node. I've added a timeout of 2 min to see if it helps!

pablorq avatar Dec 27 '23 09:12 pablorq

Well, I'm afraid the timeout option in the HTTP node doesn't work:

Screenshot 2023-12-28 at 16-35-58 n8n - Executions

pablorq avatar Jan 02 '24 15:01 pablorq

Hey @pablorq,

Can you share a minimal workflow that reproduces the issue with the HTTP Request node? I have just tested calling the 9to5linux URL and it is working with no issues.

You are correct in one of your earlier comments: we do not use curl, we use a Node.js package called Axios for all of our HTTP connections. I wonder if for some reason we have not implemented the timeout properly in one of the HTTP Request node versions 🤔
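
For reference, this is how a timeout is normally passed to Axios (a generic sketch, not the node's actual code). Axios defaults to timeout: 0, which means it waits for a response indefinitely, and that would match what you are seeing if the node's timeout value isn't being passed through:

import axios from 'axios';

// Generic Axios usage: with timeout 0 (the default) the request waits forever
// for a response; a non-zero timeout makes Axios abort and throw instead.
async function get(url: string, timeoutMs?: number) {
  const response = await axios.get(url, {
    timeout: timeoutMs ?? 0, // 0 = no timeout
  });
  return response.data;
}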

I know we very recently (in the last couple of versions) upgraded Axios, but that was after 1.20.0 I think. This is the workflow I am running on 1.25.0, which seems to be working as expected.

image

With the cancel button it depends on which node the workflow is running, but typically any stop event happens during the checks between nodes.

Joffcom avatar Jan 18 '24 15:01 Joffcom

Hi @Joffcom !

I've been meaning to report back on this issue.

A couple of weeks ago I updated to v1.21.1. Since then, I've had no more problems with FREs.

It could be the Axios package that you updated, so for now it seems to be fixed.

Thank you!

pablorq avatar Jan 26 '24 07:01 pablorq