Running executions cannot be stopped
On n8n in Docker, running in queue mode, version 1.22.6, I see various executions which cannot be stopped.
When hitting "Stop Executions" the confirmation appears, but the executions are still running.
Also, under the main Executions menu I see several executions running, and when switching to some of them, the running ones only light up for a second and then disappear (sadly I could not document this on video, as it happens too fast).
These are the ENVs on my Docker stack. Maybe someone could check whether they are still OK for a high-performance instance with workflows that sometimes need to run for about 60 minutes (see also the timeout sketch after the list):
DB_INIT_FILE=/opt/n8n/init-data.sh
N8N_LOCAL_STORAGE=/local-files
POSTGRESQL_VERSION_TAG=12.13
REDIS_VERSION_TAG=alpine
N8N_VERSION_TAG=1.22.6
N8N_MAIN_COMMAND=start
N8N_WORKER_COMMAND=worker --concurrency=10
N8N_WEBHOOK_COMMAND=webhook
N8N_PORT=5678
N8N_USER_MANAGEMENT_DISABLED=false
N8N_BASIC_AUTH_ACTIVE=true
N8N_DIAGNOSTICS_ENABLED=false
N8N_PERSONALIZATION_ENABLED=false
N8N_HIRING_BANNER_ENABLED=false
N8N_LOG_LEVEL=debug
N8N_DISABLE_PRODUCTION_MAIN_PROCESS=true
EXECUTIONS_MODE=queue
EXECUTIONS_DATA_SAVE_ON_SUCCESS=none
EXECUTIONS_DATA_SAVE_ON_ERROR=all
EXECUTIONS_DATA_PRUNE=true
EXECUTIONS_DATA_MAX_AGE=32
N8N_DEFAULT_BINARY_DATA_MODE=filesystem
N8N_AVAILABLE_BINARY_DATA_MODES=filesystem
DB_TYPE=postgresdb
DB_POSTGRESDB_PORT=5432
DB_POSTGRESDB_HOST=postgres
DB_LOGGING_MAX_EXECUTION_TIME=0
QUEUE_BULL_REDIS_PORT=6379
QUEUE_BULL_REDIS_HOST=redis
QUEUE_HEALTH_CHECK_ACTIVE=true
N8N_GRACEFUL_SHUTDOWN_TIMEOUT=600
QUEUE_WORKER_LOCK_DURATION=180000
QUEUE_WORKER_MAX_STALLED_COUNT=5
QUEUE_RECOVERY_INTERVAL=300
QUEUE_BULL_REDIS_TIMEOUT_THRESHOLD=180000
N8N_SKIP_WEBHOOK_DEREGISTRATION_SHUTDOWN=true
N8N_ENDPOINT_WEBHOOK=prod
NODE_OPTIONS="--max-old-space-size=8000"
NODE_FUNCTION_ALLOW_BUILTIN=*
NODE_FUNCTION_ALLOW_EXTERNAL=*
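One thing I'm not sure about is whether the 60-minute runs also need the execution timeout variables. As a rough sketch (the variable names are from the n8n docs; the values are just my guess and would need tuning):

# allow executions up to ~65 minutes before n8n aborts them
EXECUTIONS_TIMEOUT=3900
# upper limit a single workflow may set for its own timeout
EXECUTIONS_TIMEOUT_MAX=3900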
Hey @prononext,
Do you get the same issue in 1.24.1?
Is it safe to update to the next version on a production environment? I have downgraded to 1.22.1 and really strange things are happening, like:
- executions appear out of nowhere, claiming they have already been running for 10 minutes
- sub-workflows run simultaneously where previously only one ran at a time
- the indefinitely running executions still cannot be stopped
Running into major problems like this with n8n every 1-2 months is sadly really killing the project for me.
Hey @prononext,
The `next` version will be marked `latest` later today, so it should be OK. But as with any software used in a production environment, I would recommend running a test environment so you can make sure your flows don't do anything unexpected.
While we don't set out to break things, sadly, as with any application, the odd issue does slip through.
It sounds like the executions that can't be stopped might not be linked to the version, but I am fairly sure we fixed something related to them recently; I will dig through the release notes to see if I can find anything.
Hey @Joffcom , this might be a regression, but I'm encountering the same issue on [email protected] as well. Specifically, when a scheduled workflow is running and I try to manually stop it, it doesn't actually stop the execution (it keeps running). I'm not sure how to debug or troubleshoot this further.
@dkindlund, when you press stop, do you see an error? Are you also running in queue mode?
Hey @Joffcom , when I press stop, the UI briefly changes the workflow to a "stopped" state, but then during the next auto-refresh it goes back to "running". I'm not running in queue mode (it's the main/standalone/integrated mode). The only way to fix this is to restart the container altogether.
To be honest, it feels like some sort of DB record conflict. When I press "stop" on the workflow execution, I think it first updates the execution state in the PostgreSQL DB, and the thread responsible for running the job is then supposed to check that state in the DB -- but it never does; instead, the thread responsible for running the job just updates the DB entry again.
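For what it's worth, the stuck rows should be visible straight in the DB. A rough check (assuming the 1.x schema with an execution_entity table and a status column -- names may differ by version, and <db-host> etc. are placeholders):

# list executions the DB still considers running
psql -h <db-host> -U <db-user> -d <db-name> \
  -c "SELECT id, status, \"startedAt\", \"stoppedAt\" FROM execution_entity WHERE status = 'running';"

If "stop" worked end to end, I'd expect those rows to flip to 'canceled' rather than stay 'running'.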
Hey @dkindlund,
Do you also have this issue when you connect to your n8n instance directly, without using any kind of reverse proxy or load balancer? I have just checked our internal install, my cloud instance, and my home instance, and I am not able to reproduce this.
Hey @Joffcom , that's a good question. I don't have it running locally -- it's deployed as a Google Cloud Run container. It's currently configured to spin up between 1 and 3 instances (autoscaling). Most of the time, a single instance is running. In the network section, I do have "Session affinity" checked, so that the load balancer keeps state.
Google Cloud Run services are just a simplified wrapper on top of k8s, and here's the underlying YAML file that's generated for this deployment:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: n8n
  namespace: 'XXXREDACTEDXXX'
  selfLink: /apis/serving.knative.dev/v1/namespaces/XXXREDACTEDXXX/services/n8n
  uid: XXXREDACTEDXXX
  resourceVersion: XXXREDACTEDXXX
  generation: 38
  creationTimestamp: '2024-02-09T02:00:07.415038Z'
  labels:
    owner: darien
    managed-by: gcp-cloud-build-deploy-cloud-run
    purpose: n8n
    gcb-trigger-id: XXXREDACTEDXXX
    gcb-trigger-region: global
    commit-sha: XXXREDACTEDXXX
    gcb-build-id: XXXREDACTEDXXX
    cloud.googleapis.com/location: us-west1
  annotations:
    run.googleapis.com/client-name: cloud-console
    serving.knative.dev/creator: [email protected]
    serving.knative.dev/lastModifier: [email protected]
    run.googleapis.com/launch-stage: BETA
    run.googleapis.com/operation-id: XXXREDACTEDXXX
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
spec:
  template:
    metadata:
      labels:
        owner: darien
        client.knative.dev/nonce: XXXREDACTEDXXX
        managed-by: gcp-cloud-build-deploy-cloud-run
        purpose: n8n
        gcb-trigger-id: XXXREDACTEDXXX
        gcb-trigger-region: global
        commit-sha: XXXREDACTEDXXX
        gcb-build-id: XXXREDACTEDXXX
        run.googleapis.com/startupProbeType: Custom
      annotations:
        run.googleapis.com/client-name: cloud-console
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default","tags":["dev-n8n"]}]'
        run.googleapis.com/sessionAffinity: 'true'
        autoscaling.knative.dev/minScale: '1'
        run.googleapis.com/vpc-access-egress: private-ranges-only
        run.googleapis.com/execution-environment: gen2
        autoscaling.knative.dev/maxScale: '3'
        run.googleapis.com/startup-cpu-boost: 'true'
    spec:
      containerConcurrency: 80
      timeoutSeconds: 3600
      serviceAccountName: XXXREDACTEDXXX
      containers:
      - name: main
        image: us-west1-docker.pkg.dev/XXXREDACTEDXXX/cloud-run-source-deploy/n8n/n8n:XXXREDACTEDXXX
        ports:
        - name: http1
          containerPort: 5678
        env:
        - name: N8N_VERSION
          value: latest
        - name: DB_POSTGRESDB_DATABASE
          value: dev-n8n-conf
        - name: DB_POSTGRESDB_HOST
          value: XXXREDACTEDXXX
        - name: DB_POSTGRESDB_USER
          value: dev-n8n-conf
        - name: DB_POSTGRESDB_PORT
          value: '5432'
        - name: DB_POSTGRESDB_SCHEMA
          value: public
        - name: DB_POSTGRESDB_SSL_REJECT_UNAUTHORIZED
          value: 'false'
        - name: DB_POSTGRESDB_SSL_CA
          value: XXXREDACTEDXXX
        - name: DB_TYPE
          value: postgresdb
        - name: N8N_USER_FOLDER
          value: /opt/n8n
        - name: WEBHOOK_URL
          value: XXXREDACTEDXXX
        - name: GENERIC_TIMEZONE
          value: America/Los_Angeles
        - name: EXECUTIONS_TIMEOUT
          value: '2700'
        - name: N8N_EDITOR_BASE_URL
          value: XXXREDACTEDXXX
        - name: N8N_HOST
          value: XXXREDACTEDXXX
        - name: N8N_HIRING_BANNER_ENABLED
          value: 'false'
        - name: N8N_SMTP_HOST
          value: XXXREDACTEDXXX
        - name: N8N_SMTP_PORT
          value: '465'
        - name: N8N_SMTP_USER
          value: XXXREDACTEDXXX
        - name: N8N_SMTP_SENDER
          value: XXXREDACTEDXXX
        - name: N8N_LOG_LEVEL
          value: info
        - name: EXECUTIONS_MODE
          value: regular
        - name: N8N_DISABLE_PRODUCTION_MAIN_PROCESS
          value: 'false'
        - name: EXECUTIONS_TIMEOUT_MAX
          value: '2700'
        - name: EXECUTIONS_DATA_PRUNE
          value: 'true'
        - name: EXECUTIONS_DATA_MAX_AGE
          value: '168'
        - name: EXECUTIONS_DATA_PRUNE_MAX_COUNT
          value: '50000'
        - name: NODE_OPTIONS
          value: --max-old-space-size=1536
        - name: N8N_PUSH_BACKEND
          value: websocket
        - name: N8N_DEFAULT_BINARY_DATA_MODE
          value: filesystem
        - name: N8N_ENCRYPTION_KEY
          valueFrom:
            secretKeyRef:
              key: latest
              name: dev-n8n_secretkey
        - name: DB_POSTGRESDB_PASSWORD
          valueFrom:
            secretKeyRef:
              key: latest
              name: threat-intel-context_dev-n8n-conf_password
        - name: N8N_SMTP_PASS
          valueFrom:
            secretKeyRef:
              key: latest
              name: dev-n8n_mailjet_secretkey
        resources:
          limits:
            cpu: 2000m
            memory: 2Gi
        volumeMounts:
        - name: dev-n8n
          mountPath: /opt/n8n
        startupProbe:
          initialDelaySeconds: 60
          timeoutSeconds: 45
          periodSeconds: 60
          failureThreshold: 10
          tcpSocket:
            port: 5678
      volumes:
      - name: dev-n8n
        csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: dev-n8n
  traffic:
  - percent: 100
    latestRevision: true
status:
  observedGeneration: 38
  conditions:
  - type: Ready
    status: 'True'
    lastTransitionTime: '2024-02-28T19:57:58.756674Z'
  - type: ConfigurationsReady
    status: 'True'
    lastTransitionTime: '2024-02-09T02:00:07.521375Z'
  - type: RoutesReady
    status: 'True'
    lastTransitionTime: '2024-02-28T19:57:58.711992Z'
  latestReadyRevisionName: XXXREDACTEDXXX
  latestCreatedRevisionName: XXXREDACTEDXXX
  traffic:
  - revisionName: XXXREDACTEDXXX
    percent: 100
    latestRevision: true
  url: XXXREDACTEDXXX
  address:
    url: XXXREDACTEDXXX
Hey @Joffcom , it occurred to me that I never got a reply to my original architecture question posted in the community forum about this issue: https://community.n8n.io/t/n8n-architecture-questions/40375
Specifically, is it possible that EXECUTIONS_MODE=regular was never designed to support more than one simultaneous instance, and that the problem of stopping running executions is actually a symptom of an inadvertent split-brain problem?
Say 2 or more n8n instances are both running EXECUTIONS_MODE=regular and talking to the same database: one instance is running the job, while the other instance processes the user's request to "stop" the job in the UI -- maybe the code was never designed with this in mind?
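If that theory holds, one way I could test it is to pin Cloud Run to a single instance while keeping EXECUTIONS_MODE=regular -- roughly this change to the autoscaling annotations shown in the YAML above (a sketch, not tested):

      annotations:
        autoscaling.knative.dev/minScale: '1'
        # never scale past one instance while in regular mode
        autoscaling.knative.dev/maxScale: '1'

If the stuck executions disappear with a single instance, that would point to the split-brain explanation.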
Hey @dkindlund,
I missed your reply on this one. You are correct: regular, as documented, is not intended for multiple instances of n8n, so you would use regular if you have a single instance and queue if you are running several (with queue mode enabled on all of them).
If you were to have 2 main instances talking to the same database, I would expect there to be issues, but it would also raise more questions, like why it was deployed that way.
We do now support multiple main instances in queue mode, but even then everything still needs to be in queue mode.
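As a rough sketch of the supported layout (the queue variables here match the ones from the Docker stack earlier in this thread, applied to every instance):

# shared by the main, webhook and worker instances
EXECUTIONS_MODE=queue
QUEUE_BULL_REDIS_HOST=redis
QUEUE_BULL_REDIS_PORT=6379
DB_TYPE=postgresdb
DB_POSTGRESDB_HOST=postgres
# workers are then started with: n8n worker --concurrency=10

Regular mode skips the Redis queue entirely, which fits the split-brain theory above: a stop request that lands on an instance that isn't running the execution has nothing local to stop.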