[Bug]: WFO Orchestrator Scheduler will not trigger jobs as Scheduled
Contact Details
What happened?
Note: may be related to #1119 Python Version (not available in the dropdown on this bug report template): 3:13.9
Module: WFO Scheduler
Expected behaviour: Scheduled jobs should run at the configured times -
@scheduler(name="Nightly Node Subscriptions Validator", time_unit="day", at="00:01") def run_nightly_node_validation() -> None: ...
Observed behaviour: Jobs fail to run, instead presenting a "job store" error: Error getting due jobs from job store 'default'. Log snippet below. Full scheduler log attached to this report:
Version
Orchestrator Core 4.4.2 (UI 5.3.3)
What python version are you seeing the problem on?
All
Relevant log output
2025-10-21 00:01:00 [debug ] Looking for jobs to run [apscheduler.scheduler]
2025-10-21 00:01:00 [warning ] Error getting due jobs from job store 'default': (psycopg2.OperationalError) server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[SQL: SELECT apscheduler_jobs.id, apscheduler_jobs.job_state
FROM apscheduler_jobs
WHERE apscheduler_jobs.next_run_time <= %(next_run_time_1)s ORDER BY apscheduler_jobs.next_run_time]
[parameters: {'next_run_time_1': 1761004860.000935}]
(Background on this error at: https://sqlalche.me/e/20/e3q8) [apscheduler.scheduler]
ALSO:
2025-10-21 00:06:20 [debug ] Looking for jobs to run [apscheduler.scheduler]
2025-10-21 00:06:20 [info ] Running job "Resume workflows (trigger: interval[1:00:00], next run at: 2025-10-21 01:06:20 UTC)" (scheduled at 2025-10-21 00:06:20.600251+00:00) [apscheduler.executors.default]
2025-10-21 00:06:20 [debug ] Post form [pydantic_forms.core.sync] state={'process_id': UUID('db81090d-77c2-4dce-9456-05b9558c0618'), 'reporter': 'SYSTEM', 'workflow_name': 'task_resume_workflows', 'workflow_target': <Target.SYSTEM: 'SYSTEM'>} user_inputs=[{}]
2025-10-21 00:06:20 [error ] Job "Resume workflows (trigger: interval[1:00:00], next run at: 2025-10-21 01:06:20 UTC)" raised an exception [apscheduler.executors.default]
Traceback (most recent call last):
File "/usr/local/lib/python3.13/site-packages/apscheduler/executors/base.py", line 131, in run_job
retval = job.func(*job.args, **job.kwargs)
File "/usr/local/lib/python3.13/site-packages/orchestrator/schedules/resume_workflows.py", line 21, in run_resume_workflows
start_process("task_resume_workflows")
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/site-packages/orchestrator/services/processes.py", line 495, in start_process
pstat = create_process(workflow_key, user_inputs=user_inputs, user=user)
File "/usr/local/lib/python3.13/site-packages/orchestrator/services/processes.py", line 470, in create_process
_db_create_process(pstat)
~~~~~~~~~~~~~~~~~~^^^^^^^
File "/usr/local/lib/python3.13/site-packages/orchestrator/services/processes.py", line 107, in _db_create_process
wf_table = get_workflow_by_name(stat.workflow.name)
File "/usr/local/lib/python3.13/site-packages/orchestrator/services/workflows.py", line 64, in get_workflow_by_name
return db.session.scalar(select(WorkflowTable).where(WorkflowTable.name == workflow_name))
~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/site-packages/sqlalchemy/orm/session.py", line 2413, in scalar
return self._execute_internal(
~~~~~~~~~~~~~~~~~~~~~~^
statement,
^^^^^^^^^^
...<4 lines>...
**kw,
^^^^^
)
^
File "/usr/local/lib/python3.13/site-packages/sqlalchemy/orm/session.py", line 2251, in _execute_internal
result: Result[Any] = compile_state_cls.orm_execute_statement(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
self,
^^^^^
...<4 lines>...
conn,
^^^^^
)
^
File "/usr/local/lib/python3.13/site-packages/sqlalchemy/orm/context.py", line 306, in orm_execute_statement
result = conn.execute(
statement, params or {}, execution_options=execution_options
)
File "/usr/local/lib/python3.13/site-packages/sqlalchemy/engine/base.py", line 1415, in execute
return meth(
self,
distilled_parameters,
execution_options or NO_OPTIONS,
)
File "/usr/local/lib/python3.13/site-packages/sqlalchemy/sql/elements.py", line 523, in _execute_on_connection
return connection._execute_clauseelement(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
self, distilled_params, execution_options
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/usr/local/lib/python3.13/site-packages/sqlalchemy/engine/base.py", line 1637, in _execute_clauseelement
ret = self._execute_context(
dialect,
...<8 lines>...
cache_hit=cache_hit,
)
File "/usr/local/lib/python3.13/site-packages/sqlalchemy/engine/base.py", line 1809, in _execute_context
conn = self._revalidate_connection()
File "/usr/local/lib/python3.13/site-packages/sqlalchemy/engine/base.py", line 675, in _revalidate_connection
self._invalid_transaction()
~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/usr/local/lib/python3.13/site-packages/sqlalchemy/engine/base.py", line 665, in _invalid_transaction
raise exc.PendingRollbackError(
...<4 lines>...
)
sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. Please rollback() fully before proceeding (Background on this error at: https://sqlalche.me/e/20/8s2b)
main.py schedule show-schedule output shows correct values for jobs.
Values for job-times as they exist in the postgres db. The tasks we have configured (those named "run_...") have next_run_times configured (a) in the past, and (b) at times which do not relate to those accurately listed by scheduler show-schedule as shown in the previous comment.
@tjeerddie Could you maybe have a quick look at this?
not sure why the server closes immediately...
with the fix for show-schedule, I also changed the store to use orchestrator.db instead of using DB url, which might fix this
I checked the dates between the database and main.py schedule show-schedule, and they match. for example nightly_port:
- show schedules: 2025-10-22 00:30:00+00:00
- DB: 1761093000, formats to 2025-10-22 00:30:00+00:00
When the scheduler starts, it detects scheduled task times that are in the past and updates them before it waits for the next scheduled task to run
- If the DB session is closed unexpectedly like I see in your logs, it can’t update those schedules, so they remain in the past.
hope this is some useful info and could you try pre release 4.6.0rc2 or above to see if the problem persists?
Hi @tjeerddie - sorry been sidetracked on other things here. Hitherto I've been assuming that database time to be UNIX time in UTC, i.e. the number of Seconds since January 1, 1970 at midnight UTC. I'm thinking now that this may not be the case at all - could you clarify precisely what that timestamp in the DB is?
UPDATE: yes it is UNIX time, sorry!
WFO Scheduler Container log output. Latest attempt, same problem I'm afraid.
main.py schedule show-schedule output for the job in question is correct:
[run_pyshiba_internal_integrity_check] Next run: 2025-10-29 16:00:00+00:00 | Trigger: cron[hour='16', minute='0']
An examination of the database (id, next_run_time columns for this job's row in apscheduler_jobs) shows:
run_pyshiba_internal_integrity_check | 1761753600
which seems to tally. I'll examine the postgres logs here to see if I can find anything.
Strangely, the job I had scheduled for exactly 5 minutes AFTER the failed 4pm UTC job mentioned above - was processed successfully. Bearing that in mind, given this kind of inconsistent behaviour, I am wondering if the PostGres warning seen towards the top of the Scheduler log output in the screenshot above, and pasted here from Postgres's own logging, may be a pointer:
2025-10-29 16:41:59.829 UTC [9997] WARNING: database "orchestrator" has a collation version mismatch 2025-10-29 16:41:59.829 UTC [9997] DETAIL: The database was created using collation version 2.36, but the operating system provides version 2.41. 2025-10-29 16:41:59.829 UTC [9997] HINT: Rebuild all objects in this database that use the default collation and run ALTER DATABASE orchestrator REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.
As far as I know, collation has to do with the sort order of characters based on their value, and has nothing to do with how time/date values are presented.
Which postgres version are you using?
Have you tried to resolve the collation error?
Did you check if you're running into the maximum number of database connections?
Sorry been diverted onto other matters here.
root@7d5970f2da94:/# psql --version psql (PostgreSQL) 17.6 (Debian 17.6-2.pgdg13+1) root@7d5970f2da94:/# postgres -V postgres (PostgreSQL) 17.6 (Debian 17.6-2.pgdg13+1)
100 max connections set on the running container which should surely be enough? We'll increase to test in any case.
Our Postgres database connection status right now: 14 active connections as against 100 maximum connections allowed:
orchestrator=> show max_connections;
max_connections
-----------------
100
(1 row)
orchestrator=> SELECT COUNT(*) FROM pg_stat_activity;
count
-------
14
(1 row)
The original bug in this issue was fixed, and we cannot reproduce the problem you're encountering of the scheduler shutting down unexpectedly.
We recommend you address the reported collation mismatch between postgres and operating system and see if the problem persists.
If it does, please try to spin up the example-orchestratrator stack (using this branch if it is not merged) and try to run the scheduler there - as an example of how it is supposed to work. Perhaps you can then compare the setup to your own and find out what's causing this. You can always create a new Question post on Discord for more help.