orchestrator-core icon indicating copy to clipboard operation
orchestrator-core copied to clipboard

[Bug]: WFO Orchestrator Scheduler will not trigger jobs as Scheduled

Open Val-HEAnet opened this issue 2 months ago • 12 comments

Contact Details

[email protected]

What happened?

Note: may be related to #1119 Python Version (not available in the dropdown on this bug report template): 3:13.9

Module: WFO Scheduler

Expected behaviour: Scheduled jobs should run at the configured times -

@scheduler(name="Nightly Node Subscriptions Validator", time_unit="day", at="00:01") def run_nightly_node_validation() -> None: ...

Observed behaviour: Jobs fail to run, instead presenting a "job store" error: Error getting due jobs from job store 'default'. Log snippet below. Full scheduler log attached to this report:

wfo.scheduler.log

Version

Orchestrator Core 4.4.2 (UI 5.3.3)

What python version are you seeing the problem on?

All

Relevant log output

2025-10-21 00:01:00 [debug    ] Looking for jobs to run        [apscheduler.scheduler]
2025-10-21 00:01:00 [warning  ] Error getting due jobs from job store 'default': (psycopg2.OperationalError) server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

[SQL: SELECT apscheduler_jobs.id, apscheduler_jobs.job_state
FROM apscheduler_jobs
WHERE apscheduler_jobs.next_run_time <= %(next_run_time_1)s ORDER BY apscheduler_jobs.next_run_time]
[parameters: {'next_run_time_1': 1761004860.000935}]
(Background on this error at: https://sqlalche.me/e/20/e3q8) [apscheduler.scheduler]

ALSO:

2025-10-21 00:06:20 [debug    ] Looking for jobs to run        [apscheduler.scheduler]
2025-10-21 00:06:20 [info     ] Running job "Resume workflows (trigger: interval[1:00:00], next run at: 2025-10-21 01:06:20 UTC)" (scheduled at 2025-10-21 00:06:20.600251+00:00) [apscheduler.executors.default]
2025-10-21 00:06:20 [debug    ] Post form                      [pydantic_forms.core.sync] state={'process_id': UUID('db81090d-77c2-4dce-9456-05b9558c0618'), 'reporter': 'SYSTEM', 'workflow_name': 'task_resume_workflows', 'workflow_target': <Target.SYSTEM: 'SYSTEM'>} user_inputs=[{}]
2025-10-21 00:06:20 [error    ] Job "Resume workflows (trigger: interval[1:00:00], next run at: 2025-10-21 01:06:20 UTC)" raised an exception [apscheduler.executors.default]
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/apscheduler/executors/base.py", line 131, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/usr/local/lib/python3.13/site-packages/orchestrator/schedules/resume_workflows.py", line 21, in run_resume_workflows
    start_process("task_resume_workflows")
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/orchestrator/services/processes.py", line 495, in start_process
    pstat = create_process(workflow_key, user_inputs=user_inputs, user=user)
  File "/usr/local/lib/python3.13/site-packages/orchestrator/services/processes.py", line 470, in create_process
    _db_create_process(pstat)
    ~~~~~~~~~~~~~~~~~~^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/orchestrator/services/processes.py", line 107, in _db_create_process
    wf_table = get_workflow_by_name(stat.workflow.name)
  File "/usr/local/lib/python3.13/site-packages/orchestrator/services/workflows.py", line 64, in get_workflow_by_name
    return db.session.scalar(select(WorkflowTable).where(WorkflowTable.name == workflow_name))
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/sqlalchemy/orm/session.py", line 2413, in scalar
    return self._execute_internal(
           ~~~~~~~~~~~~~~~~~~~~~~^
        statement,
        ^^^^^^^^^^
    ...<4 lines>...
        **kw,
        ^^^^^
    )
    ^
  File "/usr/local/lib/python3.13/site-packages/sqlalchemy/orm/session.py", line 2251, in _execute_internal
    result: Result[Any] = compile_state_cls.orm_execute_statement(
                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        self,
        ^^^^^
    ...<4 lines>...
        conn,
        ^^^^^
    )
    ^
  File "/usr/local/lib/python3.13/site-packages/sqlalchemy/orm/context.py", line 306, in orm_execute_statement
    result = conn.execute(
        statement, params or {}, execution_options=execution_options
    )
  File "/usr/local/lib/python3.13/site-packages/sqlalchemy/engine/base.py", line 1415, in execute
    return meth(
        self,
        distilled_parameters,
        execution_options or NO_OPTIONS,
    )
  File "/usr/local/lib/python3.13/site-packages/sqlalchemy/sql/elements.py", line 523, in _execute_on_connection
    return connection._execute_clauseelement(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        self, distilled_params, execution_options
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/usr/local/lib/python3.13/site-packages/sqlalchemy/engine/base.py", line 1637, in _execute_clauseelement
    ret = self._execute_context(
        dialect,
    ...<8 lines>...
        cache_hit=cache_hit,
    )
  File "/usr/local/lib/python3.13/site-packages/sqlalchemy/engine/base.py", line 1809, in _execute_context
    conn = self._revalidate_connection()
  File "/usr/local/lib/python3.13/site-packages/sqlalchemy/engine/base.py", line 675, in _revalidate_connection
    self._invalid_transaction()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/lib/python3.13/site-packages/sqlalchemy/engine/base.py", line 665, in _invalid_transaction
    raise exc.PendingRollbackError(
    ...<4 lines>...
    )
sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back.  Please rollback() fully before proceeding (Background on this error at: https://sqlalche.me/e/20/8s2b)

Val-HEAnet avatar Oct 21 '25 09:10 Val-HEAnet

Image

main.py schedule show-schedule output shows correct values for jobs.

Val-HEAnet avatar Oct 21 '25 10:10 Val-HEAnet

Image

Values for job-times as they exist in the postgres db. The tasks we have configured (those named "run_...") have next_run_times configured (a) in the past, and (b) at times which do not relate to those accurately listed by scheduler show-schedule as shown in the previous comment.

Val-HEAnet avatar Oct 21 '25 10:10 Val-HEAnet

@tjeerddie Could you maybe have a quick look at this?

hanstrompert avatar Oct 21 '25 14:10 hanstrompert

not sure why the server closes immediately...

with the fix for show-schedule, I also changed the store to use orchestrator.db instead of using DB url, which might fix this

tjeerddie avatar Oct 23 '25 07:10 tjeerddie

I checked the dates between the database and main.py schedule show-schedule, and they match. for example nightly_port:

  • show schedules: 2025-10-22 00:30:00+00:00
  • DB: 1761093000, formats to 2025-10-22 00:30:00+00:00

When the scheduler starts, it detects scheduled task times that are in the past and updates them before it waits for the next scheduled task to run

  • If the DB session is closed unexpectedly like I see in your logs, it can’t update those schedules, so they remain in the past.

hope this is some useful info and could you try pre release 4.6.0rc2 or above to see if the problem persists?

tjeerddie avatar Oct 27 '25 14:10 tjeerddie

Hi @tjeerddie - sorry been sidetracked on other things here. Hitherto I've been assuming that database time to be UNIX time in UTC, i.e. the number of Seconds since January 1, 1970 at midnight UTC. I'm thinking now that this may not be the case at all - could you clarify precisely what that timestamp in the DB is?

UPDATE: yes it is UNIX time, sorry!

Val-HEAnet avatar Oct 29 '25 15:10 Val-HEAnet

Image WFO Scheduler Container log output. Latest attempt, same problem I'm afraid.

main.py schedule show-schedule output for the job in question is correct: [run_pyshiba_internal_integrity_check] Next run: 2025-10-29 16:00:00+00:00 | Trigger: cron[hour='16', minute='0']

An examination of the database (id, next_run_time columns for this job's row in apscheduler_jobs) shows: run_pyshiba_internal_integrity_check | 1761753600

which seems to tally. I'll examine the postgres logs here to see if I can find anything.

Val-HEAnet avatar Oct 29 '25 16:10 Val-HEAnet

Strangely, the job I had scheduled for exactly 5 minutes AFTER the failed 4pm UTC job mentioned above - was processed successfully. Bearing that in mind, given this kind of inconsistent behaviour, I am wondering if the PostGres warning seen towards the top of the Scheduler log output in the screenshot above, and pasted here from Postgres's own logging, may be a pointer:

2025-10-29 16:41:59.829 UTC [9997] WARNING: database "orchestrator" has a collation version mismatch 2025-10-29 16:41:59.829 UTC [9997] DETAIL: The database was created using collation version 2.36, but the operating system provides version 2.41. 2025-10-29 16:41:59.829 UTC [9997] HINT: Rebuild all objects in this database that use the default collation and run ALTER DATABASE orchestrator REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.

Val-HEAnet avatar Oct 29 '25 16:10 Val-HEAnet

As far as I know, collation has to do with the sort order of characters based on their value, and has nothing to do with how time/date values are presented.

hanstrompert avatar Nov 03 '25 13:11 hanstrompert

Which postgres version are you using?

Have you tried to resolve the collation error?

Did you check if you're running into the maximum number of database connections?

Mark90 avatar Nov 03 '25 13:11 Mark90

Sorry been diverted onto other matters here.

root@7d5970f2da94:/# psql --version psql (PostgreSQL) 17.6 (Debian 17.6-2.pgdg13+1) root@7d5970f2da94:/# postgres -V postgres (PostgreSQL) 17.6 (Debian 17.6-2.pgdg13+1)

100 max connections set on the running container which should surely be enough? We'll increase to test in any case.

Val-HEAnet avatar Nov 07 '25 11:11 Val-HEAnet

Our Postgres database connection status right now: 14 active connections as against 100 maximum connections allowed:

orchestrator=> show max_connections;
max_connections
-----------------
100
(1 row)
 
orchestrator=> SELECT COUNT(*) FROM pg_stat_activity;
count
-------
    14
(1 row)

Val-HEAnet avatar Nov 07 '25 12:11 Val-HEAnet

The original bug in this issue was fixed, and we cannot reproduce the problem you're encountering of the scheduler shutting down unexpectedly.

We recommend you address the reported collation mismatch between postgres and operating system and see if the problem persists.

If it does, please try to spin up the example-orchestratrator stack (using this branch if it is not merged) and try to run the scheduler there - as an example of how it is supposed to work. Perhaps you can then compare the setup to your own and find out what's causing this. You can always create a new Question post on Discord for more help.

Mark90 avatar Dec 01 '25 15:12 Mark90