Canvas - issues sending emails - task runners crashing with "out of memory" errors when there is plenty of free memory
Update: The proposed "fix" has been commented out as it does not seem to make any difference. My colleague assures me that it was working (at least in part); however, my testing suggests that it makes no difference.
Our latest v18.x Canvas has a number of known issues that we're working to resolve. AFAICT the issues we are investigating are all related to the background task runners crashing.
The issues that have been reproduced related to background task runners crashing are:
- emails not sending
- unable to upload user profile avatar
- unable to apply custom/updated themes
Other issues that have not been directly confirmed but appear to be related are:
- #1977
- #1978
As some background to the apparent cause of the issue: when any action is triggered in Canvas (e.g. sending an email, uploading a file, and most changes made in the UI), the action is added to a background job queue. When operating correctly, a background service starts a task runner process to action the next job on the queue.
In our current Canvas release, the background service is running ok, but the individual task runners are crashing and not completing their tasks, leaving the jobs in the queue. The task runners die with an error message to the effect of "out of memory" - when there is plenty of free system memory.
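One thing that may be worth ruling out is a per-process limit: a process can fail to allocate memory even when the system as a whole has plenty free. The following is only a rough sketch for comparing the two, not a confirmed diagnosis. The helper name `mem_available_mib` is made up for illustration, and the limits shown are those of the shell you run it from, which may differ from the limits the task runners inherit from the service.

```shell
#!/bin/sh
# Sketch: compare free system memory with the limits a process actually sees.
# Assumption: run this as the same user (and ideally from the same
# environment) that starts the Canvas task runners.

# Print available system memory in MiB.
# Reads /proc/meminfo unless another file is given as $1.
mem_available_mib() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

echo "MemAvailable: $(mem_available_mib) MiB"

# Per-process resource limits for the current shell. A low "virtual memory"
# or "max memory size" value can cause allocation failures despite free RAM.
ulimit -a

# Kernel OOM-killer activity, if any (reading dmesg may require root).
dmesg 2>/dev/null | grep -i 'out of memory' \
    || echo "no OOM-killer entries found (or dmesg not readable)"
```

If the kernel log shows no OOM-killer entries, the "out of memory" message is probably coming from the process itself rather than from the kernel killing it.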
We thought that we had developed a solution which seemed to resolve the issues that had been confirmed, e.g. emails started being sent. However, after testing the "fix" on multiple servers over numerous reboots, it became clear that it just changed the nature of the front end error(s) and reduced the incidence of the task runner crashes. It didn't stop them occurring altogether. Intermittent task runner crashes (with the same memory error message) were still occurring.
systemctl stop canvas_init
systemctl stop apache2
cd /var/www/canvas
RAILS_ENV=production bundle exec rake switchman_inst_jobs:install:migrations
systemctl start canvas_init
systemctl start apache2
The issue (at least after the "fix" has been applied) is intermittent for at least some cases and appears to be some sort of race condition. Unfortunately because the issue is intermittent and the only specific error message I've seen seems to be a red herring, it's particularly difficult to isolate the cause.
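Because the crashes are intermittent, it may help to record how often they occur across reboots. Below is a minimal sketch of one way to do that. Both the log path and the search pattern are assumptions (the exact error text and where the runners log it have not been confirmed here), so adjust `LOG` and `PATTERN` to match what you actually see. `count_matches` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count task runner crash messages in a log file.
# Assumptions: LOG and PATTERN are guesses, not confirmed values.
LOG="${LOG:-/var/www/canvas/log/delayed_job.log}"
PATTERN="${PATTERN:-out of memory}"

# Print the number of lines in file $1 that match $2 (case-insensitive).
count_matches() {
    grep -ci -- "$2" "$1" 2>/dev/null || true
}

if [ -r "$LOG" ]; then
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(count_matches "$LOG" "$PATTERN") matching lines in $LOG"
else
    echo "log not readable: $LOG"
fi
```

Running this after each reboot (or from cron) and keeping the output gives a simple timeline of how often the error appears.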
I have asked one of my colleagues to investigate the issue further, but so far we have had no progress. I plan to rebuild our Canvas server from scratch and carefully document the issue on a fresh server ASAP. After confirming that it is nothing we're overlooking on our end, I will lodge a bug report upstream.