MDEV-34642 Shutdown takes indefinite time when /tmp is full.
- [x] The Jira issue number for this PR is: MDEV-34642
Description
Write operations where the caller requests waiting for space will wait indefinitely. A server shutdown kills these threads; however, there is no exit from the functions that wait for space.
Extend the killing of a thread to set the abort flag in mysys_var so that my_[p,f,]write aborts after MY_WAIT_FOR_USER_TO_FIX_PANIC (60 seconds), allowing the shutdown to continue.
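As a rough illustration of the idea described above (not the actual mysys code), the following minimal sketch shows a write loop that waits for space but gives up once a kill has set an abort flag. All names here (`thread_state`, `fake_disk`, `write_waiting_for_space`, the callbacks) are hypothetical stand-ins; only the 60-second wait mirrors the value quoted for MY_WAIT_FOR_USER_TO_FIX_PANIC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread state (mysys_var in the PR). */
struct thread_state {
  volatile bool abort_requested;   /* set by the thread that performs the kill */
};

/* Hypothetical callbacks so the sketch runs without a real full disk. */
typedef long (*write_fn)(const void *buf, size_t len, void *ctx);
typedef void (*sleep_fn)(unsigned seconds, void *ctx);

enum { WAIT_SECONDS = 60 };  /* mirrors the 60 seconds quoted in the description */

/* Write all of buf, waiting for space when the write makes no progress.
   Returns 0 on success, -1 if the thread was asked to abort. */
int write_waiting_for_space(struct thread_state *ts, const void *buf, size_t len,
                            write_fn do_write, sleep_fn do_sleep, void *ctx)
{
  const char *p = buf;
  assert(ts != NULL && do_write != NULL && do_sleep != NULL);
  while (len > 0) {
    long n = do_write(p, len, ctx);
    if (n > 0) {                   /* partial or full progress */
      p += n;
      len -= (size_t) n;
      continue;
    }
    if (ts->abort_requested)       /* killed before waiting: give up now */
      return -1;
    do_sleep(WAIT_SECONDS, ctx);   /* wait for someone to free space */
    if (ts->abort_requested)       /* killed during the wait: give up */
      return -1;
  }
  return 0;
}

/* Fake disk for demonstration: limited free space, and a "kill" that
   arrives after a configurable number of sleeps. */
struct fake_disk {
  long free_bytes;
  int sleeps;
  struct thread_state *ts;
  int kill_after_sleeps;
};

long fake_write(const void *buf, size_t len, void *ctx)
{
  struct fake_disk *d = ctx;
  (void) buf;
  if (d->free_bytes <= 0)
    return -1;                     /* simulates a disk-full error */
  long n = (long) len < d->free_bytes ? (long) len : d->free_bytes;
  d->free_bytes -= n;
  return n;
}

void fake_sleep(unsigned seconds, void *ctx)
{
  struct fake_disk *d = ctx;
  (void) seconds;                  /* no real sleeping in the sketch */
  if (++d->sleeps >= d->kill_after_sleeps)
    d->ts->abort_requested = true; /* simulates shutdown killing the thread */
}
```

With a fake disk holding 4 free bytes and a kill arriving during the first wait, an 8-byte write returns -1 after exactly one wait instead of looping forever.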
Release Notes
Shutdown can now proceed while queries are waiting for disk space. These queries are killed off.
How can this PR be tested?
Via the included mtr test.
If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.
Basing the PR against the correct MariaDB version
- [ ] This is a new feature or a refactoring, and the PR is based against the latest MariaDB development branch.
- [X] This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.
PR quality check
- [X] I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
- [X] For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.
Is there a potential for data loss (particularly with engines such as Aria) if we kill off things that are waiting to write data?
I also think mariadb_sleep_for_space() looks buggy. Unlike in other cases (e.g. SLEEP), it does not set mysys_var->current_mutex and mysys_var->current_cond to the mutex and condition it is waiting on, so it can't be instantly interrupted by kills.
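To make the point about interruptible waits concrete, here is a minimal sketch using plain POSIX threads. It is not the server's code: the `waiter` struct and the functions `waiter_init`, `sleep_interruptible`, and `kill_waiter` are hypothetical, and the sketch collapses the "register the mutex and condition in mysys_var" step into a single struct that both the sleeper and the killer can reach. The idea it shows is that a sleep built on a condition variable the killer can signal returns as soon as the kill arrives, rather than after the full timeout.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical: the mutex/condition pair a killer needs to know about
   in order to wake a sleeping thread. */
struct waiter {
  pthread_mutex_t lock;
  pthread_cond_t cond;
  bool killed;
};

void waiter_init(struct waiter *w)
{
  pthread_mutex_init(&w->lock, NULL);
  pthread_cond_init(&w->cond, NULL);
  w->killed = false;
}

/* Sleep for up to `seconds`, but return early if killed.
   Returns true if the wait ended because of a kill. */
bool sleep_interruptible(struct waiter *w, unsigned seconds)
{
  struct timespec deadline;
  bool killed;

  clock_gettime(CLOCK_REALTIME, &deadline);
  deadline.tv_sec += seconds;

  pthread_mutex_lock(&w->lock);
  while (!w->killed) {
    /* Loop guards against spurious wakeups; stop on timeout. */
    if (pthread_cond_timedwait(&w->cond, &w->lock, &deadline) == ETIMEDOUT)
      break;
  }
  killed = w->killed;
  pthread_mutex_unlock(&w->lock);
  return killed;
}

/* What a KILL or shutdown would do: set the flag and wake the sleeper. */
void kill_waiter(struct waiter *w)
{
  pthread_mutex_lock(&w->lock);
  w->killed = true;
  pthread_cond_signal(&w->cond);
  pthread_mutex_unlock(&w->lock);
}
```

A thread blocked in `sleep_interruptible(&w, 60)` wakes as soon as another thread calls `kill_waiter(&w)`. A sleep that the killer has no handle on would instead run out its full 60 seconds.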
Monty may have fixed it with MDEV-33813 in 10.6 only; I think the test runs quicker in 10.6.
I think the purpose of abort needs rechecking. Yes, MyISAM could corrupt (as the test case suppresses), but no more than with any other filesystem error.
I do not know why Monty's fix works, given my previous comment. mysys_var->mutex and cond are not set. Those are the things that get used in KILL or shutdown; at least traditionally this was always the case. And in the unlikely case it works, can't we either backport the fix to 10.5, or rebase the test case to 10.6? I'm just checking how long the test case took in the buildbot output, and this happened to be the longest thing on Windows. I would not want to stall a buildbot machine with sleeps, tbh :)
Checked again: Monty's fix would work, because thd->ENTER_COND would set mysys_var->cond and mutex, but the request to add the test where the fix is available still stands. There are many pushes into 10.5; a test here that does nothing should not take 1 minute, given that the full suite runs in 4 minutes (Windows foundation buildbot numbers).