spiffe-helper icon indicating copy to clipboard operation
spiffe-helper copied to clipboard

Retry sending signals to pid_file_name before logging errors

Open ringerc opened this issue 11 months ago • 4 comments

There is a race where the process pointed to by pid_file_name could be restarted and read the old certificates but not have its new pid written into pid_file_name yet. In this case, we would try to signal the old process and fail, then give up, so the new process would never get its certs refreshed.

On initial startup, unavoidable errors may also be logged when the process to be launched has not yet been started (because its certs don't exist yet) so pid_file_name is missing or invalid. These shouldn't be logged as errors, but we don't really want to suppress all errors for all pid file signalling since that would hide misconfigurations.

Handle both issues by adding a retry-backoff loop for signalling pid_file_name. On failure, messages are logged at Info level until the retry limit is exceeded, then an Error is logged.

During startup the controlling process is expected to be watching for the creation of the certificates, and to launch the controlled process and create the pid file in a timely manner once the certs appear. So it becomes possible to start up without "normal" errors appearing.

For the restart-race case, we will initially fail to signal the stale pid in the pid file, so we will retry until the new pid is written, so the new process that loaded stale certs will be properly signalled to reload.

Fixes: #250, #251

I added an extensive test suite as part of this, covering command execution, signal handling, etc. If you want some of that split out from this PR then this PR rebased on top I can probably do that.

I also fixed Windows builds and ensured the existing and added tests run on Windows - or at least, under wine when cross-compiled

I strongly suggest merging this via fast-forward merge or with a merge commit, not a squash merge. Otherwise the separation of commits will be lost and it'll be a pain to understand in the history.

ringerc avatar Feb 02 '25 23:02 ringerc

Added some test cover for command exec. Will add some for pid signalling next.

ringerc avatar Feb 03 '25 08:02 ringerc

I see some linter complaints here. I'll address them, and probably split the PR.

ringerc avatar Feb 05 '25 00:02 ringerc

Windows fixes split off to simpler individual PR: https://github.com/spiffe/spiffe-helper/pull/257

Added test cover split off to a PR that focuses on adding the test cover, https://github.com/spiffe/spiffe-helper/pull/258

I've rebased this PR onto a merge-commit created from the two listed above and fixed the linter complaints.

Github doesn't seem to have any way to say "this PR depends on these other two", and only appears to understand linear histories, so I'll keep this PR marked as a draft to prevent it being inadvertently squash-merged until the other two are reviewed and merged. Then this will be a trivial git rebase origin/main.

The only commit unique to this PR is https://github.com/spiffe/spiffe-helper/pull/252/commits/cb78cab56c5a12cb6794cd9af5fd5c05f8d2b61b so that's the only one to focus on for review.

ringerc avatar Feb 05 '25 01:02 ringerc

I won't bother rebasing this since the merge conflict is actually in one of the dependent commits that I've already fixed over in https://github.com/spiffe/spiffe-helper/pull/258 . When that gets merged, I can remove the relevant commit from this branch and the merge conflict will go away.

ringerc avatar Feb 25 '25 01:02 ringerc

@ringerc All the pre reqs have been merged. Are you still interesting in finishing off this pr?

faisal-memon avatar May 11 '25 08:05 faisal-memon

@faisal-memon I've run out of time and capacity to update for the conflicts at this point. I'll try to return to if it I can.

ringerc avatar Aug 04 '25 01:08 ringerc

Closing in favor of #333

faisal-memon avatar Nov 04 '25 22:11 faisal-memon