Handle bad clients
So this is straight out of "fault tolerant systems" territory. It would be really nice to handle faulty clients at the server level. When one of our clients breaks in a way that it still picks up tests but fails them immediately, due to problems with keys or the file system, it can wreak havoc on the dashboard: a single client can be handed several jobs that it (incorrectly) marks as failed. Solving this robustly means taking a vote from multiple clients when this happens, so that the server can determine that a faulty client is taking jobs but not actually performing the work, and then essentially "disable" it by not feeding that client any more work.
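For concreteness, here's a rough sketch of what that voting could look like on the server side. Everything here (the quorum size, the function names) is hypothetical, not existing civet code:

```python
# Hypothetical sketch: hand the same suspect job to several clients and
# only trust the result if a majority agree; clients that keep landing in
# the minority are the ones to stop feeding work to.
from collections import Counter

QUORUM = 3  # assumed number of clients asked to re-run a suspect job

def resolve_job(results):
    """results: list of (client_id, passed) pairs from QUORUM clients."""
    tally = Counter(passed for _, passed in results)
    verdict, votes = tally.most_common(1)[0]
    if votes <= len(results) // 2:
        return None, []  # no majority; just reschedule the job
    # Clients that disagreed with the majority are the suspects.
    outliers = [cid for cid, passed in results if passed != verdict]
    return verdict, outliers
```

The obvious downside is cost: every suspect job gets run QUORUM times, which is why the less rigorous approach below is tempting.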
Assuming we don't have malicious clients, we could probably take a less rigorous approach to rooting out bad actors. Perhaps we slow down eligibility for clients that turn work around too fast (e.g. if they fail two or three jobs in fast succession, we put them on a cool-down for "a while"). Of course this won't fix the problem, but it might be a simple improvement. There is no easy path to fixing this problem robustly, but just handing out jobs like we do now is really bad in this situation.
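Something like this minimal sketch, say; the window, threshold, and cool-down length are made-up numbers, and the in-memory bookkeeping is just for illustration:

```python
# Sketch of the cool-down heuristic: a client that fails several jobs in
# fast succession becomes ineligible for new work for a while.
import time

FAIL_WINDOW = 60       # seconds that count as "fast succession" (assumed)
FAIL_THRESHOLD = 3     # failures inside the window that trigger a cool-down
COOLDOWN = 15 * 60     # seconds a flagged client stays ineligible (assumed)

recent_failures = {}   # client_id -> timestamps of recent failures
cooldown_until = {}    # client_id -> time its cool-down expires

def record_failure(client_id):
    now = time.time()
    fails = [t for t in recent_failures.get(client_id, []) if now - t < FAIL_WINDOW]
    fails.append(now)
    recent_failures[client_id] = fails
    if len(fails) >= FAIL_THRESHOLD:
        cooldown_until[client_id] = now + COOLDOWN

def eligible(client_id):
    return time.time() >= cooldown_until.get(client_id, 0)
```

A healthy client that hits a burst of genuinely broken tests would get cooled down too, but the cost is only that its share of jobs goes to other clients for a while.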
Yeah, this would happen occasionally and was pretty annoying. I was not a big fan of keeping state on the server, since it would need to be stored in the database and constantly checked, so I was going to go with your assumption that we don't have malicious clients and let them take care of themselves.

My pseudo plan was to mark most of the initial git operations as "No Fail", since that is where most of these sorts of problems happen (network, keys, git server down, etc.). I think the only git operation that should count as a real failure is if the merge fails. The "No Fail" operations would exit with a certain code, or touch a file in the filesystem (if the actual git exit code is wanted).

The client would check this, and if something marked "No Fail" failed, the client would add that job id to a list of jobs not to ask for, then tell the server to reschedule the job. This list of failed jobs would be held only in the client, and would probably be specific to each git server: if GitHub goes down, it shouldn't affect doing GitLab jobs. If too many jobs get added to the list, the client stops doing jobs for that server. Each entry could also have a timeout (i.e. if a job gets a "No Fail" but the git server starts working again, the initial "No Fail" mark gets expired).
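As a sketch of the client side of that plan (the exit code, the reschedule call, and the limits are all placeholders, not real civet interfaces):

```python
# Client-side sketch of the "No Fail" plan: job ids whose setup steps fail
# go on a per-git-server skip list, entries expire after a timeout, and the
# client stops asking a server for work once its list grows too long.
import subprocess
import time

NO_FAIL_EXIT = 86    # hypothetical exit code meaning "a No Fail step failed"
SKIP_TTL = 30 * 60   # seconds before a skip entry expires (assumed)
MAX_SKIPPED = 5      # skip-list size at which we stop taking that server's jobs

skip_lists = {}      # git_server -> {job_id: expiry_time}

def request_reschedule(job_id):
    # Stub: would tell the civet server to hand the job to someone else.
    pass

def run_step(cmd, git_server, job_id):
    result = subprocess.run(cmd)
    if result.returncode == NO_FAIL_EXIT:
        # A "No Fail" step (clone, fetch, key check, ...) failed: remember
        # the job locally and ask the server to reschedule it.
        skip_lists.setdefault(git_server, {})[job_id] = time.time() + SKIP_TTL
        request_reschedule(job_id)
        return False
    return result.returncode == 0  # anything else failing is a real failure

def should_take_jobs(git_server):
    now = time.time()
    # Expire old entries, so a git server that recovers isn't shunned forever.
    live = {j: t for j, t in skip_lists.get(git_server, {}).items() if t > now}
    skip_lists[git_server] = live
    return len(live) < MAX_SKIPPED
```

This keeps all the state on the client, so the server just sees an ordinary reschedule request and nothing extra has to be stored in the database.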
I thought it would be far better to restart clients that automatically stopped than to go through and invalidate a huge number of jobs. This obviously isn't the most robust fix, but it would probably be straightforward to implement and would ease the annoyance of this relatively rare problem.
Anyway, I obviously never got around to implementing this, so there might be problems with this approach.
@brianmoose - Are you looking for a job yet? 😄 We are hiring, and yours would come with a decent raise.
Hah! Not looking as yet, just getting back from an Alaska trip. I will be in Idaho Falls in a week or two if anybody wants to get a :beer:!
Awesome - I'm sure several of us will go have a beer with you. Looking forward to hearing about your adventures.