machine Clean up inconsistent states in MongoDB

MongoDB or the job server can crash at any time, leaving inconsistent state behind. We should have a periodic job to clean that state up. This is really about build states getting stuck.

Jan 19 '24 20:01 johnml1135

Here is a plan:

On startup, machine-job checks whether any jobs are marked "Processing" and gets them up and running again (they may still be running on ClearML, may need to be restarted, or may be complete without having registered it).
A new gRPC endpoint is added through which Serval calls Machine:
- Serval -> Machine: Here is the current list of BuildIDs that I believe are either active or pending.
- Machine: Check the list against the MongoDB entries for running builds
- Machine -> Serval: Here is an array of statuses for each BuildID that you gave me.
- Serval: Update MongoDB with entries from Machine
This gRPC endpoint is invoked:
- When Serval-API starts up (retrying every 10 seconds until it succeeds)
- When a specific build is requested and its data is over 1 minute old (and Machine has not pushed an update to Serval within the last minute either)
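
The exchange above could be sketched roughly as follows. This is only an illustration of the reconciliation logic, not the actual Serval or Machine code: the status names, `machine_report`, and `serval_reconcile` are all hypothetical, and plain dictionaries stand in for the MongoDB collections and the gRPC call.

```python
from enum import Enum


class BuildStatus(str, Enum):
    # Hypothetical status names, chosen for this sketch only.
    PENDING = "Pending"
    ACTIVE = "Active"
    COMPLETED = "Completed"
    FAULTED = "Faulted"
    UNKNOWN = "Unknown"


def machine_report(build_ids, machine_builds):
    """Machine side: return a status for every build ID Serval asked about."""
    return {b: machine_builds.get(b, BuildStatus.UNKNOWN) for b in build_ids}


def serval_reconcile(serval_builds, machine_builds):
    """Serval side: send open build IDs, apply whatever Machine reports back.

    Returns only the entries that actually changed.
    """
    open_states = (BuildStatus.PENDING, BuildStatus.ACTIVE)
    open_ids = [b for b, s in serval_builds.items() if s in open_states]
    # In the real system this call would be the proposed gRPC endpoint.
    report = machine_report(open_ids, machine_builds)
    changed = {}
    for build_id, status in report.items():
        if serval_builds[build_id] != status:
            serval_builds[build_id] = status
            changed[build_id] = status
    return changed
```

A build that Serval believes is active but Machine has no record of comes back as `UNKNOWN` in this sketch; how such a build should be treated is left open here.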

Apr 09 '24 17:04 johnml1135

I don't think we can switch all running jobs to pending on Machine startup. For example, a job can be running just fine on ClearML even if Machine restarts. I would really like to avoid adding any gRPC endpoints to deal with this issue. It isn't the responsibility of Serval to deal with the inconsistencies. It should be the sole responsibility of the engine. There are many ways that this issue could be dealt with depending on how the engine is implemented. I want to give the engine the freedom to handle the inconsistencies in the best way possible.

Apr 10 '24 14:04 ddaspit

I updated the original description to reference machine-job and to keep jobs running if it can. Only if machine-job itself restarts should there really be an issue.

As for another gRPC endpoint, Serval may not know that a job is complete (even though Machine tried to reach it). As far as Machine knows, the job is complete, but when a user asks for the status, Serval doesn't check with Machine and just returns its incorrect status. Do you have another way of resolving or syncing these two data sources?

Apr 10 '24 19:04 johnml1135

The inconsistency can occur because, when a job completes, Machine needs to update the database and notify Serval using gRPC. These two operations need to be atomic, but they aren't. This is a common issue for distributed systems. Luckily, there is a pattern to handle this issue, called the transactional outbox pattern. Basically, we perform the database update and write the gRPC message to a database outbox in a single transaction. There is a separate process that monitors the database outbox and actually sends the message to Serval. This guarantees that the message is sent eventually even if Serval is down.
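
To make the idea concrete, here is a minimal sketch of the pattern. It is an assumption-laden illustration, not Machine's implementation: it uses Python's built-in `sqlite3` as a stand-in for MongoDB, and the table names, `complete_build`, `deliver_outbox`, and the `send` callback (standing in for the gRPC call to Serval) are all hypothetical.

```python
import json
import sqlite3


def create_schema(conn):
    # Hypothetical tables: one for build state, one for the outbox.
    conn.executescript("""
        CREATE TABLE builds (id TEXT PRIMARY KEY, state TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL
        );
    """)


def complete_build(conn, build_id):
    """Update the build and queue the notification in ONE transaction."""
    message = {"build_id": build_id, "state": "Completed"}
    with conn:  # commits both writes together, or rolls both back on error
        conn.execute(
            "UPDATE builds SET state = 'Completed' WHERE id = ?", (build_id,)
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)", (json.dumps(message),)
        )


def deliver_outbox(conn, send):
    """Separate process: send queued messages in order, deleting each on success.

    `send` stands in for the gRPC call; it is assumed to raise
    ConnectionError while Serval is unreachable.
    """
    delivered = 0
    rows = conn.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for msg_id, payload in rows:
        try:
            send(json.loads(payload))
        except ConnectionError:
            break  # keep this message and everything after it for a later retry
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (msg_id,))
        delivered += 1
    return delivered
```

Note that a crash between a successful `send` and the `DELETE` causes the message to be sent again on the next run, so delivery is at-least-once and the receiver has to tolerate duplicates.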

Apr 15 '24 21:04 ddaspit

Yes, the transactional outbox looks a bit more elegant - I'll work on implementing it then instead of the other gRPC endpoint.

Apr 19 '24 15:04 johnml1135

I thought about using Hangfire a bit more and realized that it won't work. We want to be able to update the outbox and the model in the same transaction. If we use Hangfire, the outbox would be the Hangfire queue, which is stored in a separate database. So there is no way to update the Machine database and the Hangfire database in a single transaction. I think this means that we will need to implement our own transactional outbox.

Apr 19 '24 19:04 ddaspit

Here is a sample implementation of the transactional outbox pattern for .NET.

Apr 19 '24 20:04 ddaspit