ROBUSTNESS: Are JobCollection objects written atomically?
Vignette 'Migrating from BatchJobs/BatchExperiments' says:
- "Nodes do not have to access the registry. submitJobs() stores a temporary object of type JobCollection on the file system which holds all the information necessary to execute a chunk of jobs via doJobCollection() on the node. This avoids file system locks because each job accesses only one file exclusively."
What happens if a node reads such a JobCollection file before the master has finished writing it? Sure, the chance of this should be small, but with 10,000-100,000s of jobs, could it happen? If it could, I wonder whether the file format can protect against this, e.g. by at least generating a parse error instead of, in the worst case, silently parsing the incomplete file.
If it is not already written atomically, could that be achieved by simply writing to foo.rds.tmp, which is then renamed to foo.rds only after saveRDS() completes? (Here I'm just assuming you're using RDS files.)
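For illustration, such an atomic write could look like the sketch below. This assumes RDS files are indeed used; `saveRDSAtomic()` is a hypothetical name, and the rename is atomic only when source and target live on the same file system.

```r
## Hypothetical write/rename helper: a reader only ever sees either the
## complete file or no file at all. file.rename() maps to rename(2),
## which is atomic on POSIX file systems within a single file system.
saveRDSAtomic <- function(object, file) {
  tmp <- paste0(file, ".tmp")
  saveRDS(object, file = tmp)
  if (!file.rename(tmp, file)) {
    stop(sprintf("Failed to rename '%s' to '%s'", tmp, file))
  }
  invisible(file)
}
```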
Jobs which fail to find a JobCollection file will terminate with an error but appear as "expired" on the master because they were unable to communicate anything back. The log files should be informative if the template file redirects the output and does not rely on batchtools to do so.
However, there are several mechanisms which should guard against this:
- `saveRDS()` blocks the session, and `submitJob()` is called only after `saveRDS()` has returned.
- I use a little helper here to write the files and then wait for the file system / file system caches to find the file (see the sketch after this list).
- The write/rename approach is already implemented for registries, where I want to be extra fail-safe and also keep a backup. I can port this to other write operations, too. The only possible downside I see is that this incurs many more stat() system calls; I'll investigate whether this is a problem on some network file systems.
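For concreteness, here is a minimal sketch of what such a write-and-wait helper might look like. This is not the actual batchtools helper; the name, timeout, and polling interval are illustrative.

```r
## Hypothetical "write, then wait until visible" helper: save the object,
## then poll until the file system (or its attribute caches) reports the
## file as present, so subsequent steps can rely on it being readable.
writeAndWait <- function(object, file, timeout = 30, sleep = 0.5) {
  saveRDS(object, file = file)
  started <- Sys.time()
  while (!file.exists(file)) {
    if (difftime(Sys.time(), started, units = "secs") > timeout) {
      stop(sprintf("File '%s' not visible after %g seconds", file, timeout))
    }
    Sys.sleep(sleep)
  }
  invisible(file)
}
```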
It is possible that the described mechanisms are not sufficient. For some reason, systems behave differently here than with the result files, where a call to list.files() is required before the files become visible. Therefore I would like to stick with the current implementation until problems are reported. Ok with you @HenrikBengtsson?
Just to be clear, I posted this issue based on reading the vignette, not based on a problem I have experienced. I must admit that this one slipped as I didn't really have time to dive into the details of your explanation. It sounds like you're saying there's little to no risk of race conditions when it comes to JobCollection objects.
For other RDS files: since you're using the RDS file format (and compressed, too), I think (= hope) that is good enough protection against one process reading a half-complete RDS file that another process hasn't finished writing yet; I hope it gives a read error. The save to *.rds.tmp and then rename to *.rds was my idea to lower the risk for this even further, since the reader would never see a half-baked file (only a complete file or no file at all). If you're using writeRDS() everywhere, it sounds like an easy task to just add it there.
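One way to check that hope empirically (an illustrative snippet, not batchtools code): truncate a compressed RDS file and confirm that readRDS() fails loudly rather than returning partial data.

```r
## Simulate a reader seeing a half-written RDS file: write a complete
## file, overwrite it with a truncated copy, and verify readRDS() errors.
f <- tempfile(fileext = ".rds")
saveRDS(1:1e6, f)                              # compressed by default
head_bytes <- readBin(f, what = "raw", n = 100)
writeBin(head_bytes, f)                        # f now holds only the first 100 bytes
res <- tryCatch(readRDS(f), error = identity)
inherits(res, "error")                         # expected: TRUE, i.e. a read error
```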
But I fully trust you here, especially since I only know so much about the underlying design. Feel free to ignore this issue until it really becomes an issue, if at all.