
[Proposal]: Better handling of MD-type jobs where a termination may not be a failure

Open tomdemeyere opened this issue 1 year ago • 1 comments

What new feature would you like to see?

This proposal aims to extend the work of the terminate() function by having it call a new failed_schema() function. Ideally, this function would attempt to fetch whatever results are currently available. For example, in the case of a timed-out MD run, the traj and log should be read and put in a dictionary in the same way that this is already done.

The problem, as mentioned by @Andrew-S-Rosen, is that:

To do what you're suggesting, we would need to try to prepare a schema, write it to disk, and then terminate. Output files will also not always be able to be parsed (e.g. if the calculator crashes instantly), and this would cause the schema generation to crash.

Indeed, such a function would need to be full of try/except blocks, as no assumption can be made about the current state of the calculation. My interpretation is that such a function should not attempt to summarise results (no call to pymatgen, etc.) but merely read what is available, since the calculation is not done. In the case of the logfile and trajfile that is easy: the files are known. In the case of software-specific files, it would be nice to come up with a way to attempt to read them, for example by using #2407.

From the discussions in #2399

tomdemeyere avatar Aug 10 '24 09:08 tomdemeyere

This is certainly doable. That said, I would propose that this behavior is toggleable via the global settings, such as SETTINGS.STORE_FAILED_JOBS: bool = False. There are two reasons for this: 1) cloud databases like MongoDB are often limited on space, and storing failed calculations may not be desirable; 2) storing the failed jobs to disk and/or database would be a fairly notable breaking change since anyone querying their database for calculation results would now have to add an additional query that only selects "successful" calculations. Silently storing failed outputs in the database will cause downstream problems for people, so this would be an opt-in feature.

It does not seem terribly difficult to implement. The idea would basically be to use a more flexible version of quacc.schemas.ase.Summarize.run to parse the code's main log file along with the input Atoms and calculator parameters. We would also need to store the job state for all jobs (success or failure) so this can be queried. If the parse is unsuccessful (say it fails immediately due to some weird input parameters), then there is not much to store other than the input Atoms and the calculator parameters along with the job state.
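As a rough sketch of the flow described above (not an existing quacc API): the module-level STORE_FAILED_JOBS flag stands in for the proposed SETTINGS.STORE_FAILED_JOBS, and summarize_job(), its arguments, and the "job_state" and "results" keys are hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical stand-in for the proposed SETTINGS.STORE_FAILED_JOBS (off by default).
STORE_FAILED_JOBS = False


def summarize_job(
    inputs: Dict[str, Any],
    parse_log: Callable[[], Dict[str, Any]],
    succeeded: bool,
    store_failed_jobs: bool = STORE_FAILED_JOBS,
) -> Optional[Dict[str, Any]]:
    """Build a summary that always carries a job state (illustrative sketch).

    `inputs` holds the input Atoms and calculator parameters; `parse_log`
    parses the code's main log file and may raise if the file is unusable.
    """
    if not succeeded and not store_failed_jobs:
        # Opt-in not enabled: failed jobs are not stored at all.
        return None

    summary = {**inputs, "job_state": "success" if succeeded else "failed"}
    try:
        summary["results"] = parse_log()
    except Exception:
        if succeeded:
            raise  # a successful job that cannot be parsed is a real error
        # Failed job with an unparsable log: keep only inputs and job state.
        summary["results"] = None
    return summary
```

With a stored job state, a database query for finished calculations would filter on `job_state == "success"`, which is the extra query condition that makes the opt-in default important.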

Andrew-S-Rosen avatar Aug 10 '24 15:08 Andrew-S-Rosen