librascal User interface and better solvers for GAP fitting

This PR aims to improve the user interface of the potential-fitting step, starting with fitting of SOAP-GAP potentials. In particular, it provides an interface to automate many of the common tasks involved in GAP fitting, allowing non-interactive (batch-style) fitting based on a flexible Python interface and easing the transition from QUIP.

Edit 24.06.2021: Incorporate the RKHS solver from #354 - description below:

Implement an RKHS solver for the sparse GPR problem.

Solving the GPR problem using the "normal" equation is very ill-conditioned. This implements an alternative solver that is much better behaved. Might be good to also include different options in terms of using solve or lstsq, let's see where this goes - however this already works.

Main changes:

Simplify the train_gap_model function (now gaptools.fit_gap_simple()):
- Improve modularity, remove dependence on StructureManagers
- Move kernel computation out of main function
- Make regularizers more predictable by removing implicit variance normalization (the user should almost always specify the variance / function scale explicitly)
Add a script that reads a parameter file to control fit options
- Example parameter file included
Move the solver of the sparse GPR "normal" equations out of the fitting functions
- Add RKHS and QR solvers as options of the solver class

In progress or not yet implemented (wishlist):

Feature sparsification
Training with virials
Automatic variance scaling if requested

(Probably for future PRs:)

Multi-kernel (multi-SOAP) fitting
Automate uncertainty estimation (fitting with different data subsets)
Interface to automate regularizer optimization (including saving kernels)

Tasks before review:

[ ] Add tests, examples, and complete documentation of any added functionality
- [ ] Make sure the autogenerated documentation appears in Sphinx and is formatted correctly (ask @max-veit if you need help with this task).
- [ ] Write additional documentation in the Sphinx (not docstrings!) to explain the feature and its usage in plain English
- [ ] Make sure the examples run (!) and produce the expected output
[x] Run make lint on the project, ensure it passes
- [ ] Run make pretty-cpp and make pretty-python, check that the auto-formatting is sensible
- [ ] Review variable names, make sure they're descriptive (ask a friend)
- [ ] Review all TODOs, convert to issues wherever possible
- [ ] Make sure draft, in-progress, and commented-out code is moved to its own branch
[ ] Merge with master and resolve any conflicts

Dec 02 '20 21:12 max-veit

A few comments/questions to this draft:

I feel it would be beneficial to separate file saving and computation functionality, e.g. calculate_and_sparsify.
Does the WORKDIR parameter work as intended atm ?
Do you plan to integrate training with virials into gaptools ?
Do you plan on handling partial reference, e.g. some structures with energy but no forces and vice versa ?
Do you plan to handle computation of the kernel on multiple processor ?
Is this the right place to get an ase.Atoms sanitizer ? (The full periodic and no pbc cases are already around so only partial periodicity really needs to be implemented)
Putting parameters in a json (or yml) file is the right way I think.
Do you plan to integrate a cross validation in the executable ?
Does the dedicated ipi interface belong to this PR ?

I guess many of these suggestions can be offloaded to the wishlist of the next PR too.

Dec 07 '20 10:12 felixmusil

One at a time:

I feel it would be beneficial to separate file saving and computation functionality, e.g. `calculate_and_sparsify`.

Sure, although some of that was intended as diagnostic information or just to have a few "save points" in case the fit fails e.g. due to lack of time or memory. If you're referring specifically to this line: https://github.com/cosmo-epfl/librascal/blob/9ae737649005fb1773a431b73db452060f8169cf/bindings/rascal/models/gaptools.py#L98 then yes, I suppose as long as that's included and easily accessible in the final model, then writing the sparse points is no longer necessary. But some of these files (e.g. writing out kernel files) will be useful for planned functionality, like regularizer optimization. Let's discuss this further once the PR is closer to being ready.

Does the WORKDIR parameter work as intended atm ?

It's read by the fit_gap.py script from the json parameter file, not from environment variables (although I guess that could also be made a possibility).

Do you plan to integrate training with virials into gaptools ?

Yes, I'll just go ahead and add that to the wish list.

Do you plan on handling partial reference, e.g. some structures with energy but no forces and vice versa ?

I suppose so, though I don't quite see the use case - usually you'll want to fit on a dataset where everything is computed at the same level of theory, and usually that also means having all structures with the same type of data (e.g. all with forces or all without). In any case, I don't think it would be too hard to implement.

Do you plan to handle computation of the kernel on multiple processor ?

That may indeed be useful, but I think it'll have to wait for another PR.

Is this the right place to get an ase.Atoms sanitizer ? (The full periodic and no pbc cases are already around so only partial periodicity really needs to be implemented)

Not really, but I don't see an easy way to automatically turn on atoms wrapping (which I think should be the default option tbh), so it was easier at the time to put in 5 lines of Python and just do it myself. For the non-periodic case we can use the rascal.neighbourlist.structure_manager.sanitize_non_periodic_structure function.

Putting parameters in a json (or yml) file is the right way I think.

Yep, this was deliberately done for compatibility with the rest of librascal, and especially the representation parameters are very clear when stored in this format.

Do you plan to integrate a cross validation in the executable ?

This would be part of the regularizer optimization, which might be better off in its own PR.

Does the dedicated ipi interface belong to this PR ?

That was a mistake, development has been moved to its own branch.

Dec 07 '20 11:12 max-veit

Could we add this small change https://github.com/cosmo-epfl/librascal/pull/367/files#diff-ee5297016f816d41032a665df5ca536e26f6eabae71f698a1f45036ba12342ca for better compatibility with the cpp side. So the model json can be also loaded very simply from the cpp side.

Jul 01 '21 15:07 agoscinski

I have no objection, can you cherry-pick this into this branch?

On Thu, 1 Jul 2021 at 17:45, agoscinski @.***> wrote:

Could we add this small change

https://github.com/cosmo-epfl/librascal/pull/367/files#diff-ee5297016f816d41032a665df5ca536e26f6eabae71f698a1f45036ba12342ca for better compatibility with the cpp side. So the model json can be also loaded very simple from the cpp side.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cosmo-epfl/librascal/pull/305#issuecomment-872353570, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIREZ73GTUDKVQPPL2OU4DTVSESZANCNFSM4ULBFUNQ .

Jul 01 '21 15:07 ceriottm

Would just add these 4 lines in this branch, seems the simplest solution. The lammps integration PR will be very likely thrown away and split into smaller PRs as soon it starts to work.

I don't want to mess in your work, so I would let one of you do it.

Jul 02 '21 12:07 agoscinski

Which four lines? I think it's better if you add them because I don't understand what they do, while it seems you do Michele

On Fri, 2 Jul 2021 at 14:42, agoscinski @.***> wrote:

Would just add these 4 lines in this branch, seems the simplest solution. The lammps integration PR will be very likely thrown away and split into smaller PRs as soon it starts to work.

I don't want to mess in your work, so I would let one of you do it.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/cosmo-epfl/librascal/pull/305#issuecomment-872968612, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIREZ6SOS6KFXXFXUIKJLDTVWX3HANCNFSM4ULBFUNQ .

Jul 02 '21 18:07 ceriottm

Would just add these 4 lines in this branch, seems the simplest solution. The lammps integration PR will be very likely thrown away and split into smaller PRs as soon it starts to work.

I'm not sure I understand either; it seems you'd need all of examples/cpp/krr_model.cc to be able to read JSON model files in C++.

And don't forget, this is the PR where we're essentially defining the model file format, so this would be a good place to make sure they're consistent between C++ and Python.

I don't want to mess in your work, so I would let one of you do it.

Don't worry about that; just add what needs to be added to make it work and we'll sort out the rest later.

Jul 05 '21 14:07 max-veit

I'm not sure I understand either; it seems you'd need all of examples/cpp/krr_model.cc to be able to read JSON model files in C++.

Yes the model is loaded here https://github.com/cosmo-epfl/librascal/pull/367/files#diff-b7526296cb17ab151ab257661e0811b2b8143cbb94006ec8a9b661892bda66a7R70-R101

Luckily all classes have already constructors using only a json as input, so I just need to select the part which is relevant to the class. However, for our python classes have different signatures/declarations than the cpp classes, so I change the the _get_data and _set_data functions to store the cpp objects signature instead of python signature. Here you can see the effective difference in model json format https://github.com/cosmo-epfl/librascal/pull/367/files#diff-b7a2aa979a5e653796cbe03268f9c76c9fec81f00d01d53fd5d431356ce9268b

I have also added the same changes as in the representation python classes for the python kernel class, I did not yet needed it in the lammps branch, because the cpp and python signature are very close to each other, but I think it should be added to all files which have a corresponding cpp class.

Jul 06 '21 10:07 agoscinski

Okay it is not that simple as I initially thought. I think the easiest solution would be to add a new python constructor to the representation classes and kernel to allow initialization with cpp inputs, then we change _get_init_params to outputting the cpp inputs which are already stored in the self.hypers dict.

The cleanest solution would be to make the constructor of the python and cpp classes accepting the same json/dict. Since the python signature is much cleaner and also used by people outside of this project, so they would not need to change their code. This would mean to adapt some cpp code. Mostly annoying because of all the test data which needs to be changed, but I don't see anything that would prevent doing this (for example input hypers on the python side which do not exist on the cpp side). I'll come back on this next week.

Jul 06 '21 16:07 agoscinski

So I'd be OK if we could set it to be an actual zero - having the baseline output a vector of zeros with the appropriate shape seems to be a real pain, which is what triggered this commit - per atom training was breaking with structures of varying size when using the baseline.

On Mon, 12 Jul 2021 at 15:21, Max Veit @.***> wrote:

@.**** commented on this pull request.

In bindings/rascal/models/gaptools.py https://github.com/cosmo-epfl/librascal/pull/305#discussion_r667924352:

@@ -244,7 +244,7 @@ def fit_gap_simple( energies, kernel_energies_sparse, energy_regularizer_peratom,

energy_atom_contributions,

energy_atom_contributions=None,

That would also avoid a lot of the extra logic implemented further down here.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/cosmo-epfl/librascal/pull/305#discussion_r667924352, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIREZ6JBH4RG63G4YCAM6DTXLT37ANCNFSM4ULBFUNQ .

Jul 12 '21 14:07 ceriottm

Ok, that's weird, what was the actual error you got? Was it coming from gaptools._get_energy_baseline() or from somewhere in the KRR class?

It seems the current KRR class is very much designed only for potential energy surfaces, i.e. predicting sum values (energies) and their gradients (forces). I would prefer to split it off into two subclasses, one KRRPES or something like that, and one that just handles generic (per-atom or per-structure) properties.

Jul 13 '21 14:07 max-veit

@ceriottm instead of messing with the notebooks when they really haven't changed in content, perhaps you could instead just do a git restore <notebook>.ipynb? It would also give a much cleaner commit history...

Aug 20 '21 16:08 max-veit

Ok, this is almost ready for review -- it has all the functionality I was planning to add here, now just needs some testing.

The main major change since the last time this was updated was to add a Kernel class that works with feature matrices explicitly, rather than rascal StructureManagers. This is of course less efficient than using the native rascal Kernel class, but it also means we can work with representations that aren't currently computed in librascal (like LODE).

Jul 11 '22 09:07 max-veit

librascal librascal copied to clipboard

User interface and better solvers for GAP fitting

@.**** commented on this pull request.

librascal
librascal copied to clipboard