Computation of kernel between atomic environments takes a long time using rascal.models.kernel
Hello everyone,
I am computing the polynomial kernel between atomic environments using the rascal.models.Kernel class:

from rascal.representations import SphericalInvariants
from rascal.models import Kernel
from rascal.neighbourlist.structure_manager import mask_center_atoms_by_species

# keep only hydrogen (Z=1) centers
for structure in structures_train:
    mask_center_atoms_by_species(structure, species_select=[1])

calculator = SphericalInvariants(**hypers)
atoms_list_train = calculator.transform(structures_train)
kernel = Kernel(calculator, target_type="Atom", zeta=2)
kernel_computed = kernel(atoms_list_train)
According to cProfile this takes about 42 seconds.
However, when I compute the same kernel using the sklearn.metrics.pairwise.polynomial_kernel implementation:

import numpy as np
import sklearn.metrics.pairwise

X_train = calculator.transform(structures_train).get_features(calculator)
Kernel_sklearn = sklearn.metrics.pairwise.polynomial_kernel(X_train, degree=2, gamma=1., coef0=0)
np.allclose(kernel_computed, Kernel_sklearn)  # returns True
According to cProfile, the calculation of this kernel only takes about 2 seconds, and the transformation of the features takes about 1 second. Is there something I am missing?
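For reference, the two timings can also be compared with a simple wall-clock measurement instead of cProfile; a minimal sketch, reusing the objects defined above:

import time

start = time.perf_counter()
kernel_computed = kernel(atoms_list_train)  # librascal kernel
print("librascal kernel:", time.perf_counter() - start, "s")

start = time.perf_counter()
Kernel_sklearn = sklearn.metrics.pairwise.polynomial_kernel(
    X_train, degree=2, gamma=1., coef0=0)   # dense numpy/sklearn kernel
print("sklearn kernel:", time.perf_counter() - start, "s")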
That's unexpected! Do you have a full script we could use to reproduce this?
Dear bananenpampe,
Thanks for reporting! There are many factors that influence the speed of kernel computation (in librascal and elsewhere), so it would help to have more information on what environment you're running this in and what sort of structures you're testing this on, in particular the atomic species composition. The librascal Kernel calculator is designed to run quickly especially in cases where you have many different atomic species present in a system, but only a few will be present in any given environment, meaning the feature matrix -- X_train in your example -- is effectively (block-)sparse. But it should not be any slower than direct matrix multiplication in any case.
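To make the block-sparsity point concrete, here is a toy numpy sketch (not librascal's actual storage layout): when the dataset contains many species but each environment only contains a few of them, most of each row of the dense feature matrix is exactly zero.

import numpy as np

n_env, n_species, n_feat = 1000, 8, 50
X = np.zeros((n_env, n_species * n_feat))

rng = np.random.default_rng(0)
for i in range(n_env):
    # each environment only contains 2 of the 8 species, so only
    # 2 of the 8 per-species blocks in its feature row are non-zero
    for s in rng.choice(n_species, size=2, replace=False):
        X[i, s * n_feat:(s + 1) * n_feat] = rng.random(n_feat)

print("non-zero fraction:", np.count_nonzero(X) / X.size)  # ~0.25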
A few things to check:
- Have you compiled in Release mode? (i.e. not Debug)
- Is it possible that the sklearn calculation is running in OpenMP parallel mode? (Use export OMP_NUM_THREADS=1 to be sure; see the snippet after this list.) We have not yet implemented OpenMP parallelism in librascal, though some of the underlying Eigen operations should still be vectorized IIRC.
- I notice you're using center atom masking; we haven't extensively tested kernels with that. What is the performance without using the masking?
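A sketch of one way to pin the comparison to a single thread, assuming everything is launched from the same Python script (the environment variables can equally be exported in the shell beforehand):

import os

# set these before numpy / sklearn are imported, so the
# BLAS / OpenMP thread pools pick them up
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np
import sklearn.metrics.pairwise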
Thanks!
Hello,
Thank you for your fast responses.
I have run cmake without any options; that should be the Release option, right?
Changing the OMP environment variable and not using masking has not changed the behaviour.
I have included a script that should reproduce the behaviour, and the data file that I am using, which should be placed in the same directory: reproduce.py.zip, CSD-2k_relaxed_shifts.txt.zip
Hello,
Could you try setting "expansion_by_species_method":"structure wise" in your input for SphericalInvariants? In principle, it should give similar timings to using numpy directly. If it does not, could you then try "expansion_by_species_method":'user defined', "global_species":[THE ATOMIC NUMBERS PRESENT IN YOUR DATASET]? This should be strictly the same as using numpy or sklearn, but who knows...
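For concreteness, a sketch of where that key would go in the SphericalInvariants hyperparameters (the other keys and values below are typical SOAP settings used as placeholders, not the ones from the original report):

hypers = {
    "soap_type": "PowerSpectrum",
    "interaction_cutoff": 3.5,
    "max_radial": 6,
    "max_angular": 6,
    "gaussian_sigma_type": "Constant",
    "gaussian_sigma_constant": 0.4,
    "cutoff_smooth_width": 0.5,
    # suggested setting:
    "expansion_by_species_method": "structure wise",
    # alternative suggested above:
    # "expansion_by_species_method": "user defined",
    # "global_species": [1, 6, 7, 8],
}
calculator = SphericalInvariants(**hypers)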
Including "expansion_by_species_method":"structure wise"
improves the performance (5.5 seconds). Including "expansion_by_species_method":'user defined', "global_species":[THE ATOMIC NUMBERS PRESENT IN YOUR DATASET]
helps as well (6.5 seconds).
For a larger number of environments, sklearn is three times faster using the options you mentioned.
Looks ok to me, thanks for your tests. @Luthaf, @ceriottm and @agoscinski, what do you think? Could we also use a numpy implementation rather than the rascal one in this case? Or at least give the possibility, because the sparse storage is quite important for some applications.
We would just need to replace the one line in the kernel computation for the kernel_type Full case with a numpy multiplication, right? This would still allow people to use the sparse storage in cases where it is needed, while using the faster numpy routines in the full case.
https://github.com/cosmo-epfl/librascal/blob/master/bindings/rascal/models/kernels.py#L143
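As a sketch of what that replacement could look like for the dense case (this is an illustration, not the actual code at the linked line; it assumes the features have already been collected into a dense numpy array):

import numpy as np

def full_kernel_dense(X, Y=None, zeta=2):
    """Dot-product kernel raised to the power zeta between dense
    feature matrices, intended to match the kernel_type Full,
    target_type Atom result."""
    if Y is None:
        Y = X
    return np.asarray(X @ Y.T) ** zeta

# e.g. full_kernel_dense(X_train, zeta=2) should reproduce kernel_computed above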