
Optical and transport data as elemental pseudo-inverse contributions

Open gbrunin opened this issue 2 years ago • 4 comments

Summary

This is work done with @davidwaroquiers, @gpetretto and @gmrigna.

The idea is to use the data from refractiveindex.info and the transport properties from the Materials Project to featurize new systems based on their composition.

As an example, take the effective mass of electrons. The MP provides more than 45,000 systems with corresponding effective masses. We can write the equation

Composition matrix × Pseudo-inverse contributions ≃ Effective masses

where the composition matrix and the effective-mass vector each have more than 45,000 rows (one per system). The composition matrix has one column per chemical element present in the dataset, the effective masses form a single column, and the pseudo-inverse contributions form a single column with one row per element. The pseudo-inverse contributions can be computed for any given dataset. They are the least-squares fit between the compositions and the effective masses, and can be seen as the average contribution of each element to the effective mass when that element is present in a system (a contribution can be negative if the presence of an element generally decreases the effective mass).
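To make the fit concrete, here is a minimal sketch with NumPy. The element labels, compositions and target values are toy numbers invented for illustration (not MP data), and the `featurize` helper is hypothetical: it shows one plausible way to turn the contributions into a composition feature (a composition-weighted sum), which may differ from what the PR actually does.

```python
import numpy as np

# Toy dataset (made-up values): 4 systems over 3 elements.
# Each row holds the atomic fractions of one system.
elements = ["A", "B", "C"]
composition = np.array([
    [0.50, 0.50, 0.00],
    [0.50, 0.00, 0.50],
    [0.50, 0.25, 0.25],
    [1.00, 0.00, 0.00],
])
effective_mass = np.array([0.07, 0.20, 0.13, 0.50])  # one value per system

# Least-squares solution of: composition @ contributions ≃ effective_mass,
# obtained through the Moore-Penrose pseudo-inverse.
contributions = np.linalg.pinv(composition) @ effective_mass


def featurize(fractions):
    """Hypothetical feature: composition-weighted sum of the contributions."""
    return float(np.asarray(fractions) @ contributions)


for element, value in zip(elements, contributions):
    print(f"{element}: {value:+.3f}")
print("feature for A0.5 B0.5:", featurize([0.5, 0.5, 0.0]))
```

With the real dataset the composition matrix would have more than 45,000 rows, but the computation is the same.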

From our tests on industrial cases, including these pseudo-inverse contributions as composition features improves the ML models, although the gain depends on the property being predicted.

In this PR, we have done this for optical data (refractive index, extinction coefficient, reflectivity), as taken from refractiveindex.info, and for transport properties (all those present in the MP). For optical data, the properties are spectra, and by default 10 wavelengths are selected in the visible range. The range and the sampling can be changed by the user if, say, the IR spectrum is more important for their application. The code can also be used to generate new pseudo-inverse contributions from new data and add these as features.
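As an illustration of the wavelength selection, the sketch below resamples a spectrum on a small grid. The function name `sample_spectrum`, the visible-range bounds (0.38 to 0.78 µm), the use of linear interpolation and the synthetic Cauchy-like spectrum are all assumptions made for this example, not the PR's actual API or data.

```python
import numpy as np


def sample_spectrum(wavelengths, values, wl_min=0.38, wl_max=0.78, n_points=10):
    """Resample a spectrum on n_points evenly spaced wavelengths (in µm).

    `wavelengths` must be sorted in increasing order, as np.interp requires.
    """
    grid = np.linspace(wl_min, wl_max, n_points)
    return grid, np.interp(grid, wavelengths, values)


# Synthetic refractive-index spectrum (Cauchy-like form, illustrative only).
wl = np.linspace(0.2, 2.0, 200)
n = 1.5 + 0.01 / wl**2

# Default: 10 points in the (assumed) visible range.
grid, n_sampled = sample_spectrum(wl, n)

# A user interested in the near IR could change the range and the sampling.
ir_grid, n_ir = sample_spectrum(wl, n, wl_min=0.8, wl_max=2.0, n_points=5)
```

Each sampled wavelength would then get its own set of pseudo-inverse contributions, since the fit is done per target quantity.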

TODO

Since the user can change the range and sampling of the optical spectra, the whole database from refractiveindex.info has to be stored. We have added it in tar.xz format (< 2 MB). The code starts by untarring the file into a ~/.matminer directory; the location can be changed manually by the user if this is not desirable. This avoids adding too many untarred files to the source tree, which would more than double the current size of the repo. This is of course open for discussion, depending on what you would prefer.
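For discussion, the extraction step could look roughly like the sketch below. The function name `extract_database`, the `optical_db` subdirectory and the file names are hypothetical and do not reflect the actual code in the PR; the demo builds a tiny archive in a temporary directory so that nothing is written to the real ~/.matminer.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_database(archive, cache_dir=None):
    """Extract a .tar.xz archive once into a cache directory and return its path.

    Defaults to ~/.matminer (hypothetical layout); pass cache_dir to override.
    """
    cache_dir = Path(cache_dir) if cache_dir else Path.home() / ".matminer"
    target = cache_dir / "optical_db"
    if not target.exists():  # skip the work if already extracted
        target.mkdir(parents=True)
        with tarfile.open(archive, "r:xz") as tar:
            tar.extractall(target)
    return target


# Demo with a throwaway archive in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "sample.yml").write_text("n: 1.5\n")
    archive = tmp / "db.tar.xz"
    with tarfile.open(archive, "w:xz") as tar:
        tar.add(tmp / "sample.yml", arcname="sample.yml")

    out = extract_database(archive, cache_dir=tmp / "cache")
    extracted = sorted(p.name for p in out.iterdir())
    print(extracted)
```

Extracting only when the target directory is missing keeps repeated featurizer calls cheap, at the cost of not detecting an updated archive.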

We are open to having a chat about all this if you think it is necessary. Maybe I did not explain everything correctly and things have to be clarified.

gbrunin avatar Nov 18 '22 13:11 gbrunin

Good work @gbrunin!

Following up on this PR, @computron is there anyone we should contact to have it merged or discussed?

Thanks,

David

davidwaroquiers avatar Jan 09 '23 08:01 davidwaroquiers

Hello @janosh,

Would you need any additional information to follow up on this PR?

Thanks,

David

davidwaroquiers avatar Feb 14 '23 10:02 davidwaroquiers

@davidwaroquiers Sorry to say I'm prob not the right person to merge this. Maybe ping @computron and @ardunn again for green-lighting a big PR like this one.

janosh avatar Feb 14 '23 14:02 janosh

> @davidwaroquiers Sorry to say I'm prob not the right person to merge this. Maybe ping @computron and @ardunn again for green-lighting a big PR like this one.

Hello @janosh,

Ok thanks for the update!

@computron and @ardunn do you need any additional input about this topic?

Best,

davidwaroquiers avatar Feb 14 '23 14:02 davidwaroquiers

3.9 failures are still simply for the Test PyPI upload which won't work from forks (I'll probably fix this at some point after this PR) -- see #933

ml-evs avatar Apr 09 '24 13:04 ml-evs

I'm happy that everything works locally, and we're fine to merge the dataset in. I'll raise a couple of minor issues that have come up, but otherwise great work and thanks again @gbrunin!

ml-evs avatar Apr 10 '24 13:04 ml-evs