modnet icon indicating copy to clipboard operation
modnet copied to clipboard

CompositionOnly featurizer preset does not use oxidation states

Open ml-evs opened this issue 4 years ago • 6 comments

As above. We should probably add the oxidation state featurizers used in the DeBreuck2020Featurizer to the CompositionOnly featurizer too.

ml-evs avatar Apr 13 '21 16:04 ml-evs

Hi! Was this implemented or is it still pending?

ancarnevali avatar Mar 22 '22 10:03 ancarnevali

Hi! Thanks for pulling this up. This hasn't been implemented but can easily be added manually. Small example:

from modnet.featurizers.presets import DeBreuck2020Featurizer
from modnet.preprocessing import MODData
from pymatgen.core import Composition

materials = [Composition("SiO2"),Composition("Li20")]
class CompositionOxidFeaturizer(DeBreuck2020Featurizer):
    def __init__(self):
        super().__init__()
        self.structure_featurizers = ()
        self.site_featurizers = ()

data = MODData(materials=materials, featurizer=CompositionOxidFeaturizer())
data.featurize()
print(data.df_featurized)

This will add 9 extra features compared to the current default one. Let me know if this helps. I might try this with some of the matbench tasks and implement this (maybe default).

ppdebreuck avatar Mar 22 '22 11:03 ppdebreuck

I will just add one extra bit to @ppdebreuck's suggestion, which is to add the fast_oxid attribute to the featurizer, otherwise pymatgen (the underlying library matminer uses for oxidation states) will try to allow for every site in a structure to have a different oxidation state which scales horribly with the number of sites. Doing something like:

class CompositionOxidFeaturizer(DeBreuck2020Featurizer):
    def __init__(self):
        super().__init__()
        self.structure_featurizers = ()
        self.site_featurizers = ()
        self.fast_oxid = True

will use more sensible defaults for oxidation state calculations that you can then test: https://github.com/ppdebreuck/modnet/blob/06ca45f104a697e65d02d5405d26aec78f712972/modnet/featurizers/featurizers.py#L161-L170)

(also see the relevant code in matminer/pymatgen)

ml-evs avatar Mar 22 '22 11:03 ml-evs

Thank you for your inputs! so if I understand correctly, with this class we just add features, and the "base" featurizer is launched with data.featurize() later on. By the way, since this would be a composition-only task, why are there self.structure_featurizers = () and self.site_featurizers = ()? Is it just needed to define them as empty?

ancarnevali avatar Mar 22 '22 12:03 ancarnevali

Thank you for your inputs! so if I understand correctly, with this class we just add features, and the "base" featurizer is launched with data.featurize() later on.

Yep, the base featurizer just knows to take each listed matminer featurizer in turn and apply it to the relevant sites, structure or composition.

By the way, since this would be a composition-only task, why are there self.structure_featurizers = () and self.site_featurizers = ()? Is it just needed to define them as empty?

Yeah, this is just convenience/laziness. You could instead inherit from the base featurizer and list out all of the matminer featurizer classes you wanted to use, but as we have them all already in the DeBreuck2020 preset it is easier base on that and disable the ones we don't want.

ml-evs avatar Mar 22 '22 13:03 ml-evs

At some point the IonProperty featurizer in matminer started using oxidation states (maybe it always did). We never ran into an issue with this as the poor scaling was probably being masked by the time taken for other featurizers, or we were only using it on simple compositions. As shown in #126, the featurizers basically hang if a sufficiently complex composition is attempted.

Possible workarounds:

  • Move IonProperty into the oxid_composition_featurizer list instead (so it will no longer be used unless there is a structure).
  • or allow for the selection of fast_oxid for the composition-only featurisers, as suggested above.

ml-evs avatar Jan 10 '23 15:01 ml-evs