chemprop [v2] [FEATURE]: eliminate `molecule_featurizer` input from `MoleculeDatapoint`

Description Currently, a MoleculeDatapoint object takes an input both extra features x_f: np.ndarray and a list of molecule featurizers to create extra features on the fly. We don't allow users to supply both even though they correspond to the same thing. Building extra features on-the-fly by way of a list of input molecule featurizers seems overly complicated. The alternative (and I think clearer) option is to remove this functionality and just require users precompute all extra features for their molecule and supply them in the initializer. This will allow us to accommodate "both" options by coercing one into the other

Mar 28 '24 18:03 davidegraff

Agreed, will do, and will add notebook showing how to do this.

Apr 01 '24 18:04 KnathanM

As this doesn't affect saved models (because it deals with featurization), I had this as lower priority. Models trained on v2.0 will still work on v2.1, just the data preprocessing would need to be changed slightly if someone is using mfs. We've run a bit out of time to have this done for v2.0, so we've changed the milestone.

Apr 12 '24 23:04 KnathanM

I now feel it is better to keep both options of pre-generating molecular features and having the Datapoint calculate them at initialization. A Chem.Mol object is needed to use a molecular featurizer, but some users (including the CLI) only work with SMILES, relying on the Datapoint to create the Chem.Mol object for them. If we don't have the option of Datapoint also creating the molecular features, then those users will need to create the mol objects first, which for the CLI means copying the code in .from_smi. Arguably its not that much code duplication for the CLI, but also module users who want molecule features will have to learn to use make_mol or its equivalent as an extra step, instead of being able to just use .from_smi.

Also begs the question of why the CLI allows for specifying a molecule featurizer instead of forcing users to first use a script to create and save their molecule features and then pass them in using --descriptors-path. I would have thought this was better, but four separate people have asked for the option to use --features_generator rdkit_2d_normalized (see #843 for a couple of them) so it appears users are use to just having Chemprop generate the molecule features for them.

(Technically descriptastorus generates theses rdkit molecule features, and it is another discussion about whether to bring that back as a dependency. I'm leaning towards not because it makes installation more complicated, but could be convinced.)

There is another question of whether Datapoint should accept atom and bond featurizer functions and generate those extra features at initialization as well. While basic users may want extra molecule features, I don't think basic users will want extra atom and bond featurizers so it is okay if those have to be pre-generated.

Lastly, I think we should remove the limitation of not allowing both pre-generated extra features and molecular featurizers. We can just concatenate them.

May 22 '24 14:05 KnathanM

I now feel it is better to keep both options of pre-generating molecular features and having the Datapoint calculate them at initialization. A Chem.Mol object is needed to use a molecular featurizer, but some users (including the CLI) only work with SMILES, relying on the Datapoint to create the Chem.Mol object for them. If we don't have the option of Datapoint also creating the molecular features, then those users will need to create the mol objects first, which for the CLI means copying the code in .from_smi. Arguably its not that much code duplication for the CLI, but also module users who want molecule features will have to learn to use make_mol or its equivalent as an extra step, instead of being able to just use .from_smi.

IMO from_smi() is a convenience method, so it's not that much to ask that users go through __init__() if they want full functionality. So yes, we would need to copy code from from_smi() into the CLI, but this isn't a big deal because it's only a couple of lines. This would also lead to a smaller footprint of the MoleculeDatapoint class.

Also begs the question of why the CLI allows for specifying a molecule featurizer instead of forcing users to first use a script to create and save their molecule features and then pass them in using --descriptors-path. I would have thought this was better, but four separate people have asked for the option to use --features_generator rdkit_2d_normalized (see #843 for a couple of them) so it appears users are use to just having Chemprop generate the molecule features for them.

(Technically descriptastorus generates theses rdkit molecule features, and it is another discussion about whether to bring that back as a dependency. I'm leaning towards not because it makes installation more complicated, but could be convinced.)

There is another question of whether Datapoint should accept atom and bond featurizer functions and generate those extra features at initialization as well. While basic users may want extra molecule features, I don't think basic users will want extra atom and bond featurizers so it is okay if those have to be pre-generated.

This is why I think we should eliminate the molecule_featurizer argument, because then it begs this followup question which really only serves to produce bloat in the codebase. When you add these in, there's 2 ways to get to the same input for each feature, and this is not a good thing. There should really only ever be 1 way to do something because we don't want users asking "what's the difference between X and Y?" when functionally there is no difference.

It's why I'm in favor of making it such that there's only one way to provide additional atom/bond/molecule features via explicitly providing them rather than generating them on-the-fly.

Lastly, I think we should remove the limitation of not allowing both pre-generated extra features and molecular featurizers. We can just concatenate them.

I agree with this general point. I never understood why this wasn't the case, but I kept it in from v1.

May 22 '24 14:05 davidegraff