
Discussion: Hierarchical approach as currently implemented poses problems for sub-types

Open davidlmobley opened this issue 8 years ago • 15 comments

The AlkEthOH test set reveals some "sampling problems" which we believe need to be resolved by some design modifications.

Specifically, the hydrogen element contains multiple sub-types in AlkEthOH, most obviously HC and HO, which are hydrogens attached to carbon and hydrogens attached to oxygen, respectively. However, there are additional types H1, H2, and H3 that are sub-types of HC, distinguished by which other atoms are attached to that carbon. The tree looks something like this:

         hydrogen
        /         \
      HC           HO
       |
      H1
       | 
      H2
       | 
      H3

Currently, as we understand it, the scoring approach means smarty can only discover the H1, H2, and H3 types if it first discovers HC (and we haven't seen them discovered yet regardless). If smarty matches HO first, then all remaining hydrogens are covered by HC, so discovering HC yields no improvement to the score AND, worse, leaves no atoms typed by the generic hydrogen. The HC proposal is therefore automatically rejected, since each pattern is required to match at least some atoms.
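The rejection described above can be reproduced with a small sketch. This is a toy model, not smarty's actual code: each hydrogen is reduced to the element it is bonded to, plain Python predicates stand in for SMARTS patterns, and the names `assign_types` and `propose_child` are hypothetical. The molecule (four carbon-bound and two oxygen-bound hydrogens) is also an assumption made for illustration.

```python
from collections import Counter

# Toy hydrogens, each described only by the element it is bonded to.
# (Assumption: four C-bound and two O-bound hydrogens.)
hydrogens = ["C", "C", "C", "C", "O", "O"]


def assign_types(types, atoms):
    """Count atoms per type; the last matching entry in `types` wins."""
    counts = Counter({name: 0 for name, _ in types})
    for atom in atoms:
        winner = None
        for name, matches in types:
            if matches(atom):
                winner = name
        counts[winner] += 1
    return counts


def propose_child(types, child, atoms):
    """Accept `child` only if every type, parent included, still matches an atom."""
    trial = types + [child]
    counts = assign_types(trial, atoms)
    unused = [name for name, _ in trial if counts[name] == 0]
    if unused:
        return types, False
    return trial, True


generic = ("hydrogen", lambda a: True)
HC = ("HC", lambda a: a == "C")
HO = ("HO", lambda a: a == "O")

types, ok_ho = propose_child([generic], HO, hydrogens)
types, ok_hc = propose_child(types, HC, hydrogens)
print(ok_ho, ok_hc)  # True False
```

Once HO is accepted, adding HC would leave the generic hydrogen with zero atoms, so the proposal is rejected and the HC branch is never opened.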

It seems clear that in the limit of infinite sampling at finite temperature, one ought to be able to explore both branches and find that only the "discover HC first" branch leads to the best possible score. However, we think that: a) this is an inefficient way of exploring; b) we can easily make the "greedy" algorithm work better; and c) this will likely cause other problems in more complicated trees, i.e. if both branches extended downwards several levels, we would only be able to explore one or the other but not both (since the generic type is required to match at least some atoms).

We believe that at least two changes should be made to resolve this; I'll lay them out briefly here and then can follow up by creating issues on the individual items once we agree on the concept:

  1. Base atom types should be allowed to match no atoms and there should be no benefit to the score for them to match atoms; these base atom types should get special treatment.
  • Indeed, Christopher would argue that base atom types OUGHT to end up matching no atoms HERE; these ought to be basically a catch-all for anything we can't recognize or don't know anything about yet
  • This will help because then developing new specialized types HC and HO in the example above will result in score increases

(Postpone discussion of the following point; need to clarify something with Bayly.)

  2. "Reward exploration of trees which contain more derivative atom types" (more details on scoring function proposal in a subsequent issue): There should be a score reward for creating a new sub-type which encompasses several derivatives of an existing type, i.e. we should somewhat prefer creation of the HC sub-type over the HO sub-type, because HC removes several atom types at once from the generic hydrogen and thus is a more promising avenue for discovery of further sub-types.

davidlmobley avatar Jun 20 '16 23:06 davidlmobley

Can you elaborate on the definition of H1, H2, and H3 types?

jchodera avatar Jun 20 '16 23:06 jchodera

Ideally with SMARTS definitions?

jchodera avatar Jun 20 '16 23:06 jchodera

@cbayly13 @bannanc - can you provide examples?

davidlmobley avatar Jun 20 '16 23:06 davidlmobley

@jchodera - H1 corresponds to a hydrogen attached to a carbon that is attached to 1 electron-withdrawing group (its radius is smaller due to the electron-withdrawing group). H2 and H3 are similar, but the carbon has 2 or 3 electron-withdrawing groups, respectively.

@cbayly13 made SMARTS for all of the parameter types in the AlkEthOH set; here are the strings for the H1, H2, and H3 hydrogens.

H1: [$([#1]-C-[#7,#8,F,#16,Cl,Br])]
H2: [$([#1]-C(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]
H3: [$([#1]-C(-[#7,#8,F,#16,Cl,Br])(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]
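As a rough illustration of these definitions, the sketch below counts electron-withdrawing neighbors on the carbon that carries the hydrogen. It is a toy model, not a SMARTS engine: a carbon is represented only by the element symbols of its other three substituents, bond orders are ignored, and `EWG` and `hydrogen_patterns` are hypothetical names. Note that the patterns nest: the H1 string also matches any hydrogen that satisfies H2 or H3, since a carbon with two withdrawing groups certainly has one.

```python
# Elements listed in the SMARTS strings above: N, O, F, S, Cl, Br.
EWG = {"N", "O", "F", "S", "Cl", "Br"}


def hydrogen_patterns(carbon_substituents):
    """Return which of H1/H2/H3 would match a hydrogen on this carbon."""
    n_ewg = sum(1 for element in carbon_substituents if element in EWG)
    thresholds = (("H1", 1), ("H2", 2), ("H3", 3))
    return [name for name, needed in thresholds if n_ewg >= needed]


print(hydrogen_patterns(["C", "H", "H"]))  # []
print(hydrogen_patterns(["O", "H", "H"]))  # ['H1']
print(hydrogen_patterns(["O", "O", "C"]))  # ['H1', 'H2']
```

Because several patterns can match the same hydrogen, the order in which they are applied decides which type it finally receives.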

bannanc avatar Jun 21 '16 00:06 bannanc

To clarify - we think there are really two problems with respect to these SPECIFIC types: first, we can't explore the HC branch of the tree at all if we go to HO first, and second, we don't think we're creating the right types of descriptors yet (which @cbayly13 is working on -- see #12).

davidlmobley avatar Jun 21 '16 00:06 davidlmobley

Ah! OK.

It's important to note that there isn't a single unique hierarchical typing ruleset that will give these atom types. This could be represented many different ways.

A big question is whether, after defining

[#1]-C         HC
[#1]-O         HO
[$([#1]-C-[#7,#8,F,#16,Cl,Br])]                                                   H1
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]                             H2
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]       H3

the parent type (hydrogen) still types any atoms. If it doesn't, the tree could be represented in many ways:

[#1]           HC
[#1]-O         HO
[$([#1]-C-[#7,#8,F,#16,Cl,Br])]                                                   H1
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]                             H2
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]       H3

or

[#1]           HO
[#1]-C         HC
[$([#1]-C-[#7,#8,F,#16,Cl,Br])]                                                   H1
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]                             H2
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]       H3

or

[#1]           H1
[#1]-C         HC
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]                             H2
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]       H3
[#1]-O         HO

or

[#1]           H3
[#1]-C         HC
[$([#1]-C-[#7,#8,F,#16,Cl,Br])]                                                   H1
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]                             H2
[#1]-O         HO

or many, many other schemes that correctly type all the atoms in a last-one-wins manner.
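The equivalence of such orderings can be checked with a toy last-one-wins typer. The sketch below is an assumption-laden illustration: each hydrogen is a (bonded element, number of electron-withdrawing groups on that neighbor) pair, predicates stand in for the SMARTS strings, the sample atoms are invented, and `type_atoms` and `on_c` are hypothetical names. It covers only the first three rulesets listed above.

```python
# Toy hydrogens: (element bonded to, electron-withdrawing groups on that atom).
# Values are illustrative, not taken from AlkEthOH.
hydrogens = [("C", 0), ("C", 1), ("C", 2), ("C", 3), ("O", 0)]


def on_c(min_ewg):
    """Predicate: hydrogen on a carbon bearing at least `min_ewg` such groups."""
    return lambda h: h[0] == "C" and h[1] >= min_ewg


def any_h(h):
    return True


def on_o(h):
    return h[0] == "O"


def type_atoms(rules, atoms):
    """Assign each atom the name of the last rule that matches it."""
    result = []
    for atom in atoms:
        name = None
        for rule_name, matches in rules:
            if matches(atom):
                name = rule_name
        result.append(name)
    return result


subtypes = [("H1", on_c(1)), ("H2", on_c(2)), ("H3", on_c(3))]
explicit = [("HC", on_c(0)), ("HO", on_o)] + subtypes    # [#1]-C and [#1]-O
generic_hc = [("HC", any_h), ("HO", on_o)] + subtypes    # [#1] labeled HC
generic_ho = [("HO", any_h), ("HC", on_c(0))] + subtypes  # [#1] labeled HO

typings = [type_atoms(r, hydrogens) for r in (explicit, generic_hc, generic_ho)]
print(typings[0])
print(typings[0] == typings[1] == typings[2])
```

The last two rulesets are not checked here. Under strict last-one-wins ordering, the `[#1]-C` line would appear to override the generic entry for every carbon-bound hydrogen, so they may need a more specific HC pattern to behave as intended.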

In light of this, I don't think the specific concern that you can only reach the desired typing along a single route is correct, but you may be right that we are overly constraining exploration by requiring both child and parent to match some atoms.

If we relax this requirement, however, how do we prevent ridiculous elaboration of atom types that do not match atoms? Should we allow parents to not match types but require children do? That might provide more "evolutionary fodder" for child atom types while not increasing model complexity.
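One way to read this suggestion is as an acceptance test in which only childless types must match atoms. The sketch below is a hypothetical illustration, not smarty's implementation: `counts` holds how many atoms each type currently wins, `parent_of` maps each type to its parent, and both names (along with `proposal_allowed`) are invented here.

```python
def proposal_allowed(counts, parent_of):
    """Allow empty parents, but reject any childless (leaf) type with no atoms."""
    parents = set(parent_of.values())
    for name, n_atoms in counts.items():
        is_leaf = name not in parents
        if is_leaf and n_atoms == 0:
            return False
    return True


parent_of = {"HO": "hydrogen", "HC": "hydrogen"}

# Generic hydrogen is empty, but its children match atoms: allowed.
print(proposal_allowed({"hydrogen": 0, "HO": 2, "HC": 4}, parent_of))  # True

# A new leaf type that matches nothing only adds complexity: rejected.
parent_of_h1 = dict(parent_of, H1="HC")
counts_h1 = {"hydrogen": 0, "HO": 2, "HC": 4, "H1": 0}
print(proposal_allowed(counts_h1, parent_of_h1))  # False
```

Under this reading, an empty parent costs nothing as long as something useful hangs beneath it, while empty leaves are still pruned.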

jchodera avatar Jun 21 '16 00:06 jchodera

> If we relax this requirement, however, how do we prevent ridiculous elaboration of atom types that do not match atoms? Should we allow parents to not match types but require children do? That might provide more "evolutionary fodder" for child atom types while not increasing model complexity.

Yes, this is exactly what we want. Sorry, I should have been more explicit.

davidlmobley avatar Jun 21 '16 00:06 davidlmobley

(I'm trying to translate two hours of whiteboard scribblings amongst four or five of us into GitHub issues and it doesn't always go well.)

davidlmobley avatar Jun 21 '16 00:06 davidlmobley

That should be easy. Do you want me to implement that? (Will take two minutes.)

jchodera avatar Jun 21 '16 00:06 jchodera

Here is my output for the rejection; it may help explain why the subtype was rejected. The AtomTyper classifies all the H bound to carbon as "hydrogen carbon-adjacent", so the hydrogen base type (hydrogen) no longer matches anything. The output shows the first molecule from AlkEthOH, just as an example.

Attempting to create new subtype: '[#1]' (hydrogen) + '$(*~[#6])' (carbon-adjacent) -> '[#1&$(*~[#6])]' (hydrogen carbon-adjacent)
Computing type statistics
carbon
carbon
carbon
oxygen
oxygen total-h-count-1
oxygen total-h-count-1
hydrogen carbon-adjacent
hydrogen carbon-adjacent
hydrogen carbon-adjacent
hydrogen carbon-adjacent
hydrogen oxygen-adjacent
hydrogen oxygen-adjacent
Parent type '[#1]' (hydrogen) now unused in dataset; rejecting.
Rejected.
Computing type statistics
carbon
carbon
carbon
oxygen
oxygen total-h-count-1
oxygen total-h-count-1
hydrogen
hydrogen
hydrogen
hydrogen
hydrogen oxygen-adjacent
hydrogen oxygen-adjacent

camizanette avatar Jun 21 '16 00:06 camizanette

> That should be easy. Do you want me to implement that? (Will take two minutes.)

@jchodera - if you want to do that, we'd be delighted. Though note this is something we can do, so we'd rather have the SMIRFF XML if given the choice. ;)

davidlmobley avatar Jun 21 '16 00:06 davidlmobley

(i.e. if you let us do this, we'll get better at handling smarty on our own, even though it will take us longer than two minutes...)

davidlmobley avatar Jun 21 '16 00:06 davidlmobley

How about you take a stab at it and ask if you run into trouble!

jchodera avatar Jun 21 '16 00:06 jchodera

> How about you take a stab at it and ask if you run into trouble!

Perfect.

davidlmobley avatar Jun 21 '16 00:06 davidlmobley

Just to clarify a question I left unanswered above:

> A big question is whether, after defining
>
> [#1]-C         HC
> [#1]-O         HO
> [$([#1]-C-[#7,#8,F,#16,Cl,Br])]                                                   H1
> [$([#1]-C(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]                             H2
> [$([#1]-C(-[#7,#8,F,#16,Cl,Br])(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]       H3
>
> the parent type (hydrogen) still types any atoms. If it doesn't, the tree could be represented in many ways

The answer is "no". The parent type hydrogen does NOT still match anything in that case, so the tree as I drew it is correct.

I'll see if I can resolve this issue myself.

davidlmobley avatar Jun 22 '16 20:06 davidlmobley