smarty
Discussion: Hierarchical approach as currently implemented poses problems for sub-types
The AlkEthOH test set reveals some "sampling problems" which we believe need to be resolved by some design modifications. Specifically, the hydrogen element contains multiple sub-types in AlkEthOH - most obviously, HC and HO, which are hydrogens attached to carbons and hydrogens attached to oxygens. However, there are additional sub-types H1, H2, and H3 that are sub-types of HC and differ in what other atoms are attached to the carbon. The tree looks something like this:
   hydrogen
    /    \
  HC      HO
  |
  H1
  |
  H2
  |
  H3
Currently, as we understand it, the scoring approach dictates that smarty can only discover the H1, H2, and H3 types if it first discovers HC (and we haven't seen them discovered yet regardless). This is because if smarty matches HO first, then, since all remaining types are covered by HC, discovering HC leads to no improvement in the score AND, worse, would leave no atoms typed by the generic hydrogen (so the move is automatically rejected, since each pattern is required to match at least some atoms).
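To make the rejection concrete, here is a minimal sketch of the rule as we understand it. It is a toy model, not smarty's actual code: each hydrogen is described only by the element it is bonded to, and the names assign_types and propose are made up for illustration.

```python
# Toy model (an assumption, not smarty's data structures): each hydrogen is
# represented only by the element it is bonded to.
# Ethanol (CH3-CH2-OH) has five C-bound hydrogens and one O-bound hydrogen.
HYDROGENS = ["C"] * 5 + ["O"]

def assign_types(typelist, atoms):
    """Last-one-wins typing: the last matching pattern in the list types the atom."""
    assigned = []
    for neighbor in atoms:
        winner = None
        for name, matches in typelist:
            if matches(neighbor):
                winner = name
        assigned.append(winner)
    return assigned

def propose(typelist, new_type, atoms):
    """Accept a new type only if every type in the list still types some atom."""
    candidate = typelist + [new_type]
    used = set(assign_types(candidate, atoms))
    unused = [name for name, _ in candidate if name not in used]
    if unused:
        return typelist, unused      # rejected: keep the old list
    return candidate, []             # accepted

generic = ("hydrogen", lambda nbr: True)
HO = ("HO", lambda nbr: nbr == "O")
HC = ("HC", lambda nbr: nbr == "C")

types, _ = propose([generic], HO, HYDROGENS)          # accepted
types_after, unused = propose(types, HC, HYDROGENS)   # rejected
print([name for name, _ in types_after], unused)
```

Once HO has been accepted, adding HC leaves the generic hydrogen with nothing to type, so the proposal is thrown away and nothing below HC can ever be reached from this state.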
It seems clear that in the limit of infinite sampling at finite temperature, one ought to be able to explore both branches and find that only the "discover HC first" branch leads to the best possible score. However, we think that: a) this is an inefficient way of exploring, b) we can easily make the "greedy" algorithm work better, and c) this will likely cause other problems in more complicated trees - i.e. if both branches extended downwards several levels, we would only be able to explore one or the other but not both (since the generic type is required to match at least some atoms).
We believe that at least two changes should be made to resolve this; I'll lay them out briefly here and then can follow up by creating issues on the individual items once we agree on the concept:
- Base atom types should be allowed to match no atoms and there should be no benefit to the score for them to match atoms; these base atom types should get special treatment.
- Indeed, Christopher would argue that base atom types OUGHT to end up matching no atoms HERE; these ought to be basically a catch-all for anything we can't recognize or don't know anything about yet
- This will help because then developing new specialized types HC and HO in the example above will result in score increases
Postponing discussion of the following point; need to clarify something with Bayly.
2) "Reward exploration of trees which contain more derivative atom types" (more details on scoring function proposal in a subsequent issue): There should be a score reward for creating a new sub-type which encompasses several derivatives of an existing type, i.e. we should prefer creation of the HC sub-type over the HO sub-type somewhat because HC removes several atom types at once from the generic hydrogen and thus is a more promising avenue for discovery of further sub-types.
Can you elaborate on the definition of H1, H2, and H3 types?
Ideally with SMARTS definitions?
@cbayly13 @bannanc - can you provide examples?
@jchodera - H1 corresponds to a hydrogen attached to a carbon attached to 1 electron-withdrawing group (its radius is smaller due to the electron-withdrawing group). H2 and H3 are similar, but the carbon has 2 or 3 electron-withdrawing groups, respectively.
@cbayly13 made SMARTS for all of the parameter types in the AlkEthOH set; here are the strings for the H1, H2, and H3 hydrogens.
H1: [$([#1]-C-[#7,#8,F,#16,Cl,Br])]
H2: [$([#1]-C(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]
H3: [$([#1]-C(-[#7,#8,F,#16,Cl,Br])(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])]
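As a rough illustration of what these three patterns encode, the sketch below counts electron-withdrawing neighbors on the carbon and applies the patterns in order with the last match winning. It is a plain-Python stand-in, not a SMARTS matcher: the function name hydrogen_type and the list-of-elements input are assumptions for this example, and bond orders and aromaticity are ignored.

```python
# Elements treated as electron-withdrawing in the SMARTS above:
# N (#7), O (#8), F, S (#16), Cl, Br.
EWG = {"N", "O", "F", "S", "Cl", "Br"}

def hydrogen_type(carbon_neighbors):
    """Type a C-bound hydrogen from the other elements on its carbon.

    carbon_neighbors lists the carbon's substituents, excluding the
    hydrogen being typed.
    """
    n_ewg = sum(1 for element in carbon_neighbors if element in EWG)
    # Each pattern matches "at least k" groups, mirroring the SMARTS;
    # applying them in order and keeping the last match separates H1/H2/H3.
    assigned = "HC"
    for name, threshold in [("H1", 1), ("H2", 2), ("H3", 3)]:
        if n_ewg >= threshold:
            assigned = name
    return assigned

print(hydrogen_type(["C", "H", "H"]))   # ethane CH3 hydrogen
print(hydrogen_type(["C", "H", "O"]))   # ethanol CH2 hydrogen
print(hydrogen_type(["O", "O", "H"]))   # CH2 between two oxygens
print(hydrogen_type(["O", "O", "O"]))   # CH bearing three oxygens
```

Note that the H1 string on its own also matches hydrogens that belong to H2 and H3; it is the ordering of the list that makes the three types exclusive.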
To clarify - we think there are really two problems with respect to these SPECIFIC types: first, we can't explore the HC branch of the tree at all if we go to HO first, and second, we don't think we're creating the right types of descriptors yet (which @cbayly13 is working on -- see #12)
Ah! OK.
It's important to note that there isn't a single unique hierarchical typing ruleset that will give these atom types. This could be represented many different ways.
A big question is whether, after defining
[#1]-C HC
[#1]-O HO
[$([#1]-C-[#7,#8,F,#16,Cl,Br])] H1
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])] H2
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])] H3
the parent type (hydrogen) still types any atoms. If it doesn't, the tree could be represented in many ways:
[#1] HC
[#1]-O HO
[$([#1]-C-[#7,#8,F,#16,Cl,Br])] H1
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])] H2
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])] H3
or
[#1] HO
[#1]-C HC
[$([#1]-C-[#7,#8,F,#16,Cl,Br])] H1
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])] H2
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])] H3
or
[#1] H1
[#1]-C HC
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])] H2
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])] H3
[#1]-O HO
or
[#1] H3
[#1]-C HC
[$([#1]-C-[#7,#8,F,#16,Cl,Br])] H1
[$([#1]-C(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])] H2
[#1]-O HO
or many, many other schemes that correctly type all the atoms in a last-one-wins manner.
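A quick way to check whether two orderings really type the atoms identically is to run both through a last-one-wins matcher. The sketch below does this for the explicit list and the first two alternatives above, using toy predicates in place of real SMARTS matching; the atom encoding and all function names are assumptions made for this example.

```python
def assign(typelist, atom):
    """Return the name of the last pattern in typelist that matches atom."""
    winner = None
    for name, matches in typelist:
        if matches(atom):
            winner = name
    return winner

# Toy hydrogens: (element the H is bonded to, electron-withdrawing groups on that carbon).
ATOMS = [("C", 0), ("C", 1), ("C", 2), ("C", 3), ("O", 0)]

def any_hydrogen(atom):      # stands in for [#1]
    return True

def on_carbon(atom):         # stands in for [#1]-C
    return atom[0] == "C"

def on_oxygen(atom):         # stands in for [#1]-O
    return atom[0] == "O"

def ewg_at_least(k):         # stands in for the H1/H2/H3 SMARTS
    return lambda atom: atom[0] == "C" and atom[1] >= k

SPECIFIC = [("H1", ewg_at_least(1)), ("H2", ewg_at_least(2)), ("H3", ewg_at_least(3))]

explicit = [("HC", on_carbon), ("HO", on_oxygen)] + SPECIFIC
generic_is_hc = [("HC", any_hydrogen), ("HO", on_oxygen)] + SPECIFIC
generic_is_ho = [("HO", any_hydrogen), ("HC", on_carbon)] + SPECIFIC

typings = [[assign(scheme, atom) for atom in ATOMS]
           for scheme in (explicit, generic_is_hc, generic_is_ho)]
print(typings[0])            # ['HC', 'H1', 'H2', 'H3', 'HO']
print(typings[0] == typings[1] == typings[2])
```

The same check can be run on any other proposed ordering before relying on it, since moving a broad pattern later in the list can override a more specific one.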
In light of this, I don't think the specific concern that you can only reach the desired typing along a single route is correct, but you may be right that we might be overly constraining exploration by requiring both child and parent to match some atoms.
If we relax this requirement, however, how do we prevent ridiculous elaboration of atom types that do not match atoms? Should we allow parents to match no atoms but require that children do? That might provide more "evolutionary fodder" for child atom types while not increasing model complexity.
> If we relax this requirement, however, how do we prevent ridiculous elaboration of atom types that do not match atoms? Should we allow parents to match no atoms but require that children do? That might provide more "evolutionary fodder" for child atom types while not increasing model complexity.
Yes, this is exactly what we want. Sorry, I should have been more explicit.
(I'm trying to translate two hours of whiteboard scribblings amongst four or five of us into GitHub issues and it doesn't always go well.)
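For reference, the relaxed rule being agreed on here could look something like the sketch below: a proposed child must type at least one atom, while its parent is allowed to end up empty. This is only a toy under stated assumptions (hydrogens encoded by their bonded element, made-up function names, and a hypothetical HN type used purely as a non-matching example), not a patch against smarty.

```python
from collections import Counter

# Toy data (an assumption): each hydrogen is represented by the element it is
# bonded to. Ethanol has five C-bound hydrogens and one O-bound hydrogen.
HYDROGENS = ["C"] * 5 + ["O"]

def count_types(typelist, atoms):
    """Count atoms per type under last-one-wins matching."""
    counts = Counter({name: 0 for name, _ in typelist})
    for atom in atoms:
        matching = [name for name, matches in typelist if matches(atom)]
        if matching:
            counts[matching[-1]] += 1
    return counts

def accept_child(typelist, child, atoms):
    """Relaxed rule: the new child must type an atom; parents may end up empty."""
    counts = count_types(typelist + [child], atoms)
    return counts[child[0]] > 0, counts

base = [("hydrogen", lambda a: True), ("HO", lambda a: a == "O")]

ok_hc, counts_hc = accept_child(base, ("HC", lambda a: a == "C"), HYDROGENS)
ok_hn, _ = accept_child(base, ("HN", lambda a: a == "N"), HYDROGENS)  # HN is hypothetical

print(ok_hc, dict(counts_hc))
print(ok_hn)
```

With this rule HC is accepted even though the generic hydrogen is left with zero atoms, while a child that matches nothing in the dataset is still rejected, which keeps the type list from filling up with unused patterns.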
That should be easy. Do you want me to implement that? (Will take two minutes.)
Here is my output for the rejection, which may help explain why the move was rejected. The AtomTyper classifies every H bound to carbon as hydrogen carbon-adjacent, so the hydrogen base type (hydrogen) no longer matches anything. The output shows the typing of the first molecule from AlkEthOH, just to have an example.
Attempting to create new subtype: '[#1]' (hydrogen) + '$(*~[#6])' (carbon-adjacent) -> '[#1&$(*~[#6])]' (hydrogen carbon-adjacent)
Computing type statistics
carbon
carbon
carbon
oxygen
oxygen total-h-count-1
oxygen total-h-count-1
hydrogen carbon-adjacent
hydrogen carbon-adjacent
hydrogen carbon-adjacent
hydrogen carbon-adjacent
hydrogen oxygen-adjacent
hydrogen oxygen-adjacent
Parent type '[#1]' (hydrogen) now unused in dataset; rejecting.
Rejected.
Computing type statistics
carbon
carbon
carbon
oxygen
oxygen total-h-count-1
oxygen total-h-count-1
hydrogen
hydrogen
hydrogen
hydrogen
hydrogen oxygen-adjacent
hydrogen oxygen-adjacent
> That should be easy. Do you want me to implement that? (Will take two minutes.)
@jchodera - if you want to do that, we'd be delighted. Though note this is something we can do, so we'd rather have the SMIRFF XML if given the choice. ;)
(i.e. if you let us do this, we'll get better at handling smarty on our own, even though it will take us longer than two minutes...)
How about you take a stab at it and ask if you run into trouble!
> How about you take a stab at it and ask if you run into trouble!
Perfect.
Just to clarify a question I left unanswered above:
> A big question is whether, after defining
> [#1]-C HC
> [#1]-O HO
> [$([#1]-C-[#7,#8,F,#16,Cl,Br])] H1
> [$([#1]-C(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])] H2
> [$([#1]-C(-[#7,#8,F,#16,Cl,Br])(-[#7,#8,F,#16,Cl,Br])-[#7,#8,F,#16,Cl,Br])] H3
> the parent type (hydrogen) still types any atoms. If it doesn't, the tree could be represented in many ways
The answer is "no". The parent type hydrogen does NOT still match anything in that case, so the tree as I drew it is correct.
I'll see if I can resolve this issue myself.