keras-molecules

add script to convert rdkit mols to networkx graphs with attributes

Open dakoner opened this issue 8 years ago • 7 comments

It's not clear to me that this code belongs here; I'm happy to make a new repo to hold this kind of code.

dakoner avatar Nov 13 '16 15:11 dakoner

BTW, with this script I went on to print all the 8-membered SMILES strings in GDB-13 (8.smi) along with their graph6 representations, then sorted by graph6 representation. Some graph6 strings are associated with only one 8-membered SMILES string in GDB-13, for example CC1C2C3CC(C3)N12 with GhCGjG (it has an interesting structure), while GhCGGC has 1406 associated SMILES strings, for example C=CC=CC=CC=C, which is polyacetylene(?)

Looking at 9-membered SMILES strings shows similar results.

Part of this is due to GDB-13 being a selective enumeration, so the counts for the various graphs are affected by GDB-13's aggressive filters during the graph enumeration phase.
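For anyone unfamiliar with graph6, here is a minimal sketch of the encoding using only the standard library. It is not the script from this PR: the function name `graph6` is made up for illustration, it only handles graphs with at most 62 vertices, and it ignores atom types and bond orders. Note that a graph6 string depends on how the vertices are numbered, so grouping molecules by graph6 string only groups isomorphic skeletons if the numbering is canonical or happens to agree.

```python
def graph6(n, edges):
    """Encode an undirected simple graph on vertices 0..n-1 as graph6.

    Hypothetical helper for illustration; short form only (n <= 62).
    """
    if not 0 <= n <= 62:
        raise ValueError("this sketch only supports 0 <= n <= 62")
    adjacent = {frozenset(e) for e in edges}
    # Upper triangle of the adjacency matrix, column by column:
    # (0,1), (0,2), (1,2), (0,3), (1,3), (2,3), ...
    bits = [
        1 if frozenset((i, j)) in adjacent else 0
        for j in range(1, n)
        for i in range(j)
    ]
    # Pad with zeros on the right to a multiple of 6 bits.
    bits += [0] * (-len(bits) % 6)
    # First character encodes n; each later character encodes 6 bits.
    chars = [chr(n + 63)]
    for k in range(0, len(bits), 6):
        value = 0
        for b in bits[k:k + 6]:
            value = (value << 1) | b
        chars.append(chr(value + 63))
    return "".join(chars)


# Skeleton of C=CC=CC=CC=C: an unbranched chain of 8 atoms, numbered in
# the order they appear in the SMILES string, with bond orders dropped.
chain_edges = [(i, i + 1) for i in range(7)]
print(graph6(8, chain_edges))  # GhCGGC
```

With that numbering the unbranched 8-atom chain comes out as GhCGGC, the string quoted above.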

dakoner avatar Nov 13 '16 20:11 dakoner

smilesparser.py looks pretty interesting. What direction are you thinking of taking this?

maxhodak avatar Nov 25 '16 07:11 maxhodak

I have several plans for the SMILES parser. It exists mainly because I couldn't find any other SMILES parser in Python that represented the SMILES as an AST I could easily traverse (rdkit converts it to a molecule, which is technically an AST for SMILES...).

Goals include:

  1. generative production of valid SMILES strings, including modifying existing molecules and extracting the graph structure
  2. exploring alternatives to the simple single-letter charset (for example, treating 'Br' and 'Cl' as single symbols), as well as expressing the tree structure directly
  3. using graph isomorphism tools to cluster similar structures
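To make the charset point concrete, here is a rough sketch of what a token-level charset could look like, assuming the goal is to treat multi-character SMILES symbols as single entries. It uses a plain regular expression rather than the parser itself, and the names `TOKEN` and `tokenize` are invented for this example.

```python
import re

# Alternatives are tried left to right, so the two-letter elements must
# come before the single-character fallback; otherwise "Cl" would be
# split into "C" and "l". Bracket atoms such as [NH4+] and two-digit
# ring closures such as %12 are also kept as single tokens.
TOKEN = re.compile(r"Br|Cl|\[[^\]]*\]|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens (illustrative, not validating)."""
    return TOKEN.findall(smiles)


tokens = tokenize("CC(Br)c1ccc(Cl)cc1")
charset = sorted(set(tokens))
print(tokens)
print(charset)
```

The string has 18 characters but only 16 tokens, because 'Br' and 'Cl' each count once. Joining the tokens back together reproduces the input, so the encoding stays reversible.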

dakoner avatar Nov 25 '16 15:11 dakoner

max, etc:

I got approved to release the smilesparser as Google Open Source code under the Apache 2 license. Woot.

https://github.com/google/smilesparser

You can see some examples of recursively iterating over a SMILES string's parsed structure:

dakoner avatar Dec 01 '16 21:12 dakoner

Sorry, the example for iteration is: https://github.com/google/smilesparser/blob/master/test_smilesparser_object.py

That code is a sufficient example to see how to identify terminals, and it should be pretty obvious how to parse out element names if you wanted to use those to define the charset (although at this point I don't think it really matters).

dakoner avatar Dec 01 '16 21:12 dakoner

@dakoner I'm a lurker on this repo, but very cool to see a good python SMILES parser. Do you think it would be tough to extend it to handle Reaction SMARTS? (for better support of chemical reactions)

rbharath avatar Dec 01 '16 23:12 rbharath

I assume it's straightforward.

This parser was constructed by taking an existing BNF grammar for SMILES and manually translating it to pyparsing. There is a simple transformation from most grammars to pyparsing; the only tricky parts involve recursively defined elements (see the pp.Forward() lines in smilesparser.py). Be sure to call pyparsing's validate() on your grammars to check that you don't have infinite recursion.

If there is a BNF for Reaction SMARTS (I couldn't find one), then you can just translate it the same way I did. I'm sure you could also write a parser from this page: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html although it would take a bit more effort than translating a BNF.
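To show where the recursion comes from: in a SMILES grammar a branch contains a chain, and a chain can contain branches, which is why a forward declaration is needed. Below is a minimal sketch of that shape, written as hand-rolled recursive descent with only the standard library rather than pyparsing, so it is not how smilesparser.py does it. It covers a toy subset only (organic-subset atoms, bracket atoms kept opaque, single-digit ring closures, simple bonds), and `parse_smiles` and the tuple tags are names I made up for the example.

```python
import re

ATOM = re.compile(r"Br|Cl|[BCNOPSFI]|[bcnops]|\[[^\]]+\]")
BOND = re.compile(r"[-=#:]")


def parse_smiles(text):
    """Parse a toy SMILES subset into a nested list (a simple AST)."""
    pos = 0

    def match(pattern):
        # Try to match at the current position; advance only on success.
        nonlocal pos
        m = pattern.match(text, pos)
        if m is None:
            return None
        pos = m.end()
        return m.group()

    def parse_atom():
        atom = match(ATOM)
        if atom is None:
            raise ValueError(f"expected an atom at position {pos}")
        return ("atom", atom)

    def parse_chain():
        # chain ::= atom ( branch | ring_digit | bond? atom )*
        nonlocal pos
        nodes = [parse_atom()]
        while pos < len(text) and text[pos] != ")":
            if text[pos] == "(":
                nodes.append(parse_branch())
            elif text[pos] in "0123456789":
                nodes.append(("ring", text[pos]))
                pos += 1
            else:
                bond = match(BOND)
                if bond is not None:
                    nodes.append(("bond", bond))
                nodes.append(parse_atom())
        return nodes

    def parse_branch():
        # branch ::= "(" bond? chain ")"  -- refers back to chain,
        # which is the recursive part of the grammar.
        nonlocal pos
        pos += 1  # consume "("
        nodes = []
        bond = match(BOND)
        if bond is not None:
            nodes.append(("bond", bond))
        nodes.extend(parse_chain())
        if pos >= len(text) or text[pos] != ")":
            raise ValueError(f"expected ')' at position {pos}")
        pos += 1  # consume ")"
        return ("branch", nodes)

    tree = parse_chain()
    if pos != len(text):
        raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
    return tree


print(parse_smiles("CC(=O)O"))
```

Each recursive call consumes at least one character before recursing again, which is the property that keeps a grammar like this from looping forever.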

dakoner avatar Dec 01 '16 23:12 dakoner