deeplearning4j-docs

Tutorial on predicting chemical properties using SMILES

Open crockpotveggies opened this issue 6 years ago • 4 comments

Due Date

To be completed by: 2018-04-25

Description

@jrmerwin As discussed, this is a deep learning tutorial on how to predict a chemical property of a compound from its structure in SMILES format.

Note: not to be confused with smile-scala; SMILES is a line notation for chemical structures.

Assignees

Please ensure you have assigned at least one person to this issue. Include any authors and reviewers required.

crockpotveggies, Apr 12 '18 15:04

Between 50 million and 1 billion compounds in SMILES format: http://gdb.unibe.ch/downloads/

crockpotveggies, Apr 14 '18 18:04

20 million compounds, perhaps also useful: https://github.com/isayev/ANI1_dataset

crockpotveggies, Apr 15 '18 19:04

Did we consult @AlexDBlack on this? He likely has strong opinions about it, some war stories about the SMILES data, and maybe some code we can repurpose?

turambar, Apr 18 '18 21:04

Yeah, @crockpotveggies and I discussed this privately on Gitter. tl;dr: data in just about any format would work for a basic demo, as we can convert it to SMILES or whatever else is required. As for the net, I was suggesting perhaps an MLP on ECFP4 fingerprints, and maybe (depending on the task) also a bidirectional RNN directly on SMILES. I'm happy to do the ECFP4 conversion if required to produce a 'clean' dataset. If we go the RNN route, it might make sense to finally build a proper character sequence record reader in DataVec.

AlexDBlack, Apr 19 '18 00:04
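
For the RNN route mentioned above, the input step would be turning each SMILES string into a sequence of one-hot character vectors. The following is only a rough sketch of that idea in plain Java, not DataVec code and not the record reader proposed in the thread: the class name `SmilesCharEncoder` and its methods are hypothetical, and it assumes a purely character-level vocabulary, so two-letter atoms such as `Cl` or `Br` are split into two tokens.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper (not part of DataVec or DL4J): maps each character
// seen in a set of SMILES strings to an index and one-hot encodes strings.
class SmilesCharEncoder {
    private final Map<Character, Integer> charToIndex = new LinkedHashMap<>();

    // Build the vocabulary from all characters in the given SMILES strings.
    // Characters are sorted so the index assignment is deterministic.
    SmilesCharEncoder(List<String> smilesStrings) {
        TreeSet<Character> chars = new TreeSet<>();
        for (String s : smilesStrings) {
            for (char c : s.toCharArray()) {
                chars.add(c);
            }
        }
        int next = 0;
        for (char c : chars) {
            charToIndex.put(c, next++);
        }
    }

    int vocabSize() {
        return charToIndex.size();
    }

    // Index of a character in the vocabulary, or -1 if it was never seen.
    int indexOf(char c) {
        Integer idx = charToIndex.get(c);
        return idx == null ? -1 : idx;
    }

    // Encode one SMILES string as a [sequenceLength][vocabSize] one-hot matrix.
    float[][] encode(String smiles) {
        float[][] out = new float[smiles.length()][vocabSize()];
        for (int t = 0; t < smiles.length(); t++) {
            int idx = indexOf(smiles.charAt(t));
            if (idx < 0) {
                throw new IllegalArgumentException(
                        "Character not in vocabulary: " + smiles.charAt(t));
            }
            out[t][idx] = 1.0f;
        }
        return out;
    }
}
```

A real reader would also need padding or masking for variable-length sequences, and probably a tokenizer that understands multi-character atoms and bracket expressions; both are left out here to keep the sketch short.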