deeplearning4j-docs
Tutorial on predicting chemical properties using SMILES
Due Date
To be completed by: 2018-04-25
Description
@jrmerwin As discussed, this issue covers a deep learning tutorial on predicting a chemical property of a compound from its composition expressed in SMILES format.
Note: SMILES here is the chemical structure notation (Simplified Molecular Input Line Entry System), not to be confused with smile-scala.
Assignees
Please ensure you have assigned at least one person to this issue. Include any authors and reviewers required.
Roughly 50 million to 1 billion compounds in SMILES format: http://gdb.unibe.ch/downloads/
20 million compounds, perhaps also useful: https://github.com/isayev/ANI1_dataset
Did we consult @AlexDBlack on this? He likely has strong opinions about it, some war stories about the SMILES data, and maybe some code we can repurpose?
Yeah, @crockpotveggies and I discussed this privately on Gitter. tl;dr: data in just about any format would work for a basic demo, as we can convert it to SMILES or whatever as required. As for the net, I was suggesting perhaps an MLP using ECFP4 fingerprints, and maybe (depending on the task) also a bidirectional RNN directly on the SMILES strings. I'm happy to do the ECFP4 conversion if required to produce a 'clean' dataset. If we go the RNN route, it might make sense to finally build a proper character sequence record reader in DataVec.
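For the RNN route, here is a rough sketch of what a character-level encoding of SMILES strings could look like. It is plain Java with no DataVec or DL4J calls, and it is not the proposed record reader: the class name `SmilesCharEncoder` and its methods are hypothetical, and the three example molecules (ethanol, benzene, acetic acid) are only illustrative. It also assumes one token per character, so two-letter atoms such as `Cl` or `Br` would be split into two steps; a real tutorial may want a smarter tokenizer.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: one-hot encodes SMILES strings character by character. */
public class SmilesCharEncoder {
    // Maps each distinct character to a column index, in order of first appearance.
    private final Map<Character, Integer> charToIndex = new LinkedHashMap<>();

    /** Builds the vocabulary from every character seen in the given SMILES strings. */
    public SmilesCharEncoder(List<String> smilesStrings) {
        for (String smiles : smilesStrings) {
            for (char c : smiles.toCharArray()) {
                charToIndex.putIfAbsent(c, charToIndex.size());
            }
        }
    }

    public int vocabularySize() {
        return charToIndex.size();
    }

    /** Returns the column index of a character, or throws if it is not in the vocabulary. */
    public int indexOf(char c) {
        Integer index = charToIndex.get(c);
        if (index == null) {
            throw new IllegalArgumentException("Character not in vocabulary: " + c);
        }
        return index;
    }

    /** Returns a [sequenceLength][vocabularySize] matrix with a single 1 in each row. */
    public float[][] encode(String smiles) {
        float[][] oneHot = new float[smiles.length()][vocabularySize()];
        for (int t = 0; t < smiles.length(); t++) {
            oneHot[t][indexOf(smiles.charAt(t))] = 1.0f;
        }
        return oneHot;
    }

    public static void main(String[] args) {
        // Illustrative corpus only: ethanol, benzene, acetic acid.
        List<String> corpus = Arrays.asList("CCO", "c1ccccc1", "CC(=O)O");
        SmilesCharEncoder encoder = new SmilesCharEncoder(corpus);
        float[][] encoded = encoder.encode("CCO");
        System.out.println("Vocabulary size: " + encoder.vocabularySize());
        System.out.println("Encoded shape: " + encoded.length + " x " + encoded[0].length);
    }
}
```

The output here is time-major (one row per character). DL4J recurrent layers generally expect a features-by-time layout per example, so a transpose plus padding or masking for variable-length strings would likely be needed before feeding this to a network.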