
sample: added --smiles and --save_json options

Open dribnet opened this issue 8 years ago • 7 comments

--smiles can be used to encode SMILES strings that are not in the dataset. The dataset is still used so that the charset one-hot encoding is done correctly.
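A rough sketch of the idea, assuming the preprocessed dataset stores its charset under a 'charset' key and that strings are space-padded to a fixed length (file path, key, and function names here are illustrative, not the PR's actual code):

```python
import h5py
import numpy as np

def one_hot_from_charset(smiles, charset, max_len=120):
    """One-hot encode a single SMILES string against a fixed charset,
    padding with spaces to max_len (the padding char is assumed to be in the charset)."""
    index = {c: i for i, c in enumerate(charset)}
    mat = np.zeros((max_len, len(charset)), dtype=np.float32)
    for pos, char in enumerate(smiles.ljust(max_len)):
        mat[pos, index[char]] = 1.0
    return mat

# read the charset that was stored alongside the preprocessed dataset
with h5py.File('data/processed.h5', 'r') as f:
    charset = [c.decode('utf-8') for c in f['charset'][:]]

x = one_hot_from_charset('c1ccccc1', charset)  # benzene -> shape (120, len(charset))
```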

--save_json was also added as an alternate output format

dribnet avatar Nov 07 '16 12:11 dribnet

I've been thinking about this PR and I'm not a big fan of the direction it takes us. Do we add parallel JSON handling next to HDF5 in all of the other scripts? Do we let the interfaces skew? This adds global-sounding flags that are only relevant for the encoder target.

I'd much rather keep HDF5 as the serialization scheme here (JSON becomes really problematic on larger datasets) and add a script for converting between HDF5 and JSON if needed.
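Such a converter could be very small; a sketch (the dataset key and file names are assumptions, not part of any existing script):

```python
import json
import h5py
import numpy as np

def h5_to_json(h5_path, json_path, key='latent_vectors'):
    """Dump a single HDF5 dataset to a JSON list-of-lists."""
    with h5py.File(h5_path, 'r') as f:
        data = np.asarray(f[key])
    with open(json_path, 'w') as out:
        json.dump(data.tolist(), out)

def json_to_h5(json_path, h5_path, key='latent_vectors'):
    """Load a JSON list-of-lists back into an HDF5 dataset."""
    with open(json_path) as inp:
        data = np.asarray(json.load(inp))
    with h5py.File(h5_path, 'w') as f:
        f.create_dataset(key, data=data)
```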

As for --smiles, I'd rather that read from stdin, which is a better non-data-file interface.
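Reading from stdin might look roughly like this (a minimal sketch, not the actual sample.py changes):

```python
import sys

def smiles_from_stdin():
    """Yield one SMILES string per non-empty line of standard input."""
    for line in sys.stdin:
        smiles = line.strip()
        if smiles:
            yield smiles

if __name__ == '__main__':
    for smiles in smiles_from_stdin():
        print(smiles)  # in practice: one-hot encode and feed to the encoder
```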

maxhodak avatar Nov 09 '16 08:11 maxhodak

Hi, sorry for the delay in responding. I'll remove --save_json and update --smiles to use stdin. Longer thoughts below.

The main contribution here is --smiles. It was slightly tricky to get this logic right, but I think it's valuable to be able to reuse a trained model on data not in the original training set, which means encoding the new SMILES string against an existing charset. I've done some experiments and it is working well for me with the provided model_500k.h5 trained model. I agree that using stdin would be a good alternative and can make this change.

The --save_json was just added as a small afterthought to make this library compatible with my own plat library. It's only a few lines of code and I thought it provided slightly cleaner output than the default savetxt. But I'm happy to remove it, since it's fairly trivial to take the savetxt output and transform it into this JSON form.
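For reference, that transformation really is only a couple of lines (file names are placeholders):

```python
import json
import numpy as np

# load the whitespace-delimited matrix written by numpy.savetxt
latent = np.loadtxt('encoded.txt')
with open('encoded.json', 'w') as out:
    json.dump(latent.tolist(), out)
```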

dribnet avatar Nov 11 '16 21:11 dribnet

I think it's valuable to be able to reuse a trained model on data not in the original training set, which means encoding the new SMILES string against an existing charset.

Yes, I definitely agree with this. I think the bigger problem here is that there's no reason to store the charset alongside whatever data is being operated on at any given time. I've been thinking for a while about pulling the charset out into a --charset flag, or even a required parameter, but it seemed like extra work. What do you think of that, instead of separately specifying new data?

cc @dakoner

maxhodak avatar Nov 11 '16 23:11 maxhodak

Yep, I'd want a --charset charset.h5 option.

This would work even with legacy files that have the 'charset' data in them (model, encoder output).
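A sketch of how that lookup might work, preferring an explicit --charset file but falling back to a legacy file that embeds its own 'charset' dataset (argument names are assumptions):

```python
import h5py

def load_charset(charset_path=None, legacy_path=None):
    """Read the charset from an explicit charset.h5 if given, otherwise
    fall back to a legacy data/model file that carries 'charset' itself."""
    path = charset_path or legacy_path
    with h5py.File(path, 'r') as f:
        return [c.decode('utf-8') for c in f['charset'][:]]
```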

dakoner avatar Nov 12 '16 00:11 dakoner

Sure, I think --charset is a great idea. I can introduce that when I refactor this pull request.

dribnet avatar Nov 12 '16 21:11 dribnet

@dribnet You might like to take a look at my PR #43, which hard-codes a charset and provides a helper object for encoding and decoding strings given that charset. We can refactor your --smiles flag to provide a file-like object that streams in SMILES strings, which get converted on the fly to one-hot matrices and then encoded/autoencoded.

Is this what you were thinking of / is that what is useful?
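Something along these lines, perhaps (a hedged sketch; SmilesEncoder here is a stand-in for whatever helper object #43 actually provides):

```python
import sys
import numpy as np

class SmilesEncoder(object):
    """Minimal stand-in helper: encodes SMILES strings to one-hot
    matrices of shape (max_len, len(charset)) for a fixed charset."""
    def __init__(self, charset, max_len=120):
        self.charset = charset
        self.max_len = max_len
        self.index = {c: i for i, c in enumerate(charset)}

    def encode(self, smiles):
        mat = np.zeros((self.max_len, len(self.charset)), dtype=np.float32)
        for pos, char in enumerate(smiles.ljust(self.max_len)):
            mat[pos, self.index[char]] = 1.0
        return mat

def stream_one_hot(encoder, lines=sys.stdin, batch_size=128):
    """Yield batches of one-hot matrices from a stream of SMILES lines,
    so new strings can be converted on the fly and fed to the encoder."""
    batch = []
    for line in lines:
        smiles = line.strip()
        if not smiles:
            continue
        batch.append(encoder.encode(smiles))
        if len(batch) == batch_size:
            yield np.stack(batch)
            batch = []
    if batch:
        yield np.stack(batch)
```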

pechersky avatar Nov 23 '16 17:11 pechersky

I've been a little out of this the last week; my GPU rig is down and I've been traveling, so I haven't been able to get to it to fix things. I've landed #43 though, so that should simplify the charset questions here.

maxhodak avatar Nov 25 '16 07:11 maxhodak