
Replace SMILES input with Coulomb matrix

Open jeffrey9909 opened this issue 7 years ago • 6 comments

I am working on changing the input from SMILES to Coulomb matrices. 200 Coulomb matrices (29×29) together with their HOMO-LUMO gaps have been produced and saved to an .h5 file with the following code:

# Saving in .h5 format
import h5py

h5f = h5py.File('processed.h5', 'w')
h5f.create_dataset('homo_lumo_gaps', data=homo_lumo_gaps)
h5f.create_dataset('padded_coulomb_matrices', data=padded_coulomb_matrices)
h5f.close()
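A quick way to see why train.py raises the KeyError is to list the top-level keys of the generated file. A minimal sketch, using dummy zero arrays in place of the real matrices and gaps:

```python
import h5py
import numpy as np

# Dummy stand-ins for the real 200 padded 29x29 matrices and their gaps.
padded_coulomb_matrices = np.zeros((200, 29, 29))
homo_lumo_gaps = np.zeros(200)

with h5py.File('processed.h5', 'w') as h5f:
    h5f.create_dataset('homo_lumo_gaps', data=homo_lumo_gaps)
    h5f.create_dataset('padded_coulomb_matrices', data=padded_coulomb_matrices)

# Listing the top-level keys shows why train.py's lookup of 'data_train' fails:
with h5py.File('processed.h5', 'r') as h5f:
    print(list(h5f.keys()))  # ['homo_lumo_gaps', 'padded_coulomb_matrices']
```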

When I try to run train.py directly with the generated processed.h5, it gives me this error message:

KeyError: "Unable to open object (Object 'data_train' doesn't exist)"

I think the problem is that the way I save the file differs from the original preprocess.py... but I don't understand the original idea and thus don't know how I should modify my code. The preprocess.py I am using is here:

https://docs.google.com/document/d/17f9n7tzeadpCo0_pit548QiU1-Loib2opMcm0I4MxzQ/edit?usp=sharing

Other than the "naming" problem I mentioned, I also want to know: will the NN work as I expect if I directly substitute Coulomb matrices for the SMILES strings? Is there any part of the code I will need to modify? I know this is not a good way to ask questions, but I really need some help. Any help is appreciated. Thank you.

jeffrey9909 avatar Jan 20 '17 09:01 jeffrey9909

The h5 file is expected to contain two datasets: "data_train" and "data_test".

You might want to do something like this, from the repo's preprocess.py:

train_idx, test_idx = map(np.array, train_test_split(structures.index, test_size = 0.20))

Then, using a chunking function we defined, we do:

    create_chunk_dataset(h5f, 'data_train', train_idx,
                         (len(train_idx), 120, len(charset)),
                         apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
                                                          structures[ch])))
    create_chunk_dataset(h5f, 'data_test', test_idx,
                         (len(test_idx), 120, len(charset)),
                         apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
                                                          structures[ch])))
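Roughly, create_chunk_dataset pre-allocates an h5 dataset and fills it one slab of indices at a time, so the whole encoded tensor never has to sit in memory at once. A simplified sketch of that pattern using plain h5py (this is an illustration of the idea, not the repo's actual implementation; the helper name and toy encoding below are mine):

```python
import h5py
import numpy as np

def write_in_chunks(h5f, name, idx, shape, apply_fn, chunk_size=50):
    """Pre-allocate a dataset, then fill it chunk by chunk.

    Simplified stand-in for the repo's create_chunk_dataset helper.
    """
    dset = h5f.create_dataset(name, shape)
    for start in range(0, len(idx), chunk_size):
        chunk = idx[start:start + chunk_size]
        # apply_fn encodes one chunk of indices into an array slab.
        dset[start:start + len(chunk)] = apply_fn(chunk)

# Toy usage: "encode" each chunk of indices as a block of ones.
idx = np.arange(200)
with h5py.File('chunked_demo.h5', 'w') as h5f:
    write_in_chunks(h5f, 'data_train', idx, (len(idx), 4, 4),
                    apply_fn=lambda ch: np.ones((len(ch), 4, 4)))
```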

pechersky avatar Jan 20 '17 14:01 pechersky

Thanks for your answer. Do you mean I should rename my coulomb_matrix to data_test and data_train (which would make them the same)? And then what should I do with my HOMO-LUMO gaps? I have tried what you mentioned, but I don't really understand this part

 train_idx, test_idx = map(np.array, train_test_split(structures.index, test_size = 0.20))

and this part

    create_chunk_dataset(h5f, 'data_train', train_idx,
                         (len(train_idx), 120, len(charset)),
                         apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
                                                          structures[ch])))
    create_chunk_dataset(h5f, 'data_test', test_idx,
                         (len(test_idx), 120, len(charset)),
                         apply_fn=lambda ch: np.array(map(one_hot_encoded_fn,
                                                          structures[ch])))

which is why I failed to rename my datasets.

jeffrey9909 avatar Jan 20 '17 15:01 jeffrey9909

What I mean is that you have to split your dataset of coulomb matrices into a train set and a test set. A helper function to do that is the train_test_split function from sklearn.model_selection. Then, you would do something like

train_data, test_data = train_test_split(padded_coulomb_matrices, test_size=0.20)
h5f.create_dataset('data_train', data=train_data)
h5f.create_dataset('data_test', data=test_data)
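As for the HOMO-LUMO gaps: passing both arrays to train_test_split in a single call splits them with the same shuffled indices, so matrix i and gap i stay paired across train and test. A minimal sketch with dummy data (the dataset names for the gaps are illustrative; train.py itself only reads data_train and data_test):

```python
import h5py
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the real 200 padded 29x29 matrices and their gaps.
padded_coulomb_matrices = np.random.rand(200, 29, 29)
homo_lumo_gaps = np.random.rand(200)

# One call splits both arrays with the same shuffled indices,
# keeping each matrix paired with its gap.
train_data, test_data, train_gaps, test_gaps = train_test_split(
    padded_coulomb_matrices, homo_lumo_gaps, test_size=0.20)

with h5py.File('processed.h5', 'w') as h5f:
    h5f.create_dataset('data_train', data=train_data)
    h5f.create_dataset('data_test', data=test_data)
    # Illustrative names for the targets:
    h5f.create_dataset('gaps_train', data=train_gaps)
    h5f.create_dataset('gaps_test', data=test_gaps)
```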

That will create an h5 file with your data split into train and test sets, as expected by train.py. However, be aware that train.py (and all the other scripts in the repo) use a particular network topology that probably won't work with the shape of your data. The model is defined at https://github.com/maxhodak/keras-molecules/blob/master/molecules/model.py. As you can see, the input tensors are defined with dimensions (max_length, len(charset)) = (120, 51), and each row is expected to be one-hot encoded.
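To make the expected shape concrete, here is a toy one-hot encoding of a padded SMILES string; the charset below is a made-up miniature (the real one is built by preprocess.py from the data and has ~51 characters):

```python
import numpy as np

max_length = 120
charset = [' ', 'C', 'O', '(', ')', '=', '1']  # toy charset, not the repo's

def one_hot_smiles(smiles, charset, max_length):
    """One row per character position, one column per charset symbol."""
    padded = smiles.ljust(max_length)  # pad with spaces to fixed length
    out = np.zeros((max_length, len(charset)))
    for i, c in enumerate(padded):
        out[i, charset.index(c)] = 1.0
    return out

x = one_hot_smiles('CC(=O)C', charset, max_length)
print(x.shape)  # (120, 7) -- the repo's model expects (120, 51)
```

Each row of x is a one-hot vector, which is why a real-valued 29×29 Coulomb matrix cannot simply be dropped in without changing the model's input layer.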

pechersky avatar Jan 20 '17 16:01 pechersky

Oh, I think I get it. Thank you for your help. I will try to fix it tomorrow. Thanks again.

jeffrey9909 avatar Jan 20 '17 16:01 jeffrey9909

Just a question: how was latent_dim determined to be 292? I am trying to modify the code by myself at the moment, but I have no idea about this... Thanks.

jeffrey9909 avatar Jan 24 '17 09:01 jeffrey9909

That was the latent dimension that was reported in the Gomez-Bombarelli paper that is referenced in the README.


pechersky avatar Jan 24 '17 16:01 pechersky