latent-gan
Encoding new unseen molecules
Hi. When trying to create 512-dimensional vector representations of some new molecules (which the encoder may not have seen during training), I get the following error:
Traceback (most recent call last):
File "encode.py", line 56, in
I am using the pretrained ChEMBL encoder. Any ideas about how to resolve this? Thanks
Did you find a solution to this?
Because the README explicitly mentions that the token length limit is 128, I decided to use SmilesVectorizer from molvecgen. I removed all SMILES whose token vector is longer than that limit. Suppose your data frame is called data in the example below.
import pandas as pd
from rdkit import Chem
from tqdm import tqdm
from molvecgen import SmilesVectorizer

TOKEN_LENGTH_LIMIT = 128
remove = []

# Collect the indices of all SMILES whose token vector exceeds the limit
for index, row in tqdm(data.iterrows(), total=len(data)):
    mol = Chem.MolFromSmiles(row.SMILES)
    sm_en = SmilesVectorizer(canonical=True, augment=False)
    sm_en.fit([mol], extra_chars=["\\"])
    if sm_en.maxlength > TOKEN_LENGTH_LIMIT:
        remove.append(index)

print(
    f"There are {len(remove)} SMILES with a token length larger than {TOKEN_LENGTH_LIMIT}"
)

# Drop the offending rows and write the cleaned data set
data.drop(remove, inplace=True)
data.to_csv("preprocessed.csv", index=False, header=False)
And now it worked.
Alternatively, if too many molecules are discarded because their token length exceeds 128, you could retrain the autoencoder with a larger token length limit.
Good luck.
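As an aside, if you only need a quick token count to pre-filter your SMILES (without fitting a vectorizer per molecule), a regex-based tokenizer can approximate the length. This is just a sketch using a commonly used SMILES tokenization pattern; it may not match SmilesVectorizer's exact scheme, so treat the counts as an approximation and keep a safety margin below the 128 limit.

```python
import re

# Commonly used SMILES tokenization regex: multi-character tokens such as
# Cl, Br, bracket atoms like [nH], and ring closures like %10 count as one token.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def token_length(smiles: str) -> int:
    """Approximate number of tokens in a SMILES string."""
    return len(SMILES_TOKEN_RE.findall(smiles))

def filter_by_token_length(smiles_list, limit=128):
    """Keep only SMILES whose approximate token length is within the limit."""
    return [s for s in smiles_list if token_length(s) <= limit]

print(token_length("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Filtering on this count is much faster than constructing an RDKit mol and fitting a vectorizer for every row, which matters for large data sets.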