
Number of molecular descriptors obtained with PaDEL differs from the number of molecules in the molecule.smi file

Open sayalaruano opened this issue 3 years ago • 7 comments

Hello professor, I'm doing EDA and calculating molecular descriptors for the beta-lactamase dataset. I replaced duplicated values with their mean, as you suggested, and kept only the molecules that bind to beta-lactamase AmpC, which leaves a dataset of 62050 molecules. I then followed the instructions in the video description to calculate molecular descriptors with padelpy, but I obtained descriptors for only 5534 molecules, although my molecule.smi file has 62050 molecules. Do you know whether PaDEL restricts the number of molecules it can calculate descriptors for, or could this error come from my code? This GitHub repo contains my notebook and all files: https://github.com/sayalaruano/MidtermProject-MLZoomCamp. I also posted the same comment on the YouTube video of the challenge, just in case. Thanks in advance for your help.

sayalaruano avatar Oct 24 '21 14:10 sayalaruano

I obtained only 1412 rows myself, as can be seen here: https://github.com/wguesdon/beta-lactamase/blob/main/Data_Wrangling_and_EDA.ipynb. I wonder if we could apply the padelpy method row by row via a lambda function?

wguesdon avatar Oct 24 '21 15:10 wguesdon

I just found the solution to this error. The mistake was that I had kept some molecules with NaN in the canonical SMILES feature in my dataset, so PaDEL only calculated fingerprints for the molecules above the first NaN. Now I will try to calculate the 12 fingerprints for all molecules. I hope I can calculate all of them.

sayalaruano avatar Oct 24 '21 16:10 sayalaruano
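A quick way to spot this kind of truncation is to compare the number of lines in the input file with the number of rows in the descriptor output. The sketch below uses only the standard library and assumes the output is a CSV with a single header row; the function names and file paths are hypothetical and do not come from the thread or from padelpy.

```python
import csv


def count_smiles_lines(smi_path):
    """Count the non-blank lines in a .smi file (one molecule per line)."""
    with open(smi_path, encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def count_descriptor_rows(csv_path):
    """Count the data rows in a descriptor CSV, skipping the header row."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # assumption: the first row is a header
        return sum(1 for row in reader if row)


def missing_molecules(smi_path, csv_path):
    """Return how many input molecules have no row in the descriptor output."""
    return count_smiles_lines(smi_path) - count_descriptor_rows(csv_path)
```

A result greater than zero means some molecules from the input file never reached the output, which is the symptom described in this issue.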

Thank you for sharing, it must have been the same issue for me.

wguesdon avatar Oct 24 '21 16:10 wguesdon

You're welcome @wguesdon, this is the good part of these collaborative projects :)

sayalaruano avatar Oct 24 '21 16:10 sayalaruano

Hello sayalaruano,

I have the same problem. I obtained PubChem molecular descriptors for only 338 molecules, although my molecule.smi file has 64424 molecules.

semsem80 avatar Oct 29 '21 20:10 semsem80

Hello @semsem80, to solve this error, you need to delete the molecules with NaN in the canonical_smile feature before calculating the descriptors. I hope this is helpful; let me know if it works.

sayalaruano avatar Oct 30 '21 03:10 sayalaruano
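For readers who hit the same problem, the cleanup described above might look like the following minimal pandas sketch. It assumes the SMILES strings are in a column named `canonical_smiles` and the identifiers in `molecule_chembl_id`; both column names and the helper `write_smiles_file` are assumptions for illustration, so adjust them to your own dataframe.

```python
import pandas as pd


def write_smiles_file(df, smiles_col="canonical_smiles",
                      id_col="molecule_chembl_id", path="molecule.smi"):
    """Drop rows without a usable SMILES string, then write a tab-separated .smi file.

    The column names are assumptions; pass the ones used in your dataset.
    """
    # Remove rows where the SMILES value is NaN/None.
    clean = df.dropna(subset=[smiles_col])
    # Also remove rows where the SMILES value is an empty or whitespace-only string.
    clean = clean[clean[smiles_col].str.strip() != ""]
    # One molecule per line: SMILES, a tab, then the identifier; no header, no index.
    clean[[smiles_col, id_col]].to_csv(path, sep="\t", index=False, header=False)
    return clean
```

Comparing `len(df)` with the length of the returned dataframe shows how many molecules were dropped before running padelpy on the new `molecule.smi`.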

Hi @sayalaruano, your suggested solution worked, thank you for your help.

semsem80 avatar Oct 31 '21 21:10 semsem80