moleculenet
Two Submissions on Clearance
@rbharath @miaecle This PR is for two submissions (random forest + ECFP & GCN + GC) on Clearance.
Also, it seems that the dataset is small and the labels span a very wide range, e.g. 0.xx to 22. As a result, the RMSE values are pretty large. Please check whether this is expected. @peastman
Also, it seems that the dataset is small and the labels can have a very different scale, e.g. 0.xx to 22. As a result, the RMSE values are pretty large.
I'm not too familiar with this dataset. That does make sense. Perhaps a different metric would be more appropriate?
What's the source of the dataset? Has anyone used it before? An alternative metric could be R2.
Sorry for the slow response! I lost track of this PR in my inbox. It looks like we added the clearance dataset in https://github.com/deepchem/deepchem/pull/484, but for some reason it isn't listed among the original 17 datasets in MoleculeNet v1. @miaecle would you happen to remember why we didn't add clearance to the MoleculeNet v1 datasets?
As a couple of thoughts, perhaps we should log-transform the output? We do this for some regression tasks where the outputs span a large range. In that case, the RMSE on the logarithmic scale might be meaningful. Another option is swapping to R^2. I'm pretty open to either, given that we didn't include Clearance in v1, so this won't break any existing benchmark standard.
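To illustrate the log-transform idea, here is a minimal numpy sketch with made-up labels (not the actual Clearance data) spanning a range like the one described above. On the raw scale the RMSE is dominated by errors on the largest labels; after a log transform, relative errors on small and large labels contribute comparably:

```python
import numpy as np

# Hypothetical labels spanning a wide range (roughly 0.05 to 22),
# mimicking the scale issue described above. Not real Clearance values.
y_true = np.array([0.05, 0.3, 1.2, 5.0, 22.0])
y_pred = np.array([0.08, 0.25, 1.5, 4.0, 18.0])

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Raw-scale RMSE: the single large-label error (22 vs 18) dominates.
raw_rmse = rmse(y_true, y_pred)

# Log-scale RMSE: each prediction contributes by its relative error.
log_rmse = rmse(np.log(y_true), np.log(y_pred))

print(raw_rmse, log_rmse)
```

With these toy numbers the raw RMSE is well above 1 while the log-scale RMSE drops below 0.5, which is why the log-scale number is easier to interpret when labels vary over orders of magnitude.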
@mufeili @rbharath Sorry, I don't quite remember why/whether it was included in the initial version of the benchmark. As for metrics, I agree with Bharath on log-transforming. Depending on what the label distribution looks like, R2 could also suffer from outliers (assuming those with label ~22 are quite rare).
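The outlier concern can be seen with a small sketch (again on made-up labels, under the assumption above that labels near 22 are rare): a model that nails the one large label but is wrong on every small one can score a higher R2 than a model that fits the small labels well but misses the outlier, because the total sum of squares is dominated by the outlier's distance from the mean.

```python
import numpy as np

def r2(y_true, y_pred):
    # Standard coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Mostly small labels plus one rare large one, per the hypothesis above.
y_true = np.array([0.1, 0.2, 0.3, 0.4, 22.0])

# Model A: accurate on the small labels, badly misses the outlier.
y_pred_miss = np.array([0.12, 0.18, 0.33, 0.38, 10.0])

# Model B: wrong on every small label, but gets the outlier right.
y_pred_hit = np.array([1.0, 1.5, 2.0, 2.5, 22.0])

r2_miss = r2(y_true, y_pred_miss)
r2_hit = r2(y_true, y_pred_hit)
print(r2_miss, r2_hit)
```

Model B comes out with the higher R2 despite being wrong on four of the five points, which is exactly the failure mode being flagged.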