PIDGINv3

[reproducibility] (Re-)generation of biosystems.txt and DisGeNET_diseases.txt

Open cthoyt opened this issue 4 years ago • 0 comments
The documentation says that this file was created from ChEMBL 24, PubChem, and DisGeNET. There have been several releases since with more data, which could improve the quality and utility of your models. However, it's not clear how these resource files were created. To assess the correctness of the work, it would also be necessary to show that the pipeline for getting the data is not only reproducible, but also makes sense. Seeing the code that does this would give insight into the special cases that might have been encountered and how they were handled. Those choices would make your data output differ from what somebody would produce by following your work as a guide without access to your code.

This should also apply to the two resources that you ask the user to download.

Caveat: While ChEMBL has versioned downloads, PubChem's rolling release only allows for the download of the most recent months/days. I'm not sure about DisGeNET. I know this might make it impossible to reproduce the generation of the exact datasets, which is why it's also good to have the dumps in this repo, so thanks for that.

cthoyt, Feb 08 '21 13:02