PIDGINv3
[reproducibility] (Re-)generation of biosystems.txt and DisGeNET_diseases.txt
The documentation says that these files were created from ChEMBL 24, PubChem, and DisGeNET. There have been several releases since then with more data, which could improve the quality and usefulness of your models. However, it's not clear how these resource files were created. To assess the correctness of the work, the data pipeline needs to be not only reproducible but also sensible. Seeing the code would show which special cases were encountered and how they were handled. Those are exactly the details that would make your output differ from what somebody would produce by following your work as a guide without access to your code.
This should also apply to the two resources that you ask the user to download.
Caveat: While ChEMBL has versioned downloads, PubChem's rolling release only offers downloads for the most recent months/days. I'm not sure about DisGeNET. I know this might make it impossible to reproduce the generation of the exact datasets, which is why it's also good to have the dumps in this repo, so thanks for that.
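If exact regeneration turns out to be impossible, recording checksums and the stated upstream versions next to the dumps would at least let others confirm they are working from the same files. Below is a minimal sketch of that idea, not a description of how the files were actually built. The function names, the manifest layout, and the per-file source labels are all my own assumptions; only the two file names and the three upstream resources come from this issue.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths, sources):
    """Map each file name to its size, checksum, and declared upstream sources.

    `sources` is a hypothetical dict of file name -> list of source labels;
    it is filled in by hand and not derived from the files themselves.
    """
    manifest = {}
    for path in map(Path, paths):
        manifest[path.name] = {
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
            "sources": sources.get(path.name, []),
        }
    return manifest


if __name__ == "__main__":
    # Placeholder labels: which upstream resource feeds which file is an
    # assumption here and should be replaced by the maintainers.
    declared_sources = {
        "biosystems.txt": ["<upstream resource and release/download date>"],
        "DisGeNET_diseases.txt": ["<upstream resource and release/download date>"],
    }
    present = [name for name in declared_sources if Path(name).is_file()]
    print(json.dumps(build_manifest(present, declared_sources), indent=2))
```

Committing the resulting JSON alongside the dumps would give each future regeneration something concrete to compare against.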