
Kipoi: accelerating the community exchange and reuse of predictive models for genomics


Advanced machine learning models applied to large-scale genomics datasets hold the promise to be major drivers for genome science. Once trained, such models can serve as a tool to probe the relationships between data modalities, including the effect of genetic variants on phenotype. However, lack of standardization and limited accessibility of trained models have hampered their impact in practice. To address this, we present Kipoi, a collaborative initiative to define standards and to foster reuse of trained models in genomics. Already, the Kipoi repository contains over 2,000 trained models that cover canonical prediction tasks in transcriptional and post-transcriptional gene regulation. The Kipoi model standard grants automated software installation and provides unified interfaces to apply and interpret models. We illustrate Kipoi through canonical use cases, including model benchmarking, transfer learning, variant effect prediction, and building new models from existing ones. By providing a unified framework to archive, share, access, use, and build on models developed by the community, Kipoi will foster the dissemination and use of machine learning models in genomics.

https://doi.org/10.1101/375345

Kipoi (not the paper) was previously mentioned in #837

evancofer opened this issue on Jul 26 '18 15:07

Goals

  • Develop a public repository for machine learning models for genomics.

Main points

  • Kipoi enables the standardization of workflows for developing, interpreting, and benchmarking machine learning models for genomics.
  • Kipoi provides key functionalities for machine learning models in genomics (a short usage sketch follows this list). In particular, it allows:
    • Developing new models and retraining/combining/transfer learning with existing models
    • Making predictions with trained models
    • Predicting the effects of genomic variants
    • Model analysis and interpretation
  • Kipoi uses continuous integration to continuously update performance benchmarks for models.
  • Kipoi workflows can use both R and Python code.
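
A minimal sketch of this unified interface in Python (a hedged illustration: the model name and file paths are placeholders, and the calls follow the Kipoi API as described in the paper and documentation):

import kipoi

# Browse the repository: returns a table of available models and their metadata
kipoi.list_models()

# Load a trained model by name; the underlying framework (Keras, PyTorch,
# scikit-learn, ...) is resolved automatically from the model description
model = kipoi.get_model("DeepSEA/predict")  # example model name

# Run the model end-to-end: the bundled dataloader turns raw input files
# into batches, and predictions come back as numpy arrays
preds = model.pipeline.predict({
    "intervals_file": "intervals.bed",  # placeholder input paths
    "fasta_file": "genome.fa",
})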

Pros

  • By providing a standardized API to manage and develop models produced by otherwise-disparate machine learning frameworks, Kipoi allows for centralized storage of these models.
  • Kipoi's unified API allows for existing models from different frameworks to be compared and used together without intimate knowledge of all of the underlying frameworks. This could reduce the knowledge barrier required to benchmark or compare models on new datasets.
  • Kipoi makes it easy to apply several widely-used algorithms (e.g. in silico mutagenesis, DeepLIFT, saliency maps) for interpreting and analyzing the features learned by trained models (a rough in silico mutagenesis sketch follows this list).
  • Although most of the models are deep learning models, Kipoi provides support for "shallow" models as well.
  • The data loader class is very flexible, and can be adapted to problems with very different data formats (e.g., transcription factor binding sites and chromosome contact maps).
  • The manuscript provides a detailed overview of the code and parameters needed to reproduce the paper, and includes a repository with their code here.
  • They provide extensive documentation and tutorials here.
  • In addition to illustrating basic use cases (e.g. training a model, benchmarking a trained model), they provide examples of more advanced usage scenarios:
    • transfer learning to predict chromatin accessibility.
    • combining multiple models to predict pathogenicity of splicing variants.
  • While there are some issues with Kipoi as a primary archive for published models (see below), it draws attention to the importance of standardized and indexed archival of machine learning models for genomics.
  • The authors mention that Kipoi may eventually be used to host Kaggle-like prediction challenges in genomics.
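
As a rough illustration of the in silico mutagenesis use case mentioned above, the sketch below scores every single-base substitution of a one-hot-encoded sequence using only the generic predict_on_batch interface. This is an assumption-laden sketch rather than Kipoi's own interpretation tooling: the model name is a placeholder and the expected input shape is model-specific (consult the model's input schema).

import numpy as np
import kipoi

model = kipoi.get_model("DeepSEA/predict")  # placeholder model name

def ism_scores(seq_onehot):
    """In silico mutagenesis: score every single-base substitution.

    seq_onehot: one-hot encoded sequence shaped as the model expects,
    here assumed to be (seq_len, 4). Returns a (seq_len, 4) array with
    the change in the first model output relative to the reference.
    """
    ref = np.ravel(model.predict_on_batch(seq_onehot[np.newaxis]))[0]
    scores = np.zeros(seq_onehot.shape)
    for pos in range(seq_onehot.shape[0]):
        for base in range(seq_onehot.shape[1]):
            mutated = seq_onehot.copy()
            mutated[pos, :] = 0
            mutated[pos, base] = 1
            alt = np.ravel(model.predict_on_batch(mutated[np.newaxis]))[0]
            scores[pos, base] = alt - ref  # effect on the first task
    return scores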

Cons

  • It is unclear how Kipoi addresses software dependency management challenges impeding widespread model distribution in ways not already addressed by the existing dependency management systems that it uses (e.g. Anaconda).
  • It is unknown if Kipoi can be used with popular hyperparameter optimization packages (e.g. hyperopt)
  • The authors currently use GitHub for storage and archival, which has the following limitations:
    • Commits can be easily purged, making it unsuitable as a primary archive for publication resources.
    • According to GitHub, "Git is not adequately designed to serve as a backup tool".
  • The author information for models from existing manuscripts is entered manually. This has led to some incorrect or incomplete author information (e.g., DeepCpG_DNA only includes one author, DeepSEA lists the corresponding author first instead of the first author). It seems like this could be fetched programmatically.

evancofer commented on Aug 01 '18 15:08

Hi Evan,

thanks for reviewing! Let me comment on the cons points.

It is unclear how Kipoi addresses software dependency management challenges impeding widespread model distribution in ways not already addressed by the existing dependency management systems that it uses (e.g. Anaconda).

The existing dependency management systems (e.g. pip and conda) need to be used properly, and their installations need to be continuously tested, to guarantee easy usage. This is addressed in two steps: (i) Kipoi requires the user to specify the required dependencies in a consistent manner (e.g. not as part of a free-text README file). (ii) Once per day and upon every pull request, Kipoi automatically installs the dependencies and runs model predictions, which makes sure that the specified dependencies indeed work as expected. Without these tests, an update of a particular package (if not pinned to a specific version) might break the code. Why don't we freeze all package versions for each model/dataloader? Doing so would prevent using multiple models in the same conda environment and would hence restrict the user.
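
For concreteness, here is a hedged sketch of what such a declarative dependency specification looks like and how it can be consumed programmatically (the field names follow the kipoi/models model.yaml schema to the best of my understanding, and the environment-creation command mentioned in the comment is only indicative):

import yaml

# Dependencies are declared as structured data in the model.yaml,
# not in a free-text README
model_yaml = yaml.safe_load("""
dependencies:
  conda:
    - python=3.6
    - numpy
  pip:
    - keras>=2.0.4
""")
print(model_yaml["dependencies"])

# The nightly CI builds a fresh environment from exactly this specification
# (roughly what "kipoi env create <model>" does) and then runs the model's
# example predictions, so a breaking upstream release is caught immediately.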

It is unknown if Kipoi can be used with popular hyperparameter optimization packages (e.g. hyperopt)

Hyperparameter optimization packages including Hyperopt require the user to specify an objective function returning the loss. In this objective function, arbitrary python code can be executed hence also a model can be loaded from Kipoi and fine-tuned on a new dataset. Here is a sketch of how I would implement transfer-learning using hyperopt via kopt (https://github.com/Avsecz/kopt, hyperopt wrapper for training Keras models):

import numpy as np
import kipoi
import keras.layers as kl
from keras.models import Model, Sequential
from keras.optimizers import Adam
from kopt import CompileFN, KMongoTrials, test_fn
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

def data():
    """Returns the whole training set. Generator/iterator support on my TODO
    """
    Dl = kipoi.get_dataloader_factory("my_model")
    train = Dl(fasta_file="path.fa", intervals_file="path.tsv").load_all()
    test = Dl(fasta_file="path.fa", intervals_file="test_path.tsv").load_all()
    return (train['inputs'], train['targets']), (test['inputs'], test['targets'])

def model(train, transfer_to, lr=0.001, base_model='Divergent421', tasks=1):
    # Load the pre-trained Kipoi model to transfer from
    kipoi_model = kipoi.get_model(base_model)
    # Transferred part: cut the pre-trained network at the `transfer_to` layer
    tmodel = Model(kipoi_model.model.inputs,
                   kipoi_model.model.get_layer(transfer_to).output)
    # New part: a fresh sigmoid output head for the new task(s)
    top_model = Sequential([kl.Dense(tasks,
                                     activation="sigmoid",
                                     input_shape=tmodel.output_shape[1:])])
    # Stack the transferred and new parts and compile
    final_model = Sequential([tmodel, top_model])
    final_model.compile(Adam(lr), "binary_crossentropy", ['acc'])
    return final_model

# Specify the optimization metrics
db_name="kipoi"
exp_name="model1"
objective = CompileFN(db_name, exp_name,
                      data_fn=data,
                      model_fn=model,
                      loss_metric="acc", # which metric to optimize for
                      loss_metric_mode="max",  # try to maximize the metric
                      valid_split=.2, # use 20% of the training data for the validation set
                      save_model='best', # checkpoint the best model
                      save_results=True, # save the results as .json (in addition to mongoDB)
                      save_dir="./saved_models/")  # place to store the models

# define the hyper-parameter ranges
# see https://github.com/hyperopt/hyperopt/wiki/FMin for more info
hyper_params = {
    "data": {},
    "model": {
        "lr": hp.loguniform("m_lr", np.log(1e-4), np.log(1e-2)),  # 0.0001 - 0.01
        "transfer_to": hp.choice("m_tt", ("dense_1", "dense_2")),  # Transfer different numbers of layers
        "base_model": "Divergent421",
    },
    "fit": {
        "epochs": 20
    }
}

# test model training, on a small subset for one epoch
test_fn(objective, hyper_params)

# run hyper-parameter optimization
trials = KMongoTrials(db_name, exp_name,
                      ip="localhost",
                      port=22334)
best = fmin(objective, hyper_params, trials=trials, algo=tpe.suggest, max_evals=100)

The authors currently use GitHub for storage and archival, which has the following limitations:

  • Commits can be easily purged, making it unsuitable as a primary archive for publication resources
  • According to GitHub, "Git is not adequately designed to serve as a backup tool"

We disabled the option to force push, hence commits can't be purged. We are not using GitHub for backups in the canonical sense (e.g. backing up a PC disk), for which they list specifically designed solutions like CrashPlan in the article. However, I agree that Git LFS has storage limits and we should consider other alternatives in the future. Regarding archiving, one idea would be to deposit a model to Zenodo and then make a pull request with a link to the kipoi/models repository (started an issue here). This would make the model directly citable with a DOI link.

The author information for models from existing manuscripts is entered manually. This has led to some incorrect or incomplete author information (e.g., DeepCpG_DNA only includes one author, DeepSEA lists the corresponding author first instead of the first author). It seems like this could be fetched programmatically.

I totally agree with you. We should be consistent and include all authors from the manuscript as the model authors (DeepCpG_DNA). DeepSEA correctly lists the first author first in the YAML file; however, I just noticed a bug that may swap the author list when summarizing the authors from multiple models (set usage in https://github.com/kipoi/kipoi/blob/master/kipoi/sources.py#L170-L171). I've just made a PR fixing this. Programmatic author fetching could be an interesting solution or validation. Do you know a tool / web API that, given a doi link, returns the manuscript information including the author names?

Avsecz commented on Aug 02 '18 16:08

Thank you for the response. I agree that Zenodo is probably a good long-term solution for archival. I think that Crossref can be used for retrieving manuscript information, and I assume that there are a number of publisher-specific APIs to look up manuscript information for a given DOI. I believe that doi2bib does this pretty well, and they have their web app code in a public GitHub repo.
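
For example, the Crossref REST API already exposes this: given a DOI, https://api.crossref.org/works/<doi> returns the record metadata, including the author list. A small sketch, using the Kipoi preprint DOI purely as an example:

import requests

doi = "10.1101/375345"  # Kipoi preprint, as an example
record = requests.get("https://api.crossref.org/works/" + doi).json()["message"]
authors = [(a.get("given", "") + " " + a.get("family", "")).strip()
           for a in record.get("author", [])]
print(record.get("title"), authors)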

evancofer commented on Aug 02 '18 17:08

Do you know a tool / web API that, given a doi link, returns the manuscript information including the author names?

@Avsecz I recommend checking out @dhimmel's Manubot Python package for this. It was originally developed as part of the collaborative writing platform that we used to write this review manuscript (deep review). Now it is a standalone package that can take a DOI, arXiv id, PubMed id, PubMed Central id, or URL and return structured citation information, including authors.
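
A hedged sketch of driving it from Python via its command-line interface (my assumption of the interface: the "manubot cite" subcommand accepts a DOI citation key and prints CSL JSON; field names may differ between versions):

import json
import subprocess

# "manubot cite" resolves the DOI and prints structured CSL JSON metadata
out = subprocess.run(["manubot", "cite", "doi:10.1101/375345"],
                     check=True, capture_output=True, text=True).stdout
csl_items = json.loads(out)
print([(a.get("given", "") + " " + a.get("family", "")).strip()
       for a in csl_items[0].get("author", [])])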

agitter commented on Aug 02 '18 18:08