
Generative adversarial networks simulate gene expression and predict perturbations in single cells

Open pierremac opened this issue 5 years ago • 3 comments

https://doi.org/10.1101/262501

Recent advances have enabled gene expression profiling of single cells at lower cost. As more data is produced there is an increasing need to integrate diverse datasets and better analyse underutilised data to gain biological insights. However, analysis of single cell RNA-seq data is challenging due to biological and technical noise which not only varies between laboratories but also between batches. Here for the first time, we apply a new generative deep learning approach called Generative Adversarial Networks (GAN) to biological data. We apply GANs to epidermal, neural and hematopoietic scRNA-seq data spanning different labs and experimental protocols. We show that it is possible to integrate diverse scRNA-seq datasets and in doing so, our generative model is able to simulate realistic scRNA-seq data that covers the full diversity of cell types. In contrast to many machine-learning approaches, we are able to interpret internal parameters in a biologically meaningful manner. Using our generative model we are able to obtain a universal representation of epidermal differentiation and use this to predict the effect of cell state perturbations on gene expression at high time-resolution. We show that our trained neural networks identify biological state-determining genes and through analysis of these networks we can obtain inferred gene regulatory relationships. Finally, we use internal GAN learned features to perform dimensionality reduction. In combination these attributes provide a powerful framework to progress the analysis of scRNA-seq data beyond exploratory analysis of cell clusters and towards integration of multiple datasets regardless of origin.

pierremac avatar Sep 06 '18 10:09 pierremac

Summary

Authors use a GAN on scRNA-seq data (epidermal, neural and hematopoietic cells from different experiments) and show that the simulated cells overlap with the real ones in a t-SNE visualization. Then, they use the latent representation to simulate cellular perturbations (i.e., basal to differentiated cells in their experiments) by interpolating between two points in the latent space and generating the corresponding cells. They perform a sensitivity analysis on the discriminator network and use the result to identify the marker genes that are relevant to the GAN representation, noting that some of the identified genes are markers already known in the literature. They also use the weights in the last layer of the generator network to study the linear dependencies (correlations) expressed in the GAN representation, highlighting some biologically relevant dependencies that are not unraveled by classical expression analysis methods. Finally, they also use the features learnt in the (single) hidden layer of the critic network as a dimensionality reduction technique and show that, surprisingly, those features seem invariant to batch effects while still preserving interesting biological properties (for instance, different cell types cluster separately in a t-SNE representation computed on top of those "GAN critic features").
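The perturbation-simulation idea above (interpolating between two latent codes and decoding the intermediate points) can be sketched as follows. The `toy_generator` here is a hypothetical stand-in for the paper's trained network, and the two latent codes are placeholders; only the interpolation logic and the latent dimension of 100 come from the paper.

```python
import numpy as np

def interpolate_latent(z_basal, z_diff, n_steps=10):
    """Linearly interpolate between two latent codes (e.g. basal -> differentiated)."""
    alphas = np.linspace(0.0, 1.0, n_steps)
    return np.array([(1 - a) * z_basal + a * z_diff for a in alphas])

# Hypothetical stand-in for the trained generator: maps latent codes to expression.
def toy_generator(z):
    # ReLU of a fixed linear map; the output dimension (5 genes) is illustrative only.
    return np.maximum(z @ np.ones((z.shape[-1], 5)), 0.0)

z_a = np.zeros(100)  # latent code of a basal-like cell (latent dim 100, as in the paper)
z_b = np.ones(100)   # latent code of a differentiated-like cell
path = interpolate_latent(z_a, z_b, n_steps=8)
cells = toy_generator(path)  # one simulated expression profile per interpolation step
```

Each row of `cells` corresponds to one intermediate "time point" along the simulated differentiation trajectory.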

Computational Methods

In the paper, they describe a Wasserstein GAN (with a gradient penalty term to enforce the Lipschitz constraint) built from fully connected networks with a single hidden layer (600 neurons for the generator, 200 for the critic) and a latent space of dimension 100. They use Leaky ReLU activation functions (with slope 0.2). Interestingly, they use an additive mixture of a Gaussian and a Poisson distribution for the latent noise. They optimize the GAN with RMSProp and a batch size of 32. However, they also provide a link to a Git repo that does not match those specifications (it's a classic GAN with 2 hidden layers in the generator, using a modified version of Adam to stabilize training). I think that implementation is outdated and has not been updated to match this version of their paper.
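The latent-noise choice described above (an additive Gaussian + Poisson mixture) can be sketched like this. The Poisson rate is an assumed value for illustration, since the paper's exact parameter is not restated here; the latent dimension of 100 and batch size of 32 are from the paper.

```python
import numpy as np

def sample_latent(batch_size, latent_dim=100, poisson_rate=1.0, rng=None):
    """Additive mixture of Gaussian and Poisson noise for the GAN latent space.

    poisson_rate is an assumed value, for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    gaussian = rng.standard_normal((batch_size, latent_dim))
    poisson = rng.poisson(lam=poisson_rate, size=(batch_size, latent_dim))
    return gaussian + poisson

z = sample_latent(32, latent_dim=100)  # one batch of latent codes (batch size 32)
```

The Poisson component makes the latent distribution discrete-valued and right-skewed, which is presumably the motivation given the count-like nature of expression data.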

Personal take on the paper

I'm a bit critical of some claims in the paper. For instance, they claim in the abstract that "In contrast to many machine-learning approaches, we are able to interpret internal parameters in a biologically meaningful manner". To my understanding, this refers to their sensitivity analysis, which indeed gives some interesting and relevant insights, but it is also a bit rough and is not guaranteed to capture the most important features. I also don't agree with their claim that the reason their simulated cells don't represent the full variability of the real cells is that the generator uses continuous inputs while the gene expression distributions are discrete. It is probably a minor detail, but it suggests there might be some over-statements in the manuscript. I very much like the part about differentiation. However, it also contains what I think is the main technical weakness of the paper: it is not possible to directly map a cell to a point in the latent space with a GAN. To overcome this limitation, they randomly simulate cells until they find one that is similar enough to the cell they wanted to map (and then use the corresponding latent code as the map). Overall, though, it is a very interesting and stimulating paper in my opinion.
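The mapping workaround described above (sampling latent codes until a generated cell is close enough to the target) amounts to a nearest-candidate search. A minimal sketch, in which everything, including the toy linear generator, distance metric, and candidate count, is an illustrative assumption rather than the paper's exact procedure:

```python
import numpy as np

def map_cell_to_latent(target, generator, latent_dim=100, n_candidates=5000, rng=None):
    """Approximate GAN 'inversion' by sampling: return the latent code whose
    generated cell lies closest (Euclidean distance) to the target profile."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal((n_candidates, latent_dim))
    fakes = generator(z)                       # (n_candidates, n_genes)
    dists = np.linalg.norm(fakes - target, axis=1)
    best = int(np.argmin(dists))
    return z[best], dists[best]

# Toy linear "generator" standing in for the trained network (illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal((100, 20))
generator = lambda z: z @ W

target = generator(rng.standard_normal(100))   # a "real" cell to map back
z_hat, dist = map_cell_to_latent(target, generator, rng=rng)
```

The weakness is visible in the sketch itself: the returned code is only as good as the best random candidate, with no guarantee of a faithful inverse in a 100-dimensional latent space.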

pierremac avatar Sep 06 '18 10:09 pierremac

Thanks for the contribution @pierremac. I edited the original post to use the DOI link https://doi.org/10.1101/262501, which makes it easier for us to add citations with Manubot.

agitter avatar Sep 06 '18 11:09 agitter

Alright, let me know if some other adjustments are required!

pierremac avatar Sep 06 '18 11:09 pierremac