
A deep learning framework for imputing missing values in genomic data

Open evancofer opened this issue 5 years ago • 0 comments

Motivation: Missing values are a frequent problem in genomic data analysis. Missing data can be an obstacle to downstream analyses that require complete data matrices. State-of-the-art imputation techniques, including Singular Value Decomposition (SVD) and K-Nearest Neighbors (KNN) based methods, usually achieve good performance but are computationally expensive, especially for large datasets such as those involved in pan-cancer analysis.

Results: This study describes a new method, a denoising autoencoder with partial loss (DAPL), as a deep-learning-based alternative for data imputation. Results on pan-cancer gene expression data and DNA methylation data from over 11,000 samples demonstrate significant improvement over a standard denoising autoencoder, both for missing-at-random cases with a range of missing percentages and for missing-not-at-random cases based on expression level and GC-content. We discuss the advantages of DAPL over traditional imputation methods and show that it achieves comparable or better performance with less computational burden.

Availability: https://github.com/gevaertlab/DAPL

https://doi.org/10.1101/406066
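
For readers skimming this issue, here is a minimal NumPy sketch of what a "partial loss" could look like. The abstract does not define the term, so this assumes it means reconstruction error computed only over entries that were actually observed. The function names `partial_mse` and `corrupt` are hypothetical and are not taken from the DAPL repository.

```python
import numpy as np


def partial_mse(x, x_hat, observed):
    """Mean squared error over observed entries only.

    Assumed form of a 'partial loss'; not taken from the DAPL code.
    """
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    n_obs = observed.sum()
    if n_obs == 0:
        raise ValueError("no observed entries to score")
    # np.where keeps NaN placeholders at missing positions out of the sum.
    sq_err = np.where(observed, (x - x_hat) ** 2, 0.0)
    return float(sq_err.sum() / n_obs)


def corrupt(x, observed, drop_frac, rng):
    """Denoising-style input: hide a random fraction of observed entries.

    Missing and hidden entries are filled with 0.0 (an assumption).
    """
    x = np.asarray(x, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    drop = observed & (rng.random(x.shape) < drop_frac)
    return np.where(observed & ~drop, x, 0.0)


# Toy matrix (samples x genes) with one missing value.
x = np.array([[1.0, np.nan], [3.0, 4.0]])
observed = ~np.isnan(x)
x_in = corrupt(x, observed, drop_frac=0.5, rng=np.random.default_rng(0))
x_hat = np.array([[1.5, 9.0], [3.0, 2.0]])  # stand-in for a model's output
loss = partial_mse(x, x_hat, observed)
```

In this toy case the reconstruction at the missing position (9.0) does not affect the loss at all, which is the point of restricting the error to observed entries. An actual autoencoder would take `x_in` as input and be trained to minimize this quantity.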

evancofer, Sep 06 '18 13:09