GLUE

Does LSI require all peaks for optimal results, especially when dealing with ultra-scale datasets?

Open YH-Zheng opened this issue 7 months ago • 5 comments

Hello, I currently have scATAC data with approximately 3.43 million cells and around 160,000 peaks. When I attempt LSI dimensionality reduction using all peaks, it takes an extremely long time (it ran for more than a day before I terminated it).

However, when I use the guidance graph to map highly variable genes from RNA to ATAC, which yields 15,868 highly variable peaks, LSI takes much less time and model training completes successfully. The final cell type transfer seems to work well, but when I visualize the merged ATAC and RNA data, the cell subtypes are not completely separated, unlike in the downsampled ATAC dataset. I wonder whether this is due to using only the highly variable peaks.
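For reference, here is a minimal sketch of what running LSI on a peak subset amounts to: TF-IDF weighting followed by a randomized truncated SVD, applied only to the selected peak columns. It uses generic SciPy and scikit-learn calls and is not GLUE's own LSI implementation; the function name `lsi_on_peak_subset`, the `peak_mask` argument, and the `1e4` scale factor are assumptions made for illustration.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.extmath import randomized_svd


def lsi_on_peak_subset(counts, peak_mask, n_components=100, seed=0):
    """Hypothetical helper: TF-IDF + randomized SVD on selected peaks.

    counts    : cells x peaks count matrix (dense or scipy sparse)
    peak_mask : boolean array marking the peaks to keep
    """
    # Keep only the selected peak columns, in sparse form.
    X = sp.csr_matrix(counts, dtype=np.float32)[:, peak_mask]

    # Term frequency: divide each cell (row) by its total count.
    row_sums = np.asarray(X.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0  # avoid division by zero for empty cells
    tf = sp.diags(1.0 / row_sums) @ X

    # Inverse document frequency: number of cells over per-peak totals.
    col_sums = np.asarray(X.sum(axis=0)).ravel()
    col_sums[col_sums == 0] = 1.0  # avoid division by zero for empty peaks
    idf = X.shape[0] / col_sums

    # Scale and log-transform the nonzero TF-IDF entries (assumed scale 1e4).
    tfidf = (tf @ sp.diags(idf)).tocsr()
    tfidf.data = np.log1p(tfidf.data * 1e4)

    # Randomized SVD avoids a full decomposition of the sparse matrix.
    U, S, _ = randomized_svd(tfidf, n_components=n_components, random_state=seed)
    return U * S  # cell embeddings, one row per cell
```

With this kind of setup, the cost of the SVD step grows with the number of nonzero entries in the selected columns, which is consistent with the subset of 15,868 peaks finishing much faster than all 160,000.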

As for training with RNA data, my dataset is also large, so I am currently using random downsampling. Do you have any suggestions for handling such ultra-scale datasets?

YH-Zheng, Nov 29 '23 16:11