Gemma Deal with all single-cell assays missing values in DEA

Single-cell subsets that are all zero for a given gene should not be analyzed.

This is usually achieved by weeding out low variance genes, but in the case of single-cell, that filter does not work because of the variance introduced by the library size.

A possible fix would be to set those to NAs when aggregating.

Mar 21 '25 18:03 arteymix

I'm reopening this because the solution was not satisfactory. I think we need more internal discussion before going forward.

Mar 21 '25 21:03 arteymix

Just outlining the issue here.

A non-detected gene is one that is in the RNA-seq pipeline's reference, but for which there are no reads assigned (in some particular range of cells/samples).

There are multiple stages where non-detected genes might be filtered.

When we first import read data, prior to any cell type annotation or normalization, genes that have zero reads in every cell should be filtered out of all subsequent steps, possibly at import time. There is no value in carrying these zeros around.
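As a rough illustration of this import-time filter, here is a minimal sketch. The function name and the dict-of-lists layout are assumptions made for the example, not how Gemma stores counts.

```python
def drop_undetected_genes(counts):
    """Keep only genes with at least one read in at least one cell.

    counts: dict mapping gene id -> list of raw read counts, one per cell
    (hypothetical layout, chosen for illustration only).
    """
    return {gene: row for gene, row in counts.items() if any(c > 0 for c in row)}


raw = {
    "GeneA": [0, 3, 1],
    "GeneB": [0, 0, 0],  # never detected in any cell: dropped
    "GeneC": [0, 0, 7],  # detected in a single cell: kept
}
filtered = drop_undetected_genes(raw)
```

Only genes with no reads at all are removed here; a gene seen in a single cell is still carried forward.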

The next stage is after cell type annotation. There will be genes which are detected in only some cell types. This means that before the next steps, especially DEA, we have to deal with them. We must not do this at the subject level, but across all subjects. Any gene which is not expressed at all in a given cell type (again, considering all subjects) should be filtered out. Of course it is common to filter out genes that are "very low", not just zero, but the most important thing is to remove the non-detected genes.

Because of where we do the library size normalization in this process, it might require some changes to track which genes should be excluded based on counts. Log transformation turns zeros into another value, which is a problem. This will be the lowest value observed in the normalized data, but it would be better to base the filtering on the counts.
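One way to track this, sketched under assumptions: record which genes are detected in each cell type from the raw counts, before normalization, and carry that record alongside the normalized data. The nested-dict layout and the function name are hypothetical.

```python
def detected_genes_by_cell_type(pseudobulk_counts):
    """Return, for each cell type, the genes with at least one read
    when all subjects are considered together.

    pseudobulk_counts: {cell_type: {gene: [raw count per subject]}}
    (hypothetical layout, for illustration only).
    """
    return {
        cell_type: {gene for gene, row in genes.items() if any(c > 0 for c in row)}
        for cell_type, genes in pseudobulk_counts.items()
    }


counts = {
    "neuron": {"GeneA": [5, 0, 2], "GeneB": [0, 0, 0]},
    "astrocyte": {"GeneA": [0, 0, 0], "GeneB": [0, 4, 0]},
}
detected = detected_genes_by_cell_type(counts)
```

Because the decision is made on counts, it does not depend on what value a zero takes after log transformation.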

The core principle is that zero is not missing. A gene that is expressed in only one subject would be zero in the other subjects, not missing data.

DEA should eliminate genes that are "very low", or at least "zero variance", but it's better to leave genes in than remove them. This is why the genes I've always focused on are the ones that are truly not worth analyzing - and the obvious cases are the all-zero genes.

There should be no missing data in RNA-seq, just zeros, with the important exception where a sample lacks cells (or enough cells) of a given type to be able to make a pseudobulk, or is flagged as an outlier during QC. All the data for such samples would be considered missing (NA), for that cell type.
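A small sketch of that exception, assuming a hypothetical minimum-cell threshold (the value 10 is invented for the example) and using None to stand for NA:

```python
MIN_CELLS = 10  # hypothetical threshold, not a value taken from Gemma


def pseudobulk_or_na(summed_counts, n_cells, outlier=False, min_cells=MIN_CELLS):
    """Return the sample's pseudobulk for one cell type, or all-NA if the
    sample has too few cells of that type or was flagged as an outlier.

    summed_counts: {gene: count summed over the sample's cells of this type}
    """
    if outlier or n_cells < min_cells:
        return {gene: None for gene in summed_counts}  # whole sample is missing
    return dict(summed_counts)  # zeros stay zeros


enough_cells = pseudobulk_or_na({"GeneA": 0, "GeneB": 12}, n_cells=50)
too_few_cells = pseudobulk_or_na({"GeneA": 0, "GeneB": 1}, n_cells=3)
```

In the first sample the zero for GeneA is kept as a real observation; in the second, every gene is NA because no reliable pseudobulk exists.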

Mar 21 '25 21:03 ppavlidis

The issue with single-cell data as compared to bulk is that the variation of library size per cell type is much higher, so our low-variance filter does not exclude genes whose counts are identical (for example, all zero) in every assay.

A zero count is encoded in log2cpm as the log2 of the following ratio, scaled to counts per million:

(0.5 + 0) / (library size + 1)

If the library size is similar between assays, a zero gene will have a small variance and thus will be eliminated by our filter.

This is not happening for single-cell.
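To make the effect concrete, here is a small worked sketch. It assumes the common voom-style definition log2((count + 0.5) / (library size + 1) * 10^6); the library sizes are invented for illustration and are not taken from any dataset.

```python
import math
from statistics import pvariance


def log2cpm(count, library_size):
    # Assumed voom-style definition, with a 0.5 offset on the count
    # and a +1 offset on the library size.
    return math.log2((count + 0.5) / (library_size + 1) * 1e6)


# Invented library sizes: similar across bulk assays,
# spread over orders of magnitude across pseudo-bulks.
bulk_like = [20_000_000, 21_000_000, 19_500_000]
single_cell_like = [5_000, 200_000, 3_000_000]

# Variance of an all-zero gene after transformation.
var_bulk = pvariance([log2cpm(0, n) for n in bulk_like])
var_single_cell = pvariance([log2cpm(0, n) for n in single_cell_like])
```

With similar library sizes the transformed zeros are nearly identical, so the variance is close to zero. With widely varying library sizes the same all-zero gene gets a much larger variance and slips past a low-variance filter.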

The filtering needs to be aware of the cell type. If a gene is zero for all pseudo-bulks of a given cell type, we should fill it with NAs so that it doesn't end up analyzed for that particular subset.

My mistake in my first pass was to set an assay to NA if it has no cell for a given gene, but that should be considered like an actual zero and should be included in a DEA, especially if the other samples have expression for that gene.
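The intended behaviour could look something like the sketch below. The names and data layout are hypothetical, and None stands in for NA; it is meant only to show the distinction between a gene that is zero everywhere in a cell type and a gene that is zero in some samples.

```python
def mask_all_zero_genes(counts, values):
    """Mask genes that are zero in every pseudo-bulk of ONE cell type.

    counts: {gene: [raw count per pseudo-bulk]}
    values: {gene: [normalized value per pseudo-bulk]}, same shape as counts
    Returns a copy of values where all-zero genes are replaced by NA (None).
    """
    masked = {}
    for gene, row in values.items():
        if all(c == 0 for c in counts[gene]):
            masked[gene] = [None] * len(row)  # not detected in this cell type
        else:
            masked[gene] = list(row)  # individual zeros are real data
    return masked


counts = {"GeneA": [0, 8, 0], "GeneB": [0, 0, 0]}
values = {"GeneA": [-1.0, 3.2, -0.5], "GeneB": [-1.0, -2.1, -0.5]}
masked = mask_all_zero_genes(counts, values)
```

GeneA keeps all three values, including the two that come from zero counts, because one sample expresses it. GeneB is fully masked, so it is not analyzed for this cell type.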

Mar 22 '25 20:03 arteymix