RiboDiff icon indicating copy to clipboard operation
RiboDiff copied to clipboard

Many genes with padj=NaN despite removing low-expressed genes

Open evaesquinas opened this issue 2 years ago • 0 comments

Dear Yi Zhong (@yizhong),

I am trying to use Ribodiff with 6 samples of RNA-seq and 6 samples of Ribo-Seq (unpaired data), with 30560 genes. When I run the tool for the first time (no filtering, just raw data), I got the following number of genes whose padj was:

  • NaN: 17260
  • Significant (<= 0.05): 16.

While I was asking myself why I got such large amount of genes with NaN, I found this issue and I decided to try to filter my data and remove the low expressed genes in 3 ways that I am going to explain later. In all the cases, after filtering, genes with NaN decreased but the number of significant genes did not increase, in fact, some were lost.

  • 1st attempt: rowSums(data)>=10
  • 2nd attempt: rowSums(data_RNAseq)>=10 & rowSums(data_RiboSeq)>=10 (getting the common genes that passed the filter)
  • 3rd attempt: remove any gene that has 0 --> data %>% filter_at(vars(-Entry), all_vars(. != 0))

The results were the following:

  • 1st attempt: rowSums(data)>=10 ○ Total number of genes: 15378 - pAdj = NaN: 2088 - pAdj <= 0.05: 18 □ From those 18: - 15 remained if we compare without filtering. - Values of columns have changed. - 3 new genes have appeared with this filtering. - 1 gene is lost if we compare the results without filtering.

  • 2nd attempt: rowSums(data_RNAseq)>=10 & rowSums(data_RiboSeq)>=10 (getting the common genes that passed the filter) ○ Total number of genes: 13346 - pAdj = NaN: 92 - pAdj <= 0.05: 17 □ From those 17: - 16 remained if we compare the previous filtering (rowSums > 10 for RNA + Ribo-Seq). - Values of columns have changed. - 1 new gene has appeared with this filtering. - 2 genes were lost from the previous filtering (rowSums > 10 for RNA + Ribo-Seq).

  • 3rd attempt: remove any gene that has 0 --> data %>% filter_at(vars(-Entry), all_vars(. != 0)) ○ Total number of genes: 11862 - pAdj = NaN: 11 - pAdj <= 0.05: 12. □ From those 12: - 12 remained from the previous filtering (rowSums > 10 RNA & rowSums > 10 Ribo-seq) - 5 were lost from the previous filtering (rowSums > 10 RNA & rowSums > 10 Ribo-seq)

Note that the data that I am using contains all the genes (protein coding, lncRNA, miRNAs, etc.). However, as it is recommended in the manual to use only protein coding genes ("Only including protein coding genes may give a better global estimation."), I have also run the tool just with those genes. (Just a parenthesis: Could you please notice me why it is recommended to use only protein coding genes?)

Using data that contains only protein coding genes.... the results that I wrote before are quite similar except that:

  • I do not lose genes from the previous filterings (if a gene is included in the previous filtering, in the next filtering will be included too)
  • I get new genes that they were not included before.
  • I end having not significant genes at all (with the 3rd attempt)

All these attemps let me to think that regardless of removing the low-expressed genes my results do not improve. In fact, with the last filter, I can lose all the significant genes that I had before (using only protein coding genes). And... it also makes me wonder if I can trust in those genes that I recover that were not significant before...

What is more, using the data from the 3rd attempt, I am getting the warnings that were published in the following issues: https://github.com/ratschlab/RiboDiff/issues/9 & https://github.com/ratschlab/RiboDiff/issues/7

And new ones that I have not seen in my previous attempts:

/home/usc/gr/eer/.local/lib/python2.7/site-packages/RiboDiff-0.2.1-py2.7.egg/ribodiff/adjdisp.py:58: RuntimeWarning: invalid value encountered in log
  logResidule = np.log(dispRaw[dispFittedIdx]) - np.log(dispFitted[dispFittedIdx])
/mnt/netapp1/Optcesga_FT2_RHEL7/2020/software/Compiler/gcccore/system/python/2.7.18/lib/python2.7/site-packages/numpy/lib/function_base.py:3405: RuntimeWarning: Invalid value encountered in median
  r = func(a, **kwargs)

Needless to say that I have followed the manual and the format of my data is just as it is required.

I am using Python 2.7.18 and:

  • numpy 1.16.6.
  • scipy 1.2.3
  • matplotlib 2.2.5
  • statsmodel 0.10.2

Could you send me some help or thoughts about what it could be happening, please?

Thanks very much in advance

Regards

evaesquinas avatar Jul 28 '23 16:07 evaesquinas