Recommendations for differential testing with sparse data and different sample sizes

Open fairliereese opened this issue 4 years ago • 3 comments

Hi,

I am trying to quantify the difference in intron retention in a dataset I have that's comprised of single cells and single nuclei. I am currently treating each cell or nucleus as its own sample. I have 158 nuclei and 112 cells.

I am getting an error that it seems some others are also getting:

ERROR:__main__:Unknown error: (<class 'UnboundLocalError'>, UnboundLocalError("local variable 'i' referenced before assignment"), <traceback object at 0x2b9444247b40>)

My call is

suppa.py diffSplice \
    -m emperical \
    -p nuc.psi cell.psi \
    -e nuc_tpm.tsv cell_tpm.tsv \
    -i c2c12_RI_strict.ioe\
    -gc \
    -o diff

I haven't been able to quite pin down my problem by looking at the other issues here.

#102 suggests it's a mismatch between sample IDs across the .psi files and the TPM files. Comparing the header lines of the files that I pass to diffSplice shows this is not the case:

$ head -1 cell.psi > temp1
$ head -1 cell_tpm.tsv > temp2
$ diff temp1 temp2
$ head -1 nuc.psi > temp1
$ head -1 nuc_tpm.tsv > temp2
$ diff temp1 temp2
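The same header check can be scripted so that it reports where two files disagree rather than only whether they do. This is a minimal sketch, assuming the first line of each .psi and TPM file is a tab-separated list of sample IDs; the function names, file names, and sample IDs below are made up for illustration.

```python
import os
import tempfile


def read_header(path):
    """Return the sample IDs from the first line of a tab-separated file."""
    with open(path) as handle:
        return handle.readline().rstrip("\n").split("\t")


def header_mismatches(psi_path, tpm_path):
    """List (position, psi_id, tpm_id) wherever the two headers disagree."""
    psi_ids = read_header(psi_path)
    tpm_ids = read_header(tpm_path)
    if len(psi_ids) != len(tpm_ids):
        return [("length", len(psi_ids), len(tpm_ids))]
    return [(i, p, t) for i, (p, t) in enumerate(zip(psi_ids, tpm_ids)) if p != t]


# Demo on throwaway files with hypothetical sample IDs.
with tempfile.TemporaryDirectory() as tmp:
    def write(name, text):
        path = os.path.join(tmp, name)
        with open(path, "w") as handle:
            handle.write(text)
        return path

    psi = write("demo.psi", "cell_1\tcell_2\tcell_3\nevent_A\t0.1\tnan\t0.4\n")
    tpm_ok = write("demo_ok.tsv", "cell_1\tcell_2\tcell_3\ntx_1\t5.0\t0.0\t2.5\n")
    tpm_bad = write("demo_bad.tsv", "cell_1\tcell_3\tcell_2\ntx_1\t5.0\t2.5\t0.0\n")

    matching = header_mismatches(psi, tpm_ok)   # headers agree
    swapped = header_mismatches(psi, tpm_bad)   # two columns are swapped

print(matching)
print(swapped)
```

An empty list plays the same role as the empty `diff` output above: the sample IDs match in both content and order.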

I also see in #64 two potential problems with my data that may be incompatible with SUPPA.

Firstly, apparently SUPPA may not support having different numbers of samples per condition (though this doesn't seem to be explicitly stated). Since I have 158 nuclei and 112 cells, this could be a problem for me. Does SUPPA truly not work on conditions with a different number of samples?

Secondly, it seems like the user in #64 had issues with mostly zeroes in the output. As I'm working with an extremely sparse dataset, I am worried that this is a problem for me. Could this be the source of my issues?

I am running Python 3.7.6.

Thanks, Fairlie

fairliereese • Dec 29 '20 02:12

Hi Fairlie,

Thanks for your email, and for already investigating some of the potential issues.

I think the problem is the sparse data. That error is common when SUPPA cannot build the two empirical sets to perform the differential splicing test.
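For readers unfamiliar with this kind of error, here is a generic illustration of how an UnboundLocalError on a loop variable can arise in Python when a loop runs over an empty collection. This is not SUPPA's code; the function names are hypothetical, and the link to sparse input (nothing left to iterate over after filtering) is an assumption consistent with the explanation above.

```python
def last_index(items):
    """Return the index of the last item; fails if items is empty."""
    for i, _ in enumerate(items):
        pass
    # If the loop body never ran, 'i' was never assigned.
    return i


def describe(items):
    """Return the last index, or the name of the exception that was raised."""
    try:
        return last_index(items)
    except UnboundLocalError as err:
        return type(err).__name__


with_values = describe([0.2, 0.5])   # loop runs, 'i' is assigned
without_values = describe([])        # loop never runs, 'i' is unbound

print(with_values)
print(without_values)
```

The exact wording of the exception message depends on the Python version, so the sketch reports only the exception type.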

There are various alternative options.

You could use -m classical instead, which runs a Mann-Whitney test between your two groups of samples. It is also the more advisable choice when working with so many samples per condition.
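To make the test concrete, here is a minimal standard-library sketch of a two-sided Mann-Whitney U test on the PSI values of one event, using the normal approximation with a tie correction. It is an illustration only: SUPPA's own implementation may differ (for example in tie handling or how the p-value is computed), the function names are hypothetical, and NaN values are assumed to have been removed beforehand. Note that the two groups do not need to be the same size.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based ranks start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U for group_a, two-sided p-value) via the normal approximation."""
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    pooled = list(group_a) + list(group_b)
    ranks = average_ranks(pooled)
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    # Variance of U with the standard correction for tied values.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:  # every value identical: no evidence of a difference
        return u_a, 1.0
    z = (u_a - mean_u) / math.sqrt(var_u)
    return u_a, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical PSI values for one event in two groups of unequal size.
nuclei_psi = [0.1, 0.2, 0.3]
cells_psi = [0.6, 0.7, 0.8, 0.9]
u_stat, p_value = mann_whitney_u(nuclei_psi, cells_psi)
print(u_stat, p_value)
```

With very sparse data many values are tied (for example at 0 or 1), which is why the sketch includes the tie correction; no continuity correction is applied.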

Although the matrix is sparse, you can still allow NaNs. The -nan option controls how many are allowed per row (see https://github.com/comprna/SUPPA#command-and-options-1). Setting it is important here, because otherwise the default may discard most of your events and return no cases.
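Before choosing a value for -nan, it can help to see how many events would survive a given tolerance. The following is a rough pre-check, not SUPPA's implementation: it assumes the PSI values for one condition have already been parsed into a dictionary of event ID to list of floats, and the names and numbers are made up for illustration. How SUPPA itself interprets the threshold should be confirmed in its documentation.

```python
import math


def nan_fraction(values):
    """Fraction of entries in one event row that are NaN."""
    return sum(1 for v in values if math.isnan(v)) / len(values)


def events_passing(psi_rows, max_nan_fraction):
    """Keep event IDs whose NaN fraction is at most max_nan_fraction."""
    return [event for event, values in psi_rows.items()
            if nan_fraction(values) <= max_nan_fraction]


nan = float("nan")
# Hypothetical PSI rows: event ID -> one value per cell.
demo_rows = {
    "event_A": [0.10, 0.20, 0.15, 0.30],
    "event_B": [0.50, nan, 0.45, nan],
    "event_C": [nan, nan, nan, 0.90],
}

strict = events_passing(demo_rows, 0.0)    # no missing values tolerated
lenient = events_passing(demo_rows, 0.5)   # up to half the cells may be NaN

print(strict)
print(lenient)
```

Running this on each condition's PSI matrix gives a quick sense of whether a strict setting would leave any events to test.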

Another alternative is to take the whole matrix and perform a regression with a linear model. This is not part of SUPPA, but we have used it in work we are currently finishing, and it is very useful for controlling for other potential variables within each sample group.

SUPPA does not have any problem using a different number of samples per group. You pass the file for each group separately, and it will make the comparisons between them regardless of the number of samples per group. You only need at least two per group. However, the number of samples per group must be the same if you want to use a paired test.

I hope this helps

Thanks

Eduardo

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

EduEyras • Dec 30 '20 02:12

Thanks so much for the in-depth response. I'll let you know if these suggestions work!

fairliereese • Dec 30 '20 02:12

Hi Fairlie,

Just a suggestion, but I noticed "-m emperical" in your command. This is a spelling mistake; it should be "-m empirical".

Good luck!

danphillips28 • Jan 24 '21 14:01