
Problems replicating TRA2 tutorial results

Open avilaHugo opened this issue 2 years ago • 3 comments

Dear developers, first of all, thanks for this amazing tool!

I'm writing this because I couldn't replicate the TRA2 results from the tutorial.

I cloned SUPPA (1c0ad91) and built its dependencies with conda (suppa_env.yaml). I didn't use the SUPPA version 2.3 package from conda directly because it gave me this error:

ERROR:main:Unknown error: (<class 'UnboundLocalError'>, UnboundLocalError("local variable 'i' referenced before assignment"), <traceback object at 0x179516080>)

This error did not occur with the latest git code (1c0ad91).

The first problem appeared in the psiPerEvent calculation: the program runs, but I get a lot of errors like these:

ERROR:psiCalculator:transcript ENST00000514649 not found in the "expression file".
ERROR:psiCalculator:PSI not calculated for event ENSG00000120949;A3:chr1:12195670-12198286:12195670-12198289:+.
ERROR:psiCalculator:transcript ENST00000529606 not found in the "expression file".
ERROR:psiCalculator:PSI not calculated for event ENSG00000142621;A3:chr1:15695998-15700998:15695998-15701001:+.
ERROR:psiCalculator:transcript ENST00000544435 not found in the "expression file".
ERROR:psiCalculator:PSI not calculated for event ENSG00000162521;A3:chr1:33116923-33117515:33116923-33117518:+.
ERROR:psiCalculator:transcript ENST00000544435 not found in the "expression file".
ERROR:psiCalculator:PSI not calculated for event ENSG00000162521;A3:chr1:33138502-33145236:33138502-33145241:+.
ERROR:psiCalculator:transcript ENST00000484445 not found in the "expression file".
ERROR:psiCalculator:PSI not calculated for event ENSG00000187801;A3:chr1:40915847-40916328:40915847-40916337:+
...

Is this normal? I used the Ensembl references (FASTA and GTF) from the tutorial.

After this analysis I managed to generate the plot with the generate_boxplot_event.py script, but the graph does not look like the one in the tutorial. Did you do any QC filtering on the reads before the analysis?

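[image: boxplot_TRA2] https://user-images.githubusercontent.com/53014804/164727590-747b4c1b-444c-4f91-bd1f-7c9d66b770b5.png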

The big problem is that I couldn't complete the step "Differential splicing with local events": the program runs and prints "done" but does not generate the .dpsi table. It generates only two files: "TRA2_diffSplice.psivec" and "TRA2_diffSplice.dpsi.temp.0".

My system runs CentOS Linux release 7.9. I used 20 cores and 100 RAM.

Here are all the command lines I used:

# Download files 
parallel-fastq-dump --sra-id SRR1513329 --threads 8 --outdir data/fastq/ --split-files --gzip 
parallel-fastq-dump --sra-id SRR1513330 --threads 8 --outdir data/fastq/ --split-files --gzip 
parallel-fastq-dump --sra-id SRR1513331 --threads 8 --outdir data/fastq/ --split-files --gzip 
parallel-fastq-dump --sra-id SRR1513332 --threads 8 --outdir data/fastq/ --split-files --gzip 
parallel-fastq-dump --sra-id SRR1513333 --threads 8 --outdir data/fastq/ --split-files --gzip 
parallel-fastq-dump --sra-id SRR1513334 --threads 8 --outdir data/fastq/ --split-files --gzip 

# salmon create index:
salmon index -p 20 -t data/ensemble/hg19_EnsenmblGenes_sequence_ensenmbl.fasta -i data/ensemble/index

# suppa extract events from Ensembl:

mkdir -p data/ensemble/events_splited && python3 /home/hugo.avila/hugo.avila/repo/SUPPA-2.3/suppa.py generateEvents -i data/ensemble/Homo_sapiens.GRCh37.75.formatted.gtf -o data/ensemble/events_splited/ensemble -e SE SS MX RI FL -f ioe

# suppa merge ensemble events:
bash workflow/scripts/merge_events.sh data/ensemble/events_splited/*.ioe > data/ensemble/ensembl_hg19.events.ioe


# salmon sample quantification:
 salmon quant -i data/ensemble/index -l ISF --gcBias -1 data/fastq/SRR1513334_1.fastq -2  data/fastq/SRR1513334_2.fastq -p 20 -o results/salmon/SRR1513334

salmon quant -i data/ensemble/index -l ISF --gcBias -1 data/fastq/SRR1513329_1.fastq -2  data/fastq/SRR1513329_2.fastq -p 20 -o results/salmon/SRR1513329

salmon quant -i data/ensemble/index -l ISF --gcBias -1 data/fastq/SRR1513330_1.fastq -2  data/fastq/SRR1513330_2.fastq -p 20 -o results/salmon/SRR1513330

salmon quant -i data/ensemble/index -l ISF --gcBias -1 data/fastq/SRR1513332_1.fastq -2  data/fastq/SRR1513332_2.fastq -p 20 -o results/salmon/SRR1513332

salmon quant -i data/ensemble/index -l ISF --gcBias -1 data/fastq/SRR1513331_1.fastq -2  data/fastq/SRR1513331_2.fastq -p 20 -o results/salmon/SRR1513331

salmon quant -i data/ensemble/index -l ISF --gcBias -1 data/fastq/SRR1513333_1.fastq -2  data/fastq/SRR1513333_2.fastq -p 20 -o results/salmon/SRR1513333

# salmon merge tables:
######## OBS: I did not always run the samples in sorted order (SRR1513329, SRR1513330, ...).
python3 workflow/scripts/multipleFieldSelection.py -i results/salmon/SRR1513330/quant.sf results/salmon/SRR1513332/quant.sf results/salmon/SRR1513331/quant.sf results/salmon/SRR1513333/quant.sf results/salmon/SRR1513334/quant.sf results/salmon/SRR1513329/quant.sf -k 1 -f 4 -o results/salmon/iso_tpm.txt

# salmon format id:
Rscript workflow/scripts/format_Ensembl_ids.R results/salmon/iso_tpm.txt

# suppa get all samples events:
python3 /home/hugo.avila/hugo.avila/repo/SUPPA-2.3/suppa.py psiPerEvent -i data/ensemble/ensembl_hg19.events.ioe -e results/salmon/iso_tpm_formatted.txt -o results/suppa/TRA2_events

# correct input plot:
# This is a simple one-liner that makes the .psi table match the one in the tutorial (adds an EventID header and sorts the columns).
workflow/scripts/sort_samples.sh results/suppa/TRA2_events.psi > results/suppa/TRA2_events_sorted.psi

# create box plot:
mkdir -p results/suppa/boxplot && workflow/scripts/generate_boxplot_event.py -i results/suppa/TRA2_events_sorted.psi -e 'ENSG00000149554;SE:chr11:125496728-125497502:125497725-125499127:+' -g 1-3,4-6 -c NC,KD -o results/suppa/boxplot

# split by condition:
workflow/scripts/split_file.R results/salmon/iso_tpm_formatted.txt SRR1513329,SRR1513330,SRR1513331 SRR1513332,SRR1513333,SRR1513334 results/suppa/split_conditions/TRA2_NC_iso.tpm results/suppa/split_conditions/TRA2_KD_iso.tpm -i

workflow/scripts/split_file.R results/suppa/TRA2_events.psi SRR1513329,SRR1513330,SRR1513331 SRR1513332,SRR1513333,SRR1513334 results/suppa/split_conditions/TRA2_NC_events.psi results/suppa/split_conditions/TRA2_KD_events.psi -e

# diff splicing analysis:
python3 /home/hugo.avila/hugo.avila/repo/SUPPA-2.3/suppa.py diffSplice -m empirical -gc -i data/ensemble/ensembl_hg19.events.ioe -p results/suppa/split_conditions/TRA2_KD_events.psi results/suppa/split_conditions/TRA2_NC_events.psi -e results/suppa/split_conditions/TRA2_KD_events.psi results/suppa/split_conditions/TRA2_NC_events.psi -o results/suppa/split_conditions/TRA2_diffSplice
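
One thing worth checking in the diffSplice command above: -e receives the same two .psi files as -p, whereas the "split by condition" step wrote the per-condition expression tables to TRA2_NC_iso.tpm and TRA2_KD_iso.tpm. Assuming -e is meant to take those expression files (this is an assumption based on the split step, not something confirmed in this thread), a small pre-flight check along these lines could catch the mix-up before running. The helper name is hypothetical and the sketch is not part of SUPPA.

```python
import shlex


def check_diffsplice_args(command):
    """Collect the files given to -p and -e and flag obvious mix-ups.

    Hypothetical helper: it only inspects the command string and does not
    open any file or call SUPPA.
    """
    tokens = shlex.split(command)
    groups = {"-p": [], "-e": []}
    current = None
    for token in tokens:
        if token.startswith("-"):
            # A new option starts; only -p and -e are tracked.
            current = token if token in groups else None
        elif current is not None:
            groups[current].append(token)

    problems = []
    shared = sorted(set(groups["-p"]) & set(groups["-e"]))
    if shared:
        problems.append("same file passed to -p and -e: " + ", ".join(shared))
    if len(groups["-p"]) != len(groups["-e"]):
        problems.append("-p and -e list a different number of files")
    return problems


if __name__ == "__main__":
    # Shortened, made-up paths that mirror the shape of the command above.
    cmd = (
        "suppa.py diffSplice -m empirical -gc -i events.ioe "
        "-p KD_events.psi NC_events.psi "
        "-e KD_events.psi NC_events.psi -o out"
    )
    for problem in check_diffsplice_args(cmd):
        print(problem)
```

The check is deliberately shallow: it compares file names only, so it cannot tell whether a file really contains PSI or TPM values.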

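suppa_env.yaml.txt https://github.com/comprna/SUPPA/files/8541638/suppa_env.yaml.txt
salmon_env.yaml.txt https://github.com/comprna/SUPPA/files/8541639/salmon_env.yaml.txt
results_suppa.zip https://github.com/comprna/SUPPA/files/8541705/results_suppa.zip
results_salmon.zip https://github.com/comprna/SUPPA/files/8541710/results_samon.zip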

avilaHugo avatar Apr 22 '22 14:04 avilaHugo

Hi Hugo,

Sorry for the delay in replying.

The errors you got can be common if your expression file is truncated or if the event file (.ioe) contains transcripts that do not have any expression.
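
A quick way to tell these cases apart is to list the transcript IDs that the event file references but the expression table lacks. The sketch below is only an illustration and is not part of SUPPA. It assumes the .ioe file is tab-separated with a header line and comma-separated transcript IDs in its fourth and fifth columns, and that the expression file has a header line followed by rows that start with the transcript ID. The function name is hypothetical.

```python
import io


def missing_transcripts(ioe_lines, expression_lines):
    """Return transcript IDs used by events but absent from the expression table.

    Assumed layouts (not verified against SUPPA itself):
      - .ioe: header line, then tab-separated rows whose 4th and 5th columns
        hold comma-separated transcript IDs.
      - expression file: header line, then rows starting with a transcript ID.
    """
    expressed = set()
    for n, line in enumerate(expression_lines):
        fields = line.rstrip("\n").split("\t")
        if n == 0 or not fields[0]:
            continue  # skip the header and blank lines
        expressed.add(fields[0])

    referenced = set()
    for n, line in enumerate(ioe_lines):
        fields = line.rstrip("\n").split("\t")
        if n == 0 or len(fields) < 5:
            continue  # skip the header and malformed rows
        for column in (fields[3], fields[4]):
            referenced.update(t for t in column.split(",") if t)

    return sorted(referenced - expressed)


if __name__ == "__main__":
    # Tiny made-up inputs, only to show the call.
    ioe = io.StringIO(
        "seqname\tgene_id\tevent_id\talternative_transcripts\ttotal_transcripts\n"
        "chr1\tG1\tG1;A3:chr1:1-2:1-3:+\tT1\tT1,T2\n"
    )
    tpm = io.StringIO("sample1\tsample2\nT1\t1.0\t2.0\n")
    print(missing_transcripts(ioe, tpm))
```

With real files, passing two open file handles works the same way. An empty result would point away from missing IDs and toward a truncated or mis-formatted table.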

You still got results, which suggests that it is not a format issue or a problem with the transcript IDs, I guess.

We do not encounter the error with the diffSplice analysis. Could this be a Python version issue?
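
To narrow this down, it may help to record the interpreter version together with the output files that were actually written. This is a minimal sketch, assuming the three suffixes mentioned in the report (.dpsi, .psivec, .dpsi.temp.0). The helper name is hypothetical and the code does not call SUPPA.

```python
import os
import sys


def report_outputs(prefix, suffixes=(".dpsi", ".psivec", ".dpsi.temp.0")):
    """Map each expected output path to whether it exists on disk."""
    return {prefix + s: os.path.exists(prefix + s) for s in suffixes}


if __name__ == "__main__":
    # Version of the interpreter running this script.
    print("Python", sys.version.split()[0])
    # Output prefix taken from the -o argument in the report.
    prefix = "results/suppa/split_conditions/TRA2_diffSplice"
    for path, found in report_outputs(prefix).items():
        print(path, "found" if found else "missing")
```

Running it with the same interpreter used for suppa.py keeps the reported version meaningful.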

I hope this helps.

E.


-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

EduEyras avatar May 08 '22 03:05 EduEyras

Thanks for the reply, @EduEyras!

> You still got results, which means that it is not a format issue or a problem with the transcript IDs, I guess.

Could you confirm that the current tutorial support files (Ensembl FASTA and GTF) and the command lines are the same ones used to generate those outputs?

> We do not encounter the error with the diffSplice analysis. Could this be a python version issue?

Maybe. I will check this and come back with the answer.

avilaHugo avatar May 10 '22 15:05 avilaHugo

Hi,

Yes, the wiki is self-contained.

The data used are the ones provided.

Best,

E.


EduEyras avatar May 11 '22 00:05 EduEyras