xqtl-protocol icon indicating copy to clipboard operation
xqtl-protocol copied to clipboard

Misspecification gtf in STAR_outline MWE of RNA_calling,

Open hsun3163 opened this issue 3 years ago • 9 comments

Originally, to generate the STAR_Index and RSEM_index, the same gtf, the gtf prior to collapse by gene should be used. However, as we decided to do STAR_indexing on the fly, it slip my mind and gave a wrong example in the RNA_calling pipeline test MWE reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf.

Therefore, all the Aligned.toTranscript.bam that were generated by Traves and Shrithee can not be used to conduct RSEM analysis.

Luckily, the main expression matrix we used is the one from rnaseqc, based on the Aligned.byCoordinated.bam, which should not be impacted by the gene vs transcriptome.

To fix it, STAR_align (Not the star_output since that will overwrite the existing md.bam which will take double the time) needs to be rerun to regenerate the Aligned.toTranscript.bam file (edited)

The consequence is that the RSEM bed table will not be generated on time

hsun3163 avatar Sep 13 '22 16:09 hsun3163

The error message this will produce is: Warning: The SAM/BAM file declares less reference sequences (58678) than RSEM knows (234381)! and RSEM can not recognize reference sequence name ENSG00000223972!

hsun3163 avatar Sep 13 '22 16:09 hsun3163

Beside the toTranscriptome.bam, another file that will be impacted is the .SJ.out.tab which will be needed by psichomics to generate the sQTL.

hs3163@node65:/mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing$ wc output/rnaseq_arch/22.SJ.out.tab
 12643 113787 477595 output/rnaseq_arch/22.SJ.out.tab
hs3163@node65:/mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing$ wc output/rnaseq/Sample_22.SJ.out.tab
 13191 118719 497187 output/rnaseq/Sample_22.SJ.out.tab

in principle however, the alternative to psichomics, leafcutter, shall not be impacted for it used the Aligned.byCoordinate.bam. Which should be independent of the gtf.

@Rhopala could you update the ticket with supporting evidence while u have it?

hsun3163 avatar Sep 13 '22 19:09 hsun3163

So, toTranscriptome and SJ.out.tab are not to be used at the moment?

Shrishtee-kandoi avatar Sep 13 '22 19:09 Shrishtee-kandoi

So, toTranscriptome and SJ.out.tab are not to be used at the moment?

Unfortunately, yes

hsun3163 avatar Sep 13 '22 19:09 hsun3163

@Shrishtee-kandoi let's focus on generating results for leafcutter for the upcoming release.

gaow avatar Sep 13 '22 21:09 gaow

Sure, yes !

Shrishtee-kandoi avatar Sep 14 '22 20:09 Shrishtee-kandoi

As it turns out, the output of the leafcutter is indeed impacted. Am verifying whether the result for RNAseq will be. It really should not because rnaseqc uses the same gene.gtf anyway.

hsun3163 avatar Sep 16 '22 17:09 hsun3163

After testing we found the Aligned.byCoordinated.bam files are actually impacted. Only tiny differences are observed in the bamfiles such as: image The intron usage ratio could have a ±1 for numerator and denominators: image and cause extensive value change & row number change in final phenotype table: image image

Rhopala avatar Sep 16 '22 17:09 Rhopala

Let's call the what we have before using the collapsed gtf the impacted bam and then the good bam

The difference in the gene_count table for gene expression. 1298/1372 from the impacted bam and 1304/1382 from the good bam are correctly mapped to chrom 21 and 22. 14 out of 1294 genes have a correlation < 0.9. In summary, there are not huge differences but there are differences.

hsun3163 avatar Sep 16 '22 19:09 hsun3163