rnaseq Unclear description of output files for alignment and quantification step

Unclear description of output files for alignment and quantification step

Open danwiththeplan opened this issue 2 years ago • 4 comments

Request for unambiguous description of the most important output files

First, apologies and feel free to close this issue if this has been addressed in nf-core/rnaseq/3.4.

When running nf-core/rnaseq/3.0, the description of the output files in the directory star_salmon is incomplete. In the Output docs the description of the output files is as follows:

star_salmon/
*.Aligned.out.bam: If --save_align_intermeds is specified the original BAM file containing read alignments to the reference genome will be placed in this directory.
*.Aligned.toTranscriptome.out.bam: If --save_align_intermeds is specified the original BAM file containing read alignments to the transcriptome will be placed in this directory.
star_salmon/log/
*.SJ.out.tab: File containing filtered splice junctions detected after mapping the reads.
*.Log.final.out: STAR alignment report containing the mapping results summary.
*.Log.out and *.Log.progress.out: STAR log files containing detailed information about the run. Typically only useful for debugging purposes.
star_salmon/unmapped/
*.fastq.gz: If --save_unaligned is specified, FastQ files containing unmapped reads will be placed in this directory.

I have just run nf-core/rnaseq/3.0 with the following parameters, the run completed successfully.

nextflow \
run \
nf-core/rnaseq \
-r 3.0 \
-profile singularity \
--input samplesheet.csv \
--fasta mygenome.fsa \
--gff mygff.gff3 \
--skip_biotype_qc \
--outdir my_output_dir \
--remove_ribo_rna \
--ribo_database_manifest rRNAmanifest.txt \
--save_non_ribo_reads

For simplicity I've only included 1 sample bam and bam.bai file (there were 60), but the output files I got in the directory star_salmon were:


mysample.markdup.sorted.bam
mysample.markdup.sorted.bam.bai
salmon.merged.gene_counts_length_scaled.rds
salmon.merged.gene_counts_length_scaled.tsv
salmon.merged.gene_counts.rds
salmon.merged.gene_counts_scaled.rds
salmon.merged.gene_counts_scaled.tsv
salmon.merged.gene_counts.tsv
salmon.merged.gene_tpm_length_scaled.tsv
salmon.merged.gene_tpm_scaled.tsv
salmon.merged.gene_tpm.tsv
salmon.merged.transcript_counts.rds
salmon.merged.transcript_counts.tsv
salmon.merged.transcript_tpm.tsv

The output folders I got in the star_salmon directory were:

mysample (1 directory for each sample, there were 60 but I've just included one for simplicity)
bigwig
dupradar
log
picard_metrics
preseq
qualimap
rseqc
samtools_stats
stringtie

Each individual sample had an output subfolder named after the sample name (ie mysample). Within each individual sample folder, the output files I got were:

mysample.salmon.gene_counts_length_scaled.tsv
mysample.salmon.gene_counts_scaled.tsv
mysample.salmon.gene_counts.tsv
mysample.salmon.gene_tpm_length_scaled.tsv
mysample.salmon.gene_tpm_scaled.tsv
mysample.salmon.gene_tpm.tsv
mysample.salmon.transcript_counts.tsv
mysample.salmon.transcript_tpm.tsv
aux_info (directory)
cmd_info.json
libParams (directory)
logs (directory)
quant.genes.sf
quant.sf

Describe the solution you'd like

This is a great piece of work and I really appreciate what's been done here. I can have a pretty good guess about what all these files are, and I suspect that a lot of the files are either irrelevant, log files, or documented in STAR or Salmon. For example, the quant.sf file is well described here

That said, a lot is left unclear (what's the difference between quant.sf and quant.genes.sf?) and there is a lot of room for confusion, especially about the difference between (for example) these three files:

mysample.salmon.gene_counts_length_scaled.tsv
mysample.salmon.gene_counts_scaled.tsv
mysample.salmon.gene_counts.tsv

What exactly is length_scaled, scaled etc?

I understand this is a lot of work, but the described outputs don't actually match the actual outputs. The most important thing for me would be an unambigous description of what these files all are:

salmon.merged.gene_counts_length_scaled.tsv
salmon.merged.gene_counts_scaled.tsv
salmon.merged.gene_counts.tsv
salmon.merged.gene_tpm_length_scaled.tsv
salmon.merged.gene_tpm_scaled.tsv
salmon.merged.gene_tpm.tsv
salmon.merged.transcript_counts.rds
salmon.merged.transcript_counts.tsv
salmon.merged.transcript_tpm.tsv

that is, what's the difference between "length_scaled", "scaled" and what is presumably raw counts, and what's the difference between gene_counts and transcript_counts.

As I said this is a great piece of work, really appreciate it, and I can make a pretty good guess about what all these files are. For example, if I look at one subfolder:

>cat mysample/quant.sf|grep gene1
gene1     1901    1651.000        33.610447       898.823

..then look at `mysample/mysample_salmon_gene_counts.tsv

>head -2 mysample/mysample.salmon.gene_counts.tsv
mysample
gene1        898.823

..then look at salmon.merged.gene_counts.tsv (removed the other 59 samples for clarity)

>head -2 salmon.merged.gene_counts.tsv
gene_id mysample
gene1 898.823

Then I can see what's going on here. But it would be nice if the outputs more closely matched the outputs described in the manual and for the difference between "length_scaled", "scaled" and what is presumably raw counts, and the difference between gene_counts and transcript_counts to be explicitly stated.

I know this seems obvious, but I've seen a lot of confusion in this area, and any room for confusion or any requirement for detective work can lead to a lot of bad code and bad analyses.

Thanks very much!

Describe alternatives you've considered

Since a lot of this information is probably in external documents maybe a link to those for now?

Nov 09 '21 02:11 danwiththeplan

The description of output files that @danwiththeplan mentioned is very important for the users who are not familiar with these tools to perform downstream analysis and data interpretation. Really appreciate your amazing works done for the communities.

Nov 15 '21 08:11 danielsu0523

I also appreciate the tremendous work on this, but came across this as well in the Output files section for Salmon. Specifically, the R object files (salmon.merged.gene_counts.rds, salmon.merged.gene_counts_scaled.rds, and salmon.merged.gene_counts_length_scaled.rds) all have the same description:

RDS object that can be loaded in R that contains a [SummarizedExperiment](https://bioconductor.org/packages/
release/bioc/html/SummarizedExperiment.html) container with the TPM (abundance), estimated counts (counts) 
and transcript length (length) in the assays slot for genes.

The documentation already references tximport processing, but it may be helpful to describe these a bit more to minimize user confusion.

Feb 23 '22 16:02 schaffman5

Hi guys! Thanks for posting here and for using the pipeline. I agree that these docs are a little sparse and can be a little confusing. If you are able to help improve the docs based on your learnings that would be great. I have also asked in the #rnaseq channel on nf-core Slack to see if we can get some volunteers.

Apr 26 '22 14:04 drpatelh

Hi!

I was looking at the current state of the outputs.md file and saw that there are definitions and differences for all the salmon output files are already in the output documentation, under the Salmon section and not the STAR section. I'm unaware if they already were there when the issue was opened, I see they were added six months ago but the issue was opened before that, so I'm wondering if this is already solved and the issue can be closed, or if the definitions can be further improved.

I do see that the definitions of .rds files are still the same between them, but I can change this of course.

Let me know, thanks!

Oct 10 '22 13:10 ctuni

@danwiththeplan @danielsu0523 @schaffman5 were the changes @ctuni added in https://github.com/nf-core/rnaseq/pull/885 satisfactory to clarify the points you raised here? If not, be great if you are able to suggest what else we could improve and a PR would be a even better. Thanks!

Dec 16 '22 11:12 drpatelh

Closing for now but feel to re-open this or a new issue (or a PR!) if required.

Dec 19 '22 10:12 drpatelh

rnaseq rnaseq copied to clipboard

Unclear description of output files for alignment and quantification step

Request for unambiguous description of the most important output files

Describe the solution you'd like

Describe alternatives you've considered

rnaseq
rnaseq copied to clipboard