rnaseq
rnaseq copied to clipboard
Unclear description of output files for alignment and quantification step
Request for unambiguous description of the most important output files
First, apologies and feel free to close this issue if this has been addressed in nf-core/rnaseq/3.4
.
When running nf-core/rnaseq/3.0
, the description of the output files in the directory star_salmon
is incomplete.
In the Output docs the description of the output files is as follows:
star_salmon/
*.Aligned.out.bam: If --save_align_intermeds is specified the original BAM file containing read alignments to the reference genome will be placed in this directory.
*.Aligned.toTranscriptome.out.bam: If --save_align_intermeds is specified the original BAM file containing read alignments to the transcriptome will be placed in this directory.
star_salmon/log/
*.SJ.out.tab: File containing filtered splice junctions detected after mapping the reads.
*.Log.final.out: STAR alignment report containing the mapping results summary.
*.Log.out and *.Log.progress.out: STAR log files containing detailed information about the run. Typically only useful for debugging purposes.
star_salmon/unmapped/
*.fastq.gz: If --save_unaligned is specified, FastQ files containing unmapped reads will be placed in this directory.
I have just run nf-core/rnaseq/3.0
with the following parameters, the run completed successfully.
nextflow \
run \
nf-core/rnaseq \
-r 3.0 \
-profile singularity \
--input samplesheet.csv \
--fasta mygenome.fsa \
--gff mygff.gff3 \
--skip_biotype_qc \
--outdir my_output_dir \
--remove_ribo_rna \
--ribo_database_manifest rRNAmanifest.txt \
--save_non_ribo_reads
For simplicity I've only included 1 sample bam
and bam.bai
file (there were 60), but the output files I got in the directory star_salmon
were:
mysample.markdup.sorted.bam
mysample.markdup.sorted.bam.bai
salmon.merged.gene_counts_length_scaled.rds
salmon.merged.gene_counts_length_scaled.tsv
salmon.merged.gene_counts.rds
salmon.merged.gene_counts_scaled.rds
salmon.merged.gene_counts_scaled.tsv
salmon.merged.gene_counts.tsv
salmon.merged.gene_tpm_length_scaled.tsv
salmon.merged.gene_tpm_scaled.tsv
salmon.merged.gene_tpm.tsv
salmon.merged.transcript_counts.rds
salmon.merged.transcript_counts.tsv
salmon.merged.transcript_tpm.tsv
The output folders I got in the star_salmon
directory were:
mysample (1 directory for each sample, there were 60 but I've just included one for simplicity)
bigwig
dupradar
log
picard_metrics
preseq
qualimap
rseqc
samtools_stats
stringtie
Each individual sample had an output subfolder named after the sample name (ie mysample
).
Within each individual sample folder, the output files I got were:
mysample.salmon.gene_counts_length_scaled.tsv
mysample.salmon.gene_counts_scaled.tsv
mysample.salmon.gene_counts.tsv
mysample.salmon.gene_tpm_length_scaled.tsv
mysample.salmon.gene_tpm_scaled.tsv
mysample.salmon.gene_tpm.tsv
mysample.salmon.transcript_counts.tsv
mysample.salmon.transcript_tpm.tsv
aux_info (directory)
cmd_info.json
libParams (directory)
logs (directory)
quant.genes.sf
quant.sf
Describe the solution you'd like
This is a great piece of work and I really appreciate what's been done here. I can have a pretty good guess about what all these files are, and I suspect that a lot of the files are either irrelevant, log files, or documented in STAR or Salmon. For example, the quant.sf
file is well described here
That said, a lot is left unclear (what's the difference between quant.sf
and quant.genes.sf
?) and there is a lot of room for confusion, especially about the difference between (for example) these three files:
mysample.salmon.gene_counts_length_scaled.tsv
mysample.salmon.gene_counts_scaled.tsv
mysample.salmon.gene_counts.tsv
What exactly is length_scaled
, scaled
etc?
I understand this is a lot of work, but the described outputs don't actually match the actual outputs. The most important thing for me would be an unambigous description of what these files all are:
salmon.merged.gene_counts_length_scaled.tsv
salmon.merged.gene_counts_scaled.tsv
salmon.merged.gene_counts.tsv
salmon.merged.gene_tpm_length_scaled.tsv
salmon.merged.gene_tpm_scaled.tsv
salmon.merged.gene_tpm.tsv
salmon.merged.transcript_counts.rds
salmon.merged.transcript_counts.tsv
salmon.merged.transcript_tpm.tsv
that is, what's the difference between "length_scaled", "scaled" and what is presumably raw counts, and what's the difference between gene_counts
and transcript_counts
.
As I said this is a great piece of work, really appreciate it, and I can make a pretty good guess about what all these files are. For example, if I look at one subfolder:
>cat mysample/quant.sf|grep gene1
gene1 1901 1651.000 33.610447 898.823
..then look at `mysample/mysample_salmon_gene_counts.tsv
>head -2 mysample/mysample.salmon.gene_counts.tsv
mysample
gene1 898.823
..then look at salmon.merged.gene_counts.tsv
(removed the other 59 samples for clarity)
>head -2 salmon.merged.gene_counts.tsv
gene_id mysample
gene1 898.823
Then I can see what's going on here. But it would be nice if the outputs more closely matched the outputs described in the manual and for the difference between "length_scaled", "scaled" and what is presumably raw counts, and the difference between gene_counts
and transcript_counts
to be explicitly stated.
I know this seems obvious, but I've seen a lot of confusion in this area, and any room for confusion or any requirement for detective work can lead to a lot of bad code and bad analyses.
Thanks very much!
Describe alternatives you've considered
Since a lot of this information is probably in external documents maybe a link to those for now?
The description of output files that @danwiththeplan mentioned is very important for the users who are not familiar with these tools to perform downstream analysis and data interpretation. Really appreciate your amazing works done for the communities.
I also appreciate the tremendous work on this, but came across this as well in the Output files section for Salmon. Specifically, the R object files (salmon.merged.gene_counts.rds
, salmon.merged.gene_counts_scaled.rds
, and salmon.merged.gene_counts_length_scaled.rds
) all have the same description:
RDS object that can be loaded in R that contains a [SummarizedExperiment](https://bioconductor.org/packages/
release/bioc/html/SummarizedExperiment.html) container with the TPM (abundance), estimated counts (counts)
and transcript length (length) in the assays slot for genes.
The documentation already references tximport processing, but it may be helpful to describe these a bit more to minimize user confusion.
Hi guys! Thanks for posting here and for using the pipeline. I agree that these docs are a little sparse and can be a little confusing. If you are able to help improve the docs based on your learnings that would be great. I have also asked in the #rnaseq channel on nf-core Slack to see if we can get some volunteers.
Hi!
I was looking at the current state of the outputs.md
file and saw that there are definitions and differences for all the salmon
output files are already in the output documentation, under the Salmon section and not the STAR section. I'm unaware if they already were there when the issue was opened, I see they were added six months ago but the issue was opened before that, so I'm wondering if this is already solved and the issue can be closed, or if the definitions can be further improved.
I do see that the definitions of .rds
files are still the same between them, but I can change this of course.
Let me know, thanks!
@danwiththeplan @danielsu0523 @schaffman5 were the changes @ctuni added in https://github.com/nf-core/rnaseq/pull/885 satisfactory to clarify the points you raised here? If not, be great if you are able to suggest what else we could improve and a PR would be a even better. Thanks!
Closing for now but feel to re-open this or a new issue (or a PR!) if required.