rnaseq
Ensembl GRCh38
As discussed at length in #159, the current default GRCh38 genome (iGenome) confusingly uses the NCBI reference, whereas other genomes use Ensembl by default (e.g. GRCh37, GRCm38).
This configuration can be seen in the `igenomes.config` file here:
https://github.com/nf-core/rnaseq/blob/bc5fc76f40b2da6082a854927184c9d6e5060393/conf/igenomes.config
This related comment documents how a user can define custom input of FASTA and GTF files:
> You can use the `--fasta` and `--gtf` flags to use your own files. You can download them from http://www.ensembl.org/index.html. If you then also add `--saveReference`, the indices for STAR and the BED file are stored in the results. If you want to rerun the pipeline in the same or another project, you can supply these generated references with `--star_index` and `--bed12`.
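To make the two-step workflow in that comment concrete, here is a minimal sketch that only assembles the command lines; it does not run anything. The flag names come from the quoted comment. The file paths are hypothetical placeholders, and a real invocation would need further arguments (input reads, a profile) that depend on the pipeline version, so treat this as an illustration rather than a complete command.

```python
import shlex


def build_rnaseq_command(fasta, gtf, save_reference=False, star_index=None, bed12=None):
    """Assemble an nf-core/rnaseq command line using the flags quoted above.

    Only the reference-related flags are covered; reads, profile and other
    required arguments are deliberately left out of this sketch.
    """
    cmd = ["nextflow", "run", "nf-core/rnaseq", "--fasta", str(fasta), "--gtf", str(gtf)]
    if save_reference:
        cmd.append("--saveReference")
    if star_index:
        cmd += ["--star_index", str(star_index)]
    if bed12:
        cmd += ["--bed12", str(bed12)]
    return cmd


# First run: build the indices and keep them (paths are hypothetical).
first_run = build_rnaseq_command("genome.fa", "genes.gtf", save_reference=True)

# Later runs: reuse the saved STAR index and BED12 file (hypothetical locations).
rerun = build_rnaseq_command(
    "genome.fa",
    "genes.gtf",
    star_index="saved_reference/star",
    bed12="saved_reference/genes.bed",
)

print(shlex.join(first_run))
print(shlex.join(rerun))
```

Note that `--fasta` here would be a genome FASTA, not the cDNA FASTA discussed below.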
What is the current consensus on the best available files to use from Ensembl (and/or GENCODE)? I'm primarily interested in the pseudoalignment output from salmon and kallisto.
Are these reference files acceptable to use, even if they have some potential mapping issues with aligners such as STAR?
Ensembl 102, for example:
- ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
- ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz
Note that not all transcripts defined in the transcriptome FASTA are also defined in `Homo_sapiens.GRCh38.102.gtf.gz`, but all of them are defined in the alternative `Homo_sapiens.GRCh38.102.chr_patch_hapl_scaff.gtf.gz` GTF file.
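For anyone who wants to check this mismatch on their own files, here is a rough sketch that lists transcript IDs present in a transcriptome FASTA but absent from a GTF. It assumes the FASTA header starts with a versioned ID (e.g. `ENST00000631435.1`) and that the GTF carries an unversioned `transcript_id "..."` attribute, so version suffixes are stripped before comparing. The function names are my own and not part of any existing tool.

```python
import gzip
import re
from pathlib import Path

TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def strip_version(transcript_id):
    """Drop a trailing '.N' version suffix (assumed Ensembl-style IDs)."""
    return transcript_id.split(".", 1)[0]


def fasta_transcript_ids(fasta_path):
    """Collect the first token of every FASTA header line, without version."""
    ids = set()
    with open_text(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(strip_version(fields[0]))
    return ids


def gtf_transcript_ids(gtf_path):
    """Collect every transcript_id attribute value in the GTF, without version."""
    ids = set()
    with open_text(gtf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header/comment lines
            match = TRANSCRIPT_ID_RE.search(line)
            if match:
                ids.add(strip_version(match.group(1)))
    return ids


def missing_from_gtf(fasta_path, gtf_path):
    """Return transcript IDs found in the FASTA but not in the GTF."""
    return fasta_transcript_ids(fasta_path) - gtf_transcript_ids(gtf_path)
```

Calling `missing_from_gtf("Homo_sapiens.GRCh38.cdna.all.fa.gz", "Homo_sapiens.GRCh38.102.gtf.gz")` on local copies of the two files should then return the set of transcripts that would lack annotation.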
See also Heng Li's post on which human reference to use for GRCh38, and tximeta, which is working to improve this on the R side of things.
Current hashtable of recommendations from tximeta: https://github.com/mikelove/tximeta/blob/master/inst/extdata/hashtable.csv
Best, Mike