
Ensembl GRCh38

Open mjsteinbaugh opened this issue 3 years ago • 10 comments

As discussed at length in #159, the current default GRCh38 genome (iGenome) confusingly uses the NCBI reference, whereas other genomes use Ensembl by default (e.g. GRCh37, GRCm38).

This configuration can be seen in the igenomes.config file here: https://github.com/nf-core/rnaseq/blob/bc5fc76f40b2da6082a854927184c9d6e5060393/conf/igenomes.config

This related comment documents how a user can define custom input of FASTA and GTF files:

You can use the --fasta and --gtf flag to use your own files. You can download them from http://www.ensembl.org/index.html.

If you then also add --saveReference, the STAR indices and the BED file are stored in the results. If you want to rerun the pipeline in the same or another project, you can supply these generated references with --star_index and --bed12.
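
For illustration, here is a minimal Python sketch that assembles the two invocations described in that comment. Only the flags quoted above (--fasta, --gtf, --saveReference, --star_index, --bed12) come from the comment; the helper name build_rnaseq_command and all file paths are hypothetical placeholders, and any other options the pipeline needs (reads, profile, and so on) would have to be passed through extra_args.

```python
import shlex


def build_rnaseq_command(fasta, gtf, save_reference=False,
                         star_index=None, bed12=None, extra_args=()):
    """Assemble an nf-core/rnaseq command line as a list of arguments.

    Hypothetical helper: it only wires up the flags quoted in the comment
    above and does not validate them against any pipeline version.
    """
    cmd = ["nextflow", "run", "nf-core/rnaseq", "--fasta", fasta, "--gtf", gtf]
    if save_reference:
        # Ask the pipeline to keep the generated STAR index and BED file.
        cmd.append("--saveReference")
    if star_index:
        # Reuse a STAR index saved by an earlier run.
        cmd += ["--star_index", star_index]
    if bed12:
        # Reuse a BED12 file saved by an earlier run.
        cmd += ["--bed12", bed12]
    cmd += list(extra_args)
    return cmd


# First run: custom reference files, keep the generated references.
first_run = build_rnaseq_command("genome.fa", "genes.gtf", save_reference=True)

# Later run: point at the saved references (placeholder paths).
rerun = build_rnaseq_command(
    "genome.fa", "genes.gtf",
    star_index="path/to/saved/star_index",
    bed12="path/to/saved/genes.bed",
)

if __name__ == "__main__":
    print(shlex.join(first_run))
    print(shlex.join(rerun))
```

The lists can be handed to subprocess.run as-is, or printed with shlex.join and pasted into a shell.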

What is the current consensus on the best available files to use from Ensembl (and/or GENCODE)? I'm primarily interested in the pseudoalignment output from salmon and kallisto.

Are these reference files acceptable to use, even if they have some potential mapping issues with aligners such as STAR?

Ensembl 102, for example:

  • ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
  • ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz

Note that not all transcripts defined in the transcriptome FASTA are also defined in Homo_sapiens.GRCh38.102.gtf.gz, but all of them are defined in the alternative Homo_sapiens.GRCh38.102.chr_patch_hapl_scaff.gtf.gz GTF file.
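
A rough way to check this kind of mismatch for any FASTA/GTF pair is sketched below in Python. It assumes Ensembl-style files: FASTA headers that start with a versioned transcript ID (e.g. ">ENST00000000001.1 cdna ...") and GTF attribute columns that carry an unversioned transcript_id "..." field. The function names are hypothetical, and other annotation sources may need different ID handling.

```python
import gzip
import re

# Matches the transcript_id attribute in a GTF attribute column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def fasta_transcript_ids(path):
    """Collect transcript IDs from FASTA headers, dropping the version suffix."""
    ids = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    # "ENST00000000001.1" -> "ENST00000000001" (assumed layout)
                    ids.add(fields[0].split(".")[0])
    return ids


def gtf_transcript_ids(path):
    """Collect every transcript_id value that appears in the GTF."""
    ids = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header comments
            match = TRANSCRIPT_ID.search(line)
            if match:
                ids.add(match.group(1))
    return ids


def missing_from_gtf(fasta_path, gtf_path):
    """Return transcript IDs present in the FASTA but absent from the GTF."""
    return fasta_transcript_ids(fasta_path) - gtf_transcript_ids(gtf_path)
```

Calling missing_from_gtf on the cDNA FASTA with each of the two GTF files in turn would show how many transcripts each annotation leaves undefined.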

See also Heng Li's post on which human reference to use for GRCh38, and tximeta, which is working to improve this on the R side of things.

Current hashtable of recommendations from tximeta: https://github.com/mikelove/tximeta/blob/master/inst/extdata/hashtable.csv

Best, Mike

mjsteinbaugh, Dec 09 '20 15:12