
Ensembl GRCh38

Open mjsteinbaugh opened this issue 5 years ago • 10 comments

As discussed at length in #159, the current default GRCh38 genome (iGenomes) confusingly uses the NCBI reference, whereas other genomes use Ensembl by default (e.g. GRCh37, GRCm38).

This configuration can be seen in the igenomes.config file here: https://github.com/nf-core/rnaseq/blob/bc5fc76f40b2da6082a854927184c9d6e5060393/conf/igenomes.config

This related comment documents how a user can define custom input of FASTA and GTF files:

You can use the --fasta and --gtf flag to use your own files. You can download them from http://www.ensembl.org/index.html.

If you then also add --saveReference, the STAR indices and the BED file are stored in the results directory. If you want to rerun the pipeline in the same or another project, you can supply these generated references with --star_index and --bed12.
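The two-step workflow described in the quoted comment (first run from --fasta and --gtf while saving the references, later runs reusing them) could be sketched roughly as below. This is only an illustration: the helper `build_rnaseq_command` and all file paths are hypothetical, other arguments a real run needs (input samples, profile) are left out, and the flag spellings are taken from this thread, where both --saveReference and --save_reference appear, so check the documentation for the release you run.

```python
import shlex


def build_rnaseq_command(fasta, gtf, save_flag=None, prebuilt=None):
    """Assemble a `nextflow run nf-core/rnaseq` argument list (hypothetical helper).

    save_flag: flag that asks the pipeline to keep generated references,
               spelled as your pipeline release expects it.
    prebuilt:  mapping of flag name to path for references saved by an earlier run.
    """
    cmd = ["nextflow", "run", "nf-core/rnaseq", "--fasta", fasta, "--gtf", gtf]
    if save_flag:
        cmd.append(save_flag)
    for flag, path in (prebuilt or {}).items():
        cmd += [flag, path]
    return cmd


# First run: build everything from FASTA + GTF and keep the generated references.
first_run = build_rnaseq_command("genome.fa", "genes.gtf", save_flag="--save_reference")

# Later run: point at the saved references (paths are placeholders).
rerun = build_rnaseq_command(
    "genome.fa",
    "genes.gtf",
    prebuilt={
        "--star_index": "results/reference/star",
        "--bed12": "results/reference/genes.bed",
    },
)

print(shlex.join(first_run))
print(shlex.join(rerun))
```

Building the command as a list rather than a string keeps paths with spaces safe if the list is later handed to `subprocess.run`.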

What is the current consensus on the best available files to use from Ensembl (and/or GENCODE)? I'm primarily interested in the pseudoalignment output from salmon and kallisto.

Are these reference files acceptable to use, even if they have some potential mapping issues with aligners such as STAR?

Ensembl 102, for example:

  • ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
  • ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz

Note that not all transcripts defined in the transcriptome FASTA are also defined in Homo_sapiens.GRCh38.102.gtf.gz, but all of them are defined in the alternative Homo_sapiens.GRCh38.102.chr_patch_hapl_scaff.gtf.gz GTF file.
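One way to check this kind of mismatch yourself is to compare the transcript IDs in the FASTA headers against the `transcript_id` attributes in the GTF. The sketch below is a minimal illustration on made-up toy records (the IDs and coordinates are invented, not real Ensembl entries). It assumes Ensembl-style FASTA headers, where the ID is the first token and carries a `.N` version suffix, and it strips that suffix on both sides before comparing.

```python
import re

_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def fasta_transcript_ids(lines):
    """Collect transcript IDs from FASTA header lines, dropping any version suffix."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            first_token = line[1:].split()[0]
            ids.add(first_token.split(".")[0])
    return ids


def gtf_transcript_ids(lines):
    """Collect transcript_id attribute values from GTF lines, dropping any version suffix."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue
        match = _TRANSCRIPT_ID.search(line)
        if match:
            ids.add(match.group(1).split(".")[0])
    return ids


# Toy input standing in for the real (gzipped) files; IDs are invented.
toy_fasta = [
    ">ENST00000000001.2 cdna chromosome:GRCh38:1:100:200:1",
    "ACGTACGT",
    ">ENST00000000002.1 cdna scaffold:GRCh38:TOY_PATCH:100:200:1",
    "GGCCGGCC",
]
toy_gtf = [
    "#!genome-build GRCh38",
    '1\thavana\ttranscript\t100\t200\t.\t+\t.\tgene_id "ENSG00000000001"; transcript_id "ENST00000000001";',
]

missing_from_gtf = fasta_transcript_ids(toy_fasta) - gtf_transcript_ids(toy_gtf)
print(sorted(missing_from_gtf))
```

For the real files you would pass the opened handles, e.g. `gzip.open(path, "rt")`, in place of the toy lists.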

See also Heng Li's post on which human reference to use for GRCh38, and tximeta, which is working to improve this on the R side of things.

Current hashtable of recommendations from tximeta: https://github.com/mikelove/tximeta/blob/master/inst/extdata/hashtable.csv

Best, Mike

mjsteinbaugh avatar Dec 09 '20 15:12 mjsteinbaugh

@mikelove @lpantano @nturaga @csoneson Any thoughts on this? I much prefer using Ensembl over NCBI because the annotations available via AnnotationHub are so much easier to use in R.

mjsteinbaugh avatar Dec 09 '20 15:12 mjsteinbaugh

I typically recommend GENCODE/Ensembl for RNA-seq as the versioning/release mechanisms (and ftp structure) are more standard and easier to navigate. But we support non-latest versions of RefSeq in tximeta.

mikelove avatar Dec 09 '20 15:12 mikelove

Hi @mjsteinbaugh ! Hope you and loved ones are well :)

Yes, you are quite right in that the default GRCh38 genome we have available for use is from the NCBI and not from ENSEMBL. This has led to quite a bit of confusion which is why I even added a warning to the pipeline in the last release to provide users with a more explicit message.

Agree with @mikelove and that is essentially what I recommended here to overcome the NCBI GRCh38 issue. These pipelines are very flexible in terms of the genome reference files you can provide and all dependencies to run the pipeline can be generated from just an input --fasta and --gtf. There are also special options that have been added to deal with GENCODE annotation if for example using pseudo-alignment with Salmon floats your ⛵.

The AWS iGenomes resource we have been using for quite some time now has been a great servant, mainly because you can provide a key, e.g. --genome GRCh37, and the pipeline will by default download all of the required reference files from an Amazon S3 bucket and generate any indices that were not explicitly provided. As you mentioned, you can also use the --save_reference parameter to store these files for re-use in other analyses. Having said all of that, we are aware that AWS iGenomes is now quite outdated, and we have been meaning to add Refgenie support in anger for quite some time now... it will be added at some point 😅

Ultimately, I think the decision will fall on the user and maybe the environment in which they are performing the analysis but ENSEMBL / GENCODE seem to be the most popular options nowadays.

drpatelh avatar Dec 09 '20 15:12 drpatelh

Thanks @drpatelh , that's what I've been doing so far and wanted to make sure that was still the current recommendation.

mjsteinbaugh avatar Dec 09 '20 16:12 mjsteinbaugh

I pretty much agree with what was said above - I tend to use GENCODE if I can, and Ensembl otherwise. If you use Ensembl cDNA fasta files, you may want to combine them with the ncRNA fasta file to get the full catalog of transcripts.
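Combining the two files amounts to concatenating them. A minimal sketch, assuming both inputs are gzipped FASTA files: the helper name `combine_fasta` and the toy file names are made up for illustration, and the demo writes tiny stand-in files to a temporary directory rather than using real Ensembl downloads.

```python
import gzip
import tempfile
from pathlib import Path


def combine_fasta(inputs, output):
    """Concatenate gzipped FASTA files into a single gzipped FASTA (hypothetical helper)."""
    with gzip.open(output, "wt") as out:
        for path in inputs:
            with gzip.open(path, "rt") as src:
                for line in src:
                    # Guard against a final line without a newline, which would
                    # otherwise glue a sequence onto the next file's first header.
                    out.write(line if line.endswith("\n") else line + "\n")


with tempfile.TemporaryDirectory() as tmp:
    cdna = Path(tmp) / "toy.cdna.fa.gz"
    ncrna = Path(tmp) / "toy.ncrna.fa.gz"
    combined = Path(tmp) / "toy.cdna_ncrna.fa.gz"

    # Toy records; the first file deliberately lacks a trailing newline.
    with gzip.open(cdna, "wt") as handle:
        handle.write(">tx_coding\nACGT")
    with gzip.open(ncrna, "wt") as handle:
        handle.write(">tx_noncoding\nGGCC\n")

    combine_fasta([cdna, ncrna], combined)

    with gzip.open(combined, "rt") as handle:
        combined_text = handle.read()

print(combined_text)
```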

csoneson avatar Dec 09 '20 16:12 csoneson

Yes, agree with @csoneson on adding the non-coding, and tximeta supports the combination of cDNA + ncRNA sequences from Ensembl.

mikelove avatar Dec 09 '20 16:12 mikelove

If you supply the pipeline with just a --fasta and --gtf, and provided that the non-coding transcripts are defined in the GTF file (I believe they are in the ENSEMBL GTF), then the pipeline will automatically create a transcript FASTA using gffread for use where required, e.g. with Salmon.

Note: I am planning another major release before Xmas that will replace the STAR / featureCounts option with STAR / Salmon for more accurate quantification from aligned BAMs. @rob-p has been a massive help on nf-core Slack advising on the implementation side 🙇🏽

The pipeline will still offer a pure pseudo-alignment route with Salmon and possibly Kallisto in the near future.

drpatelh avatar Dec 09 '20 17:12 drpatelh

Hi @mjsteinbaugh ! Just tidying up issues before tomorrow's release. Is this ok to close?

drpatelh avatar Dec 14 '20 17:12 drpatelh

OK to close

mjsteinbaugh avatar Dec 14 '20 18:12 mjsteinbaugh

So what is the current best recommendation: supplying genome FASTA, GTF, BED and STAR index, or just sticking with GRCh37 to circumvent the warning?

Reference genome options
  genome     : GRCh37
  fasta      : s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa
  gtf        : s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf
  gene_bed   : s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed
  star_index : s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/

animesh avatar Nov 29 '23 15:11 animesh

We now have clear recommendations for this.

iGenomes content is now outdated beyond the point of usefulness for RNA-seq. The iGenomes GRCh38 in particular has given me personally many support headaches in nf-core/differentialabundance, using as it does non-unique gene symbols as identifiers.

So, until such time as we provide a modernised approach to the iGenomes config (there are nf-core community efforts in this regard), we are recommending against use of the --genome parameter in favour of the guidelines above.
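To see whether an annotation suffers from the non-unique gene symbol problem mentioned above, one rough check is to look for symbols that map to more than one gene ID. The sketch below assumes a GTF whose attribute column carries both `gene_id` and `gene_name` (as Ensembl-style GTFs do); the helper name and the toy records are invented for illustration, and a GTF that uses the symbol itself as `gene_id` would need a different check.

```python
import re
from collections import defaultdict

_ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def ambiguous_symbols(gtf_lines):
    """Return gene symbols that are attached to more than one gene_id."""
    ids_by_symbol = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        attrs = dict(_ATTRIBUTE.findall(fields[8]))
        if "gene_id" in attrs and "gene_name" in attrs:
            ids_by_symbol[attrs["gene_name"]].add(attrs["gene_id"])
    return {
        symbol: sorted(ids)
        for symbol, ids in ids_by_symbol.items()
        if len(ids) > 1
    }


# Toy records; IDs and symbols are invented.
toy_gtf = [
    '1\ttoy\tgene\t100\t200\t.\t+\t.\tgene_id "G0001"; gene_name "DUP1";',
    'X\ttoy\tgene\t300\t400\t.\t-\t.\tgene_id "G0002"; gene_name "DUP1";',
    '2\ttoy\tgene\t500\t600\t.\t+\t.\tgene_id "G0003"; gene_name "UNIQ1";',
]

duplicates = ambiguous_symbols(toy_gtf)
print(duplicates)
```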

pinin4fjords avatar May 31 '24 09:05 pinin4fjords