drop icon indicating copy to clipboard operation
drop copied to clipboard

Annotating results with Ensembl Id or HGNC Id

Open Lucpen opened this issue 2 years ago • 2 comments

Hi,

Thanks for such an amazing work with DROP, it is very useful and we are using it very often to look for aberrant expression and splicing in rare disease patients.

I wanted to ask you if there is a simple way to get FRASER output annotated with either Ensembl gene ID or HGNC gene ID other than converting the provided hgncSymbol after running DROP. I am unsure if this is default behaviour or if its caused by the files we provide it, so I have attached drop_config.txt and below you can find the initial lines of the gtf file used.

Thanks for your help,

Lucía

##description: evidence-based annotation of the human genome (GRCh38), version 37 (Ensembl 103)
##provider: GENCODE
##contact: [email protected]

##format: gtf
##date: 2020-12-07
chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	12613	12721	.	+	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	13221	14409	.	+	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

Lucpen avatar Jan 16 '24 12:01 Lucpen

Hi Lucía, Hope all is good on your side! Can you send one row of your FRASER results to see how the gene looks like? I always use the Gencode gtf file for my projects. The hgncSymbol corresponds to the HGNC gene ID, or what exactly are you looking for? To add Ensembl IDs, DROP outputs the file {root}/processed_data/preprocess/{gene_annotation}/gene_name_mapping_{gene_annotation}.tsv that contains both the hgncSymbol and the Ensembl IDs and can be used to merge with splicing and expression results.

vyepez88 avatar Jan 18 '24 10:01 vyepez88

Hi Vicente!

Thanks for the fast response!

All is good, I hope it is with you too! :)

This is the first line of our results: ''' sampleID seqnames start end width strand hgncSymbol type pValue psiValue deltaPsi counts totalCounts meanCounts meanTotalCounts nonsplitCounts nonsplitProportion nonsplitProportion_99quantile annotatedJunction pValueGene padjustGene PAIRED_END INDIVIDUAL_ID DNA_ID DROP_GROUP SPLICE_COUNTS_DIR HPO_TERMS GENE_COUNTS_FILE GENE_ANNOTATION GENOME SEX LIBPREP TISSUE AFFECTED isExternal potentialImpact causesFrameshift UTR_overlap blacklist ''' hgncSymbol would be something like NCF1C and we would like to get either the hgnc id 32523 or Ensembl id ENSG00000165178. Is it possible for this to happen inside DROP as it happens for OUTRIDER (we use the same gtf and the results come with the Ensembl id)? Is this something related to the reference files we are using?

Thanks again!

Lucpen avatar Jan 18 '24 15:01 Lucpen

Hi Lucia! So sorry, I thought I had answered this. As commented at the ESHG, the workarounds for this would be:

  1. DROP uses the 'gene_id' column of the provided gtf file. Therefore, if you provide a gtf file with the hgnc ids as the main ids, the OUTRIDER tables would have those ids as unique identifiers. See this script for more info: https://github.com/gagneurlab/drop/blob/master/drop/template/Scripts/Pipeline/preprocessGeneAnnotation.R However, I advice against having numbers as gene identifiers and keeping the ensemble ids
  2. You can edit the https://github.com/gagneurlab/drop/blob/master/drop/modules/aberrant-expression-pipeline/Counting/mergeCounts.R script to add a step to merge the colData of the ods object with hgnc ids
  3. My preferred one and maybe the easiest: you can leave DROP as is and then make follow-up scripts where you add your info of preference (e.g. hgnc ids, OMIM genes, Phenotype, Family ID, Affected status, etc.).

vyepez88 avatar Jun 10 '24 13:06 vyepez88

No worries. Thanks Vicente! :D

Lucpen avatar Jun 10 '24 13:06 Lucpen