Use current HGNC names in definitions

Open ValWood opened this issue 4 years ago • 22 comments

Many terms

MONDO:0013519 A dyskeratosis congenita that has_material_basis_in an autosomal recessive mutation of NOLA2 on chromosome 5q35.3.

NOLA2 = NHP2

MONDO:0009136 Definition A dyskeratosis congenita that has_material_basis_in an autosomal dominant mutation of NOLA3 on chromosome 15q14.

NOLA3 = NOP10

It would be really nice if you could use current HGNC names in the definitions. Could this be automated? It's quite important as many names are changed because they may cause offence in a clinical setting https://blog.genenames.org/hgnc/2019/09/30/Minimising_changes/

ValWood avatar May 21 '20 15:05 ValWood

Also, it's a PINTA to need to look them up :)

ValWood avatar May 21 '20 15:05 ValWood

@ValWood Hey Val. My apologies for leaning on you for more info; I'm new here.

This sounds automatable to me. But I'd need a list of these mappings. If I have a 2 column CSV file, for example, where column 1 has the term to be replaced, and column 2 contains the replacement, then it seems to me that this would be easy from there.
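
A minimal sketch of that replacement step, assuming a two-column CSV (old symbol, replacement) with no header row. The function names are hypothetical, and the two sample rows are just the examples from the original post.

```python
import csv
import io
import re


def load_mapping(csv_text):
    """Read 'old,new' rows into a dict, skipping blank or short rows."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


def replace_symbols(text, mapping):
    """Replace whole-word occurrences of outdated symbols in text."""
    if not mapping:
        return text
    # Try longer symbols first so a short symbol never pre-empts a longer one.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, alternatives)) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text)


mapping = load_mapping("NOLA2,NHP2\nNOLA3,NOP10\n")
definition = "An autosomal recessive mutation of NOLA2 on chromosome 5q35.3."
print(replace_symbols(definition, mapping))
```

Matching on word boundaries keeps a symbol from being rewritten when it is only a substring of a longer token. It does not tell a gene symbol apart from an ordinary word that happens to be spelled the same, so the output would still need review.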

@matentzn Where would be the best place to do this work? I know exactly where I'd do it for OMIM work, but for mondo in general, where would I work on this? Is there a build process outside of the makefiles? I'm guessing not. Then, which commands that are currently run would be used to build mondo? Is it OK if I use Python to do this or is there a preference for using unix commands to do this kind of work?

joeflack4 avatar Oct 18 '21 01:10 joeflack4

Hi @joeflack4!

@sartweedie at the HGNC should be able to point you to a mapping source for human gene synonyms to current names.

ValWood avatar Oct 18 '21 07:10 ValWood

@joeflack4 Please mention this issue again on the QC call. We already have the table we need (from Monarch Dipper), so we need to talk about how we use that to dynamically update the labels. This should be done using DOSDP. It's a super useful ticket for you to do, but I would warn you that the effort is more in the XL or XXL range when you first attempt it!

matentzn avatar Oct 18 '21 09:10 matentzn

I added this to our QC call agenda

nicolevasilevsky avatar Oct 18 '21 15:10 nicolevasilevsky

Looking at this with fresh eyes after a while.

@ValWood Sorry that this has been open so long. When you say "HGNC names", I think what you mean is the "HGNC symbol", does that sound correct?

I'm just a beginner at HGNC. I know that there are HGNC IDs, which I believe don't change. And there are HGNC symbols, which do change over time. I'm not sure why this is. And I'm also not sure what % of these (a) become mapped to different genes / HGNC IDs, or what % (b) simply become deprecated and aren't used any more.

What I need and why this might be hard

The big problem here is that I don't have good data sources to work with, and I'm not sure where to look. I could imagine this working as follows: If I was able to get the (1) dates at which each of these HGNC symbols were inserted into Mondo labels (that is, the date they were effectively mapped) and, for all of these dates / date ranges, if I had (2) files that mapped those HGNC symbols to the non-changing HGNC ID for those given dates, that would be all I need. I could then use (3) current HGNC ID to symbol mappings (which are updated regularly) to get the new symbol, remap them to Mondo, and insert the new symbol into the Mondo label.
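
The lookup chain in (2) and (3) could look roughly like the sketch below. It assumes both mappings are already loaded as plain dicts; the function name and the `HGNC:PLACEHOLDER_*` IDs are made up for illustration and are not real HGNC IDs.

```python
def update_symbol(old_symbol, historical_symbol_to_id, current_id_to_symbol):
    """Resolve an old symbol to the current one via its stable HGNC ID.

    Returns None when the symbol cannot be resolved at either step.
    """
    # Step (2): symbol as it was when the text was written -> stable ID.
    hgnc_id = historical_symbol_to_id.get(old_symbol)
    if hgnc_id is None:
        return None
    # Step (3): stable ID -> symbol in the current HGNC data.
    return current_id_to_symbol.get(hgnc_id)


# Placeholder IDs, for illustration only.
historical = {"NOLA2": "HGNC:PLACEHOLDER_1", "NOLA3": "HGNC:PLACEHOLDER_2"}
current = {"HGNC:PLACEHOLDER_1": "NHP2", "HGNC:PLACEHOLDER_2": "NOP10"}

print(update_symbol("NOLA2", historical, current))
```

Going through the ID rather than mapping symbol to symbol directly matters because a retired symbol can later be reused for a different gene.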

Do we know anyone that can help? I have (3). But I don't know if it's even possible to obtain (1) or (2), or where to look. @nicolevasilevsky Without those, I'm not sure how I would even go about approaching this task. But if there is anyone we have access to that has (a) a better understanding of HGNC than I, and/or (b) someone who knows how these HGNC symbols were originally mapped to Mondo terms, that would be extremely helpful.

joeflack4 avatar Apr 22 '22 16:04 joeflack4

Yes, I mean symbol, sorry about that.

Symbols change over time for various reasons. Hopefully @sartweedie can explain more about the name changes.

ValWood avatar Apr 22 '22 17:04 ValWood

here is some background on name changes: https://pubmed.ncbi.nlm.nih.gov/32747822/

ValWood avatar Apr 22 '22 17:04 ValWood

Thanks Val! I saw the other article you also linked earlier in the issue. I'll take a look at both.

joeflack4 avatar Apr 22 '22 17:04 joeflack4

Sorry for taking a while to respond to this. On the whole HGNC try NOT to change gene symbols as it is disruptive but, as you can see in our recent guidelines that Val pointed you to, there can be good reasons to change them. It is definitely preferable to rely on linking to HGNC IDs, but I appreciate they are less user friendly in a definition. If you have a list of all human gene symbols that appear in Mondo, you can check they are valid using our https://www.genenames.org/tools/multi-symbol-checker/ to see how many symbols need updating now. Hopefully all the symbol references were 'approved' rather than aliases at the time the definition was written (there are some approved symbols that are persistently used as alternative symbols for other genes). We are currently reviewing all human symbols and marking them as stable if we are confident they won't change in future. We are prioritising disease related genes so hopefully there will be fewer changes going forward.
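
For a scripted version of the same kind of check, something like the sketch below might work offline. It is not the multi-symbol checker itself. It assumes a tab-separated table with `symbol`, `prev_symbol` and `alias_symbol` columns whose multi-valued cells are pipe-separated, loosely modelled on the HGNC download files; those column names should be verified against the real file. The `GENE1` row is invented to show the alias case.

```python
import csv
import io


def build_index(tsv_text):
    """Index approved, previous and alias symbols from an HGNC-style table."""
    approved, previous, alias = set(), {}, {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        current = row["symbol"]
        approved.add(current)
        for old in filter(None, (row.get("prev_symbol") or "").split("|")):
            previous.setdefault(old, set()).add(current)
        for other in filter(None, (row.get("alias_symbol") or "").split("|")):
            alias.setdefault(other, set()).add(current)
    return approved, previous, alias


def classify(symbol, approved, previous, alias):
    """Return (status, candidate current symbols) for one symbol."""
    if symbol in approved:
        return "approved", {symbol}
    if symbol in previous:
        return "previous", previous[symbol]
    if symbol in alias:
        return "alias", alias[symbol]
    return "unknown", set()


SAMPLE = (
    "symbol\tprev_symbol\talias_symbol\n"
    "NHP2\tNOLA2\t\n"
    "NOP10\tNOLA3\t\n"
    "GENE1\t\tG1ALIAS\n"  # invented row, only to illustrate an alias
)
approved, previous, alias = build_index(SAMPLE)
for symbol in ["NHP2", "NOLA2", "G1ALIAS", "NOTAGENE"]:
    print(symbol, classify(symbol, approved, previous, alias))
```

Checking approved symbols first reflects the caveat above: a string that is an approved symbol for one gene may also be an alias for another, so an "approved" result still does not prove the definition meant that gene.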

sartweedie avatar Apr 25 '22 13:04 sartweedie

Thanks for chiming in @sartweedie . That looks like a helpful tool that we may be able to utilize.

I have a feature request / ask. If you happen to know the dev(s) that put together that symbol-checker tool, do you think it would be possible for them to expose it as a REST API? We try to automate as much as possible, and being able to tie in this tool as an API as part of our scripts / ingest processes could be very helpful.

joeflack4 avatar Apr 25 '22 23:04 joeflack4

@joeflack4 I spoke to our developer Kris who said he will keep your request in mind and will look into implementing this feature in the future. He said storing the IDs is vital - then you could use our FTP archive files to look for changes once you are synced with our data. i.e https://www.genenames.org/download/archive/
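
Once the IDs are stored, looking for changes between two archive snapshots reduces to a comparison keyed on HGNC ID. A rough sketch, assuming each snapshot has already been parsed into a dict of ID to approved symbol; parsing the archive files is not shown, and the function name and placeholder IDs are hypothetical.

```python
def symbol_changes(old_snapshot, new_snapshot):
    """Return (hgnc_id, old_symbol, new_symbol) for IDs whose symbol changed.

    IDs missing from the new snapshot are skipped here; a real pipeline
    would probably want to report those separately.
    """
    changes = []
    for hgnc_id, old_symbol in sorted(old_snapshot.items()):
        new_symbol = new_snapshot.get(hgnc_id)
        if new_symbol is not None and new_symbol != old_symbol:
            changes.append((hgnc_id, old_symbol, new_symbol))
    return changes


# Placeholder IDs, for illustration only.
synced = {"HGNC:PLACEHOLDER_1": "NOLA2", "HGNC:PLACEHOLDER_2": "STABLE1"}
latest = {"HGNC:PLACEHOLDER_1": "NHP2", "HGNC:PLACEHOLDER_2": "STABLE1"}

print(symbol_changes(synced, latest))
```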

I notice that the gene symbols within the Term Relations are up-to-date and linked to our site via HGNC IDs, e.g. MONDO:0013519 shows hyperlinked NHP2 in the Term Relations but NOLA2 (unlinked) in the definition. Not sure if that helps you move forward... I guess it may not be trivial to know when a definition contains a gene symbol.

sartweedie avatar Apr 26 '22 11:04 sartweedie

Thanks for speaking with him. I appreciate that! Yeah, I agree that storing the IDs is vital, now that I understand more about HGNC.

@matentzn Just an FYI, you may already be aware of this, but I wanted to draw your attention to the above comment. It looks like there may be some outdated HGNC symbols / inconsistencies in Mondo itself, not just our OMIM ingest.

joeflack4 avatar Apr 26 '22 20:04 joeflack4

@joeflack4 thank you for driving discussion, please add to QC call agenda!

matentzn avatar Apr 27 '22 09:04 matentzn

Ping add to agenda. @joeflack4 explain this issue to team.

matentzn avatar Apr 08 '24 16:04 matentzn

After looking at this again, here's my rough sketch of what needs to be done here.

Ensure HGNC names (symbols) are up-to-date

  • [ ] 1. From Mondo's OMIM ingest
    • [ ] The OMIM pipeline would get data from https://www.genenames.org/download/archive/ over FTP, and ideally we would do a lookup using the stable HGNC IDs provided by OMIM to get the latest HGNC symbols, and incorporate those where needed
    • [ ] While I'm doing this, I should also look at data/hgnc/ to see if these are being utilized in any way, and remove if not.
  • [ ] 2. From wherever else in Mondo
    • [ ] 1 time correction: I suppose it will involve doing some sort of query to identify all of the classes with HGNC references and ensure, via the same FTP as in (1) for OMIM, that the names (symbols) are up-to-date via a ROBOT template.
    • [ ] Automated: Figure out if HGNC names (symbols) are going to continue entering Mondo from somewhere and, if so, make sure that whatever process is doing that utilizes the same, new lookup process we're introducing in (1) for OMIM.
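
For the one-time correction in (2), the corrected definitions could be written out as a ROBOT template along these lines. This is a sketch only: the `ID` and `A IAO:0000115` template strings are my assumption about the template row and should be checked against the ROBOT documentation, and the function name is hypothetical. Whether a row replaces or merely adds to an existing definition depends on how the template output is merged, which this does not address.

```python
import csv
import io


def build_template(corrections):
    """Build a ROBOT-template-style TSV from (mondo_curie, definition) pairs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["mondo_id", "definition"])  # human-readable header row
    writer.writerow(["ID", "A IAO:0000115"])     # template strings (assumed)
    for curie, definition in corrections:
        writer.writerow([curie, definition])
    return buffer.getvalue()


corrections = [
    ("MONDO:0013519", "Definition text with NHP2 in place of NOLA2."),
    ("MONDO:0009136", "Definition text with NOP10 in place of NOLA3."),
]
print(build_template(corrections))
```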

joeflack4 avatar Apr 08 '24 23:04 joeflack4

Now that I see you spelling this out, isn't this a duplicate of:

https://github.com/monarch-initiative/mondo/issues/7229

It seems what you are saying (and what this issue is about) is to proceed with: https://github.com/monarch-initiative/mondo/issues/7229#issuecomment-2026933750

If it's not a duplicate, maybe separate out the things that are not covered by the above that are needed to close this issue.

matentzn avatar Apr 09 '24 08:04 matentzn

On the surface, they don't seem duplicative to me. None of the tasks I put in my comment above seem to overlap with #7229. In #7229 Sabrina is talking about annotations. In this issue, we're talking about definitions. You can see some examples in the OP, e.g. NOLA2 appearing in the def for MONDO:0013519.

However, Sabrina also says in the OP "these annotations should come from OMIM only". Does this hint that we might also do the following?: We'll be removing any references to HGNC names (symbols) anywhere in Mondo.

Is that implied? If it's true, then this issue is obsolete and we can replace it with an issue "Remove HGNC refs from Mondo".

joeflack4 avatar Apr 09 '24 21:04 joeflack4

You are right in that what I was saying is not enough. But the ask is not to remove HGNC refs from Mondo, but to make sure they are correct. Maybe this is the right way to phrase it:

  1. To make sure all the gene links (which are HGNC IDs, not symbols) are up to date, we need #7229. This is not the ask of the OC, but it is an important part: if the gene ID used in Mondo is wrong, we can't make it right by changing the symbol.
  2. To make sure all the gene IDs used in Mondo get the correct associated symbol (the up-to-date one), we need the NCBI gene import to work, which is @hrshdhgd's work: https://github.com/monarch-initiative/NCBI-gene-pyobo/issues/1

matentzn avatar Apr 10 '24 08:04 matentzn

@matentzn I think we're on the same page now. I'm not familiar with https://github.com/monarch-initiative/NCBI-gene-pyobo/issues/1 , so IDK if it fully addresses OP's problem, but I'll stay tuned.

joeflack4 avatar Apr 10 '24 20:04 joeflack4

In the meeting we determined that most problems I have mentioned have been solved aside from the fact that we use outdated symbols in definitions, and @monicacecilia considers fixing this a priority.

Here is a cursory analysis on the matter, which gives you a sense of the scope of the issue:

https://docs.google.com/spreadsheets/d/1kbAiMvHz3POrb7ym8nZq00nkASOmLQAXs39ArWBBNbw/edit#gid=0

SPARQL query used
PREFIX obo: <http://purl.obolibrary.org/obo/>
prefix RO: <http://purl.obolibrary.org/obo/RO_>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix oio: <http://www.geneontology.org/formats/oboInOwl#>
prefix def: <http://purl.obolibrary.org/obo/IAO_0000115>
prefix owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Get all Mondo disease classes related to a gene via RO:0004003 or RO:0004004
# whose definition does not contain the label of that gene.
SELECT DISTINCT ?cls ?mondo_label ?definition ?gene ?gene_label WHERE
{
  VALUES ?property { RO:0004003 RO:0004004 }
  ?cls rdfs:label ?mondo_label ;
       obo:IAO_0000115 ?definition ;
       rdfs:subClassOf+ <http://purl.obolibrary.org/obo/MONDO_0000001> .

  ?cls rdfs:subClassOf [
         owl:onProperty ?property ;
         owl:someValuesFrom ?gene ] .

  ?gene rdfs:label ?gene_label .

  FILTER(!CONTAINS(STR(?definition), STR(?gene_label)))

  FILTER(!isBlank(?cls) && STRSTARTS(STR(?cls), "http://purl.obolibrary.org/obo/MONDO_"))
}

I still think we should prioritise #7229 soon (updating the gene references), but this is not a necessary condition for this issue.

Remaining action items

  • [ ] Add QC along the lines of the SPARQL I posted above (fail whenever a definition does not contain the up to date symbols). Add the usual "exception" mechanism, see the QC check for OMIMPS subclass.
  • [ ] Review the tables and fix the definitions, or add an exception.
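
The logic of such a check, exceptions included, might look like the sketch below. It assumes the rows come from a query that returns every class with its definition and current gene label (that is, without the `!CONTAINS` filter), and it models the exception list as a plain set of class IDs. Both the function name and that exception format are hypothetical stand-ins, not the existing Mondo mechanism.

```python
def find_outdated_definitions(rows, exceptions=frozenset()):
    """Return (cls, gene_label) pairs whose definition lacks the gene label.

    rows: iterable of (cls, definition, gene_label) tuples.
    exceptions: class IDs that are allowed to fail the check.
    """
    failures = []
    for cls, definition, gene_label in rows:
        if cls in exceptions:
            continue
        # Same test as the FILTER in the query: plain substring containment.
        if gene_label not in definition:
            failures.append((cls, gene_label))
    return failures


rows = [
    ("MONDO:0013519", "An autosomal recessive mutation of NOLA2.", "NHP2"),
    ("MONDO:0009136", "An autosomal dominant mutation of NOP10.", "NOP10"),
]
failures = find_outdated_definitions(rows)
print(failures)
# A QC step would fail the build whenever this list is non-empty.
```

Because the comparison is a substring test, it can pass by accident when the current symbol is contained in a longer token, so a stricter token-level match may be worth considering.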

matentzn avatar Apr 13 '24 07:04 matentzn

Please hold off on review of the gSheet file posted in the comment above for now.

twhetzel avatar Apr 16 '24 01:04 twhetzel