Add cui2vec embeddings

Open souravsingh opened this issue 6 years ago • 22 comments

The embeddings for over 100k medical concepts, using data from 60 million patients, 1.7 million journal articles and 20 million notes, are up and available here: https://figshare.com/s/00d69861786cd0156d81

Explorer available here: http://ec2-52-14-191-192.us-east-2.compute.amazonaws.com:1234/

souravsingh avatar Apr 06 '18 17:04 souravsingh

Nice find!

piskvorky avatar Apr 07 '18 05:04 piskvorky

Additional information:

  • license: CC BY 4.0
  • paper: https://arxiv.org/abs/1804.01486

menshikh-iv avatar Apr 07 '18 05:04 menshikh-iv

Hey this is my paper, how cool! I'd be happy to contribute these, let me know if they need any clean up first.

beamandrew avatar Apr 16 '18 15:04 beamandrew

Oh, hi @beamandrew, glad to see you here! Please follow the instructions at https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model

menshikh-iv avatar Apr 16 '18 16:04 menshikh-iv

Will do! It might be a couple weeks until I can get it together. I'm teaching a deep learning class right now that won't end until May which keeps me pretty busy.

I'm actually having them use the embeddings from this repo in class to build an RNN (which is how I ended up finding this issue).

You can check it out here if you're interested: https://colab.research.google.com/drive/1JsdhsiJQP5JPEEGWWFtOMpQajBj4w1KA

beamandrew avatar Apr 16 '18 16:04 beamandrew

@beamandrew can you give read access for [email protected] please (I can't open your link, lack of permissions)?

menshikh-iv avatar Apr 17 '18 04:04 menshikh-iv

Oops, try this link which should let you view: https://drive.google.com/file/d/1WuoHWf1KyFsNiilbVa7qnKkSDALfch01/view?usp=sharing

beamandrew avatar Apr 17 '18 11:04 beamandrew

Last I checked, the actual concept names aren't included in this dataset and/or aren't under the same license, but they are available from a different source which looks legitimately released. I have, in fact, a task to correlate them. Without this correlation, the embeddings discussed here include arbitrary codes instead of the original (concept) words that you see in the online demo.

matanox avatar May 22 '18 09:05 matanox

I currently have some data that will allow for this mapping as @matanster describes from the author of this publication (Section 2).

If anyone is interested I can upload a link to this as I sit next to the author and he has given his permission @jimmyoentung.

hscells avatar May 23 '18 06:05 hscells

Thanks guys.

What we want is for users who download this dataset to be able to use it easily.

If the dataset requires users to jump through hoops, it's not a good fit for gensim-data. The experience of applying / using a dataset has to be streamlined and intuitive, including access and code (not just data). That is why we created this repo, and it's a mandatory part of each new contribution.

@hscells and @matanster what does this extra step mean for users? Can we somehow integrate it directly, so it's transparent to people who want to use cui2vec? Is it necessary?

piskvorky avatar May 23 '18 08:05 piskvorky

The CUI in cui2vec stands for Concept Unique Identifier. A CUI is an identifier for all of the types of synonyms for a particular medical string.

The dataset which I described in my comment is a mapping of CUI to the most commonly used string in the UMLS meta-thesaurus. One may simply replace the CUIs in the pre-trained vector file with terms from this mapping file (although I believe not all CUIs are mapped because the semantic types of the strings were filtered in this particular dataset).

One may use QuickUMLS or MetaMap to map a term to a CUI, then using the method described above map the CUI to the most commonly used term in UMLS or MetaMap.

I'm not exactly sure how the demo in the OP is mapping CUIs to strings, but I believe this is most likely how it would be done. In terms of how it could be integrated, @piskvorky: the original data could be modified, or this mapping could be performed in a separate step. However, as I said, because of the one-to-many relationship between a CUI and the strings associated with that concept, this mapping would preferably be performed as two separate steps.
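As a rough illustration of the mapping step described above, here is a minimal sketch. It assumes the vectors and the CUI-to-term mapping have already been read into plain dictionaries; the function name `attach_terms`, the toy CUIs and the toy term are all hypothetical and not taken from the actual mapping file. Because one concept can have many strings, the sketch keeps the CUI as the key and attaches a single display term to it, rather than overwriting the CUI.

```python
def attach_terms(vectors, cui_to_term):
    """Pair each CUI's vector with a display term, keeping the CUI as the key.

    vectors: dict mapping CUI -> list of floats
    cui_to_term: dict mapping CUI -> preferred term (hypothetical mapping contents)
    Returns (labelled, unmapped): labelled maps CUI -> (term, vector),
    unmapped lists the CUIs that had no entry in cui_to_term.
    """
    labelled = {}
    unmapped = []
    for cui, vec in vectors.items():
        term = cui_to_term.get(cui)
        if term is None:
            unmapped.append(cui)
            term = cui  # fall back to the raw CUI so no vector is dropped
        labelled[cui] = (term, vec)
    return labelled, unmapped


# Toy data, purely illustrative: made-up CUIs, vectors and label.
vectors = {"C0000001": [0.1, 0.2], "C0000002": [0.3, 0.4]}
cui_to_term = {"C0000001": "example term"}

labelled, unmapped = attach_terms(vectors, cui_to_term)
```

Returning the unmapped CUIs separately makes it easy to see how much of the vocabulary the mapping file does not cover.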

hscells avatar May 23 '18 22:05 hscells

No problem, as long as the process is clearly described to users, and the dataset is ready to use out of the box.

piskvorky avatar May 24 '18 07:05 piskvorky

Just curious, any progress on this issue?

juancq avatar Aug 06 '18 01:08 juancq

Hi, does anybody know if the 'cui2vec' dataset is available? @souravsingh shared the vectors as a CSV, but I don't know how to load that in gensim and start using it. Can anyone help me, or tell me when the dataset will be ready?

andresrosso avatar Dec 05 '18 00:12 andresrosso

@souravsingh can I load the CSV in gensim?

Can you tell me how to do that?

andresrosso avatar Dec 05 '18 00:12 andresrosso

Hi everyone,

I am lead author on this paper. Apologies for the radio silence on this request. We are currently working on a revision to the paper/approach that we hope to release this month. I will check back in and try to make it gensim compatible at that time.

beamandrew avatar Dec 05 '18 01:12 beamandrew

@juancq @andresrosso sorry for the wait, I can't say when this will be added. BTW, you can always load it manually (without api.load, just read the file from disk or S3).
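A minimal sketch of what reading the file manually could look like, using only the Python standard library. It assumes the CSV has a header row and that the first column of each remaining row is the CUI followed by the vector values; the helper names (`load_vectors`, `cosine`, `most_similar`) and the toy rows are hypothetical, not part of gensim or of the released data.

```python
import csv
import io
import math


def load_vectors(fileobj):
    """Read rows of "<CUI>,<v1>,<v2>,..." into a dict, skipping the header row."""
    reader = csv.reader(fileobj)
    next(reader)  # assumption: the first row is a header, not a vector
    return {row[0]: [float(x) for x in row[1:]] for row in reader if row}


def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def most_similar(vectors, key, topn=3):
    """Return the topn (CUI, similarity) pairs closest to `key`, best first."""
    query = vectors[key]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != key]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy stand-in for the downloaded file: made-up CUIs and 2-dimensional vectors.
sample = io.StringIO(
    '"","V1","V2"\n'
    '"C0000001",1.0,0.0\n'
    '"C0000002",0.9,0.1\n'
    '"C0000003",0.0,1.0\n'
)
vectors = load_vectors(sample)
nearest = most_similar(vectors, "C0000001", topn=1)
```

With the real download, the same `load_vectors` call would take a file opened in text mode instead of the `io.StringIO` object. A brute-force scan like this is slow on 100k concepts, so it is only meant for a quick look at the data.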

menshikh-iv avatar Dec 14 '18 11:12 menshikh-iv

@beamandrew great, thanks!

menshikh-iv avatar Dec 14 '18 11:12 menshikh-iv

Is there any model using SNOMED CT data?

prabhatM avatar Jan 20 '19 07:01 prabhatM

Hi everyone,

I am lead author on this paper. Apologies for the radio silence on this request. We are currently working on a revision to the paper/approach that we hope to release this month. I will check back in and try to make it gensim compatible at that time.

Please share the source code for the evaluation metrics used in this work. I would like to evaluate my own embeddings trained on EHRs. Thanks in advance.

Dhanachandra avatar Mar 18 '19 08:03 Dhanachandra

@andresrosso Here are the steps for loading cui2vec in gensim:

  1. Download the pre-trained embeddings from the download URL mentioned at http://cui2vec.dbmi.hms.harvard.edu/

  2. Dump the embeddings into a text file in word2vec format, in these two steps:

  • Load the CSV into a pandas DataFrame.

    import pandas as pd
    import numpy as np

    # The first CSV column holds the CUIs, so use it as the index.
    with open('cui2vec_pretrained.csv') as fd:
        cui2vec_df = pd.read_csv(fd, index_col=0)

  • Dump the embeddings (loaded in the DataFrame) into a text file.

    # word2vec text format: a "<count> <dimensions>" header line, then one "<CUI> <values...>" row per vector.
    np.savetxt('cui2vec_pretrained.txt', cui2vec_df.reset_index().values,
               delimiter=" ", comments="",
               header="{} {}".format(len(cui2vec_df), len(cui2vec_df.columns)),
               fmt=["%s"] + ["%.18e"] * len(cui2vec_df.columns))

  3. Load the word vectors using gensim.models.keyedvectors.KeyedVectors.

    from gensim.models.keyedvectors import KeyedVectors

    word_vectors = KeyedVectors.load_word2vec_format('cui2vec_pretrained.txt', binary=False)

    # An example
    word_vectors.most_similar('C0034079')

Source: https://stackoverflow.com/questions/46297740/how-to-turn-embeddings-loaded-in-a-pandas-dataframe-into-a-gensim-model (Ken Syme's answer)
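Before loading the converted file into gensim, it can help to confirm that it really follows the word2vec text layout: a header line with the vector count and dimensionality, followed by one space-separated row per vector. The checker below is a minimal sketch using only the standard library; the function name `check_word2vec_text` and the sample rows are hypothetical.

```python
import io


def check_word2vec_text(fileobj):
    """Return (count, dims) if the header matches the rows that follow, else raise ValueError."""
    header = fileobj.readline().split()
    if len(header) != 2:
        raise ValueError("expected a '<count> <dimensions>' header line")
    count, dims = int(header[0]), int(header[1])

    rows = 0
    for line_no, line in enumerate(fileobj, start=2):
        parts = line.rstrip("\n").split(" ")
        # Each row is the key followed by exactly `dims` values.
        if len(parts) != dims + 1:
            raise ValueError(
                "line {}: expected {} values, found {}".format(line_no, dims, len(parts) - 1)
            )
        rows += 1

    if rows != count:
        raise ValueError("header declares {} vectors, file has {}".format(count, rows))
    return count, dims


# Toy file contents, purely illustrative.
sample = io.StringIO("2 3\nC0000001 0.1 0.2 0.3\nC0000002 0.4 0.5 0.6\n")
result = check_word2vec_text(sample)
```

Running the same function on the converted text file, opened in text mode, catches a wrong header or a ragged row early.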

kaushikacharya avatar Sep 25 '19 14:09 kaushikacharya

Great work, thanks a lot.

andresrosso avatar Sep 25 '19 15:09 andresrosso