cui2vec

Lookup dictionary for pretrained embedding

Open victorconan opened this issue 4 years ago • 4 comments

Hi Andrew,

Do you have a lookup dictionary for the pretrained embeddings? I saw in the embedding file, the "medical concepts" are in format of "CXXXX", not sure if they are ICD codes, procedure codes or something else.

Thanks!

victorconan avatar Oct 09 '20 14:10 victorconan

Hello Victor,

I have been looking into this work recently. I think the CUI mapping files and conversion scripts can be found in the repository for embeddings: https://github.com/clinicalml/embeddings/tree/master/eval

Cheers

reality avatar Nov 05 '20 17:11 reality

> the "medical concepts" are in format of "CXXXX", not sure if they are ICD codes, procedure codes or something else

These are UMLS Concept Unique Identifiers (CUIs).

Examples from https://arxiv.org/pdf/1804.01486.pdf

> Primary condition: premature infant (CUI: C0021294)
> Comorbidity: bronchopulmonary dysplasia (CUI: C0006287)

UMLS CUIs can be browsed at https://uts.nlm.nih.gov/metathesaurus.html (N.B. you need to register first).
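To tell CUIs apart from other code systems in a file, a simple pattern check may help. This is only a minimal sketch: it assumes a CUI is the letter "C" followed by exactly seven digits, which matches the examples quoted above, and the helper name `is_cui` is made up for illustration.

```python
import re

# Assumption: a UMLS CUI is "C" followed by exactly seven digits,
# as in the examples C0021294 and C0006287 quoted above.
CUI_PATTERN = re.compile(r"C[0-9]{7}")


def is_cui(code: str) -> bool:
    """Return True if the whole string looks like a UMLS CUI."""
    return CUI_PATTERN.fullmatch(code) is not None


print(is_cui("C0021294"))  # True: a CUI from the paper
print(is_cui("P07.3"))     # False: an ICD-10-style code, not a CUI
```

A check like this only validates the format; it does not confirm that the identifier exists in the Metathesaurus.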

kaushikacharya avatar Nov 06 '20 04:11 kaushikacharya

Came across this post while looking for information on the meaning of the columns in the cui2vec_pretrained.csv file. The columns are named v1, v2 ... v500. Where can we get information on what do these 500 columns stand for?

If we were to load this csv file into a database, what kind of schema should we create? (Or does it even make sense to load this into a database in the first place?) I have read the paper (https://arxiv.org/pdf/1804.01486.pdf) multiple times but could not find any information on the structure of this pretrained csv file. Any help is greatly appreciated.

KrishnaPG avatar Dec 13 '20 14:12 KrishnaPG

> The columns are named v1, v2 ... v500. Where can we get information on what do these 500 columns stand for?

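v1, ..., v500 are the components of the 500-dimensional embedding vector for each CUI. The individual columns have no standalone meaning; only the vector as a whole (for example, its distance to other vectors) is informative.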

Quoting the paper from Section 4.1:

> The 500-dimensional word2vec style embeddings using the combined data are referred to as the cui2vec embeddings in all subsequent experiments.

Loading cui2vec: you can use gensim, as explained in https://github.com/RaRe-Technologies/gensim-data/issues/25#issuecomment-535042220
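If you would rather not depend on gensim, the file can probably be read with the standard library alone. The sketch below is illustrative only: it assumes the first column holds the CUI and the remaining columns hold the vector components, and it uses a made-up three-dimensional sample instead of the real 500-dimensional file. The function names, the sample values, and the placeholder ID C0000000 are all invented for the example.

```python
import csv
import io
import math

# Made-up 3-dimensional stand-in for cui2vec_pretrained.csv.
# Assumed layout: first column is the CUI, the rest are v1..vN.
# The numbers are invented; C0000000 is only a placeholder ID.
SAMPLE_CSV = """cui,v1,v2,v3
C0021294,0.10,0.20,0.30
C0006287,0.20,0.40,0.60
C0000000,-0.30,0.10,0.00
"""


def load_embeddings(handle):
    """Read a CSV of embeddings into a dict mapping CUI -> list of floats."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    return {row[0]: [float(x) for x in row[1:]] for row in reader if row}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


embeddings = load_embeddings(io.StringIO(SAMPLE_CSV))
similarity = cosine_similarity(embeddings["C0021294"], embeddings["C0006287"])
print(round(similarity, 3))
```

For the real file, you would pass `open("cui2vec_pretrained.csv", newline="")` instead of the `io.StringIO` object, after checking that the header and first column match the assumed layout. On the database question: since the 500 columns are only meaningful together, a two-column table (CUI plus the whole vector as an array or blob) may be a better fit than 500 separate numeric columns, though that depends on what queries you need.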

As a prerequisite, you should read about word embeddings, e.g. word2vec. That will help you understand how text is represented as vectors.

kaushikacharya avatar Dec 16 '20 05:12 kaushikacharya