cui2vec
Lookup dictionary for pretrained embedding
Hi Andrew,
Do you have a lookup dictionary for the pretrained embeddings? I saw that in the embedding file the "medical concepts" are in the format "CXXXX"; I'm not sure whether they are ICD codes, procedure codes, or something else.
Thanks!
Hello Victor,
I have been looking into this work recently. I think the CUI mapping files / conversion scripts can be found in the embeddings repository: https://github.com/clinicalml/embeddings/tree/master/eval
Cheers
the "medical concepts" are in format of "CXXXX", not sure if they are ICD codes, procedure codes or something else
These are UMLS concept unique identifiers (CUIs).
Examples from https://arxiv.org/pdf/1804.01486.pdf:
Primary condition: premature infant (CUI: C0021294)
Comorbidity: bronchopulmonary dysplasia (CUI: C0006287)
UMLS CUIs can be browsed at https://uts.nlm.nih.gov/metathesaurus.html (N.B. you need to register first).
Came across this post while looking for information on the meaning of the columns in the cui2vec_pretrained.csv file. The columns are named v1, v2, ..., v500. Where can we get information on what these 500 columns stand for?
If we were to load this csv file into a database, what kind of schema should we create? (Or does it even make sense to load this into a database in the first place?) I have read https://arxiv.org/pdf/1804.01486.pdf multiple times but could not find any information on the structure of this pretrained csv file. Any help is greatly appreciated.
The columns are named v1, v2, ..., v500. Where can we get information on what these 500 columns stand for?
v1, ..., v500 are the components of the 500-dimensional embedding vector for each CUI.
Quoting the paper from Section 4.1:
The 500-dimensional word2vec style embeddings using the combined data are referred to
as the cui2vec embeddings in all subsequent experiments.
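To make the structure concrete, here is a minimal sketch that reads the file with pandas and treats each row as a 500-dimensional vector. It assumes the first column of cui2vec_pretrained.csv holds the CUI string and the remaining 500 columns are v1..v500, and that the two CUIs from the example above are present; adjust the file name and index_col if your copy differs.

```python
import numpy as np
import pandas as pd

# Assumption: first column = CUI, remaining 500 columns = v1..v500.
df = pd.read_csv("cui2vec_pretrained.csv", index_col=0)

vec = df.loc["C0006287"].to_numpy()  # 500-dimensional embedding for one CUI
print(vec.shape)                     # (500,)

# Cosine similarity between two concepts as a quick sanity check.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(df.loc["C0021294"].to_numpy(), df.loc["C0006287"].to_numpy()))
```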
Loading cui2vec: You can use gensim as explained in https://github.com/RaRe-Technologies/gensim-data/issues/25#issuecomment-535042220
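Along those lines, here is a minimal sketch (file names and the CSV layout assumed as above, not taken verbatim from the linked issue) that converts the CSV to the plain-text word2vec format and loads it with gensim's KeyedVectors:

```python
import pandas as pd
from gensim.models import KeyedVectors

# Assumption: CUI in the first column, v1..v500 in the remaining columns.
df = pd.read_csv("cui2vec_pretrained.csv", index_col=0)

# Write the embeddings in the plain-text word2vec format gensim can read:
# a header line "<num_vectors> <dimensions>", then one "<CUI> <v1> ... <v500>" per line.
with open("cui2vec.w2v.txt", "w") as f:
    f.write(f"{df.shape[0]} {df.shape[1]}\n")
    for cui, row in df.iterrows():
        f.write(cui + " " + " ".join(map(str, row.values)) + "\n")

kv = KeyedVectors.load_word2vec_format("cui2vec.w2v.txt", binary=False)
print(kv.most_similar("C0006287", topn=5))  # nearest CUIs by cosine similarity
```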
As a prerequisite, you should read about word embeddings, e.g. word2vec. That will help you understand vector embeddings of text.