Athena
Review and fix "required" and "default" flags for vocab download
A recent forum post highlighted the issue that some vocabularies are set to "required" and as such will always be part of a download bundle and cannot be deselected. In particular among these, the vocabularies Korean Revenue Code, OSM and SOPT seemed a little off to be indispensable for an OMOP CDM.
These are the ones currently marked as "OMOP required":
vocabulary_id_v5 |
---|
CDM |
Cohort Type |
Concept Class |
Condition Status |
Condition Type |
Cost |
Cost Type |
Death Type |
Device Type |
Domain |
Drug Type |
Episode |
Korean Revenue Code |
Meas Type |
Metadata |
None |
Note Type |
Observation Type |
Obs Period Type |
OSM |
Plan |
Plan Stop Reason |
Procedure Type |
Relationship |
SOPT |
Sponsor |
Type Concept |
UB04 Point of Origin |
UB04 Pri Typ of Adm |
UB04 Pt dis status |
UB04 Typ bill |
UCUM |
US Census |
Visit |
Visit Type |
Vocabulary |
I guess we can remove all the ones with a "Type" in their name except for the new Type Concepts as they have replaced them. The respective concepts per vocabulary ID could probably also be retired.
There was also the notion to mark more vocabularies as default that have standard concepts.
Here are the ones with standard concepts or classifications and their respective count together with a proposal how to set the default and required flags:
vocabulary_id | description | S / C | row_count | default now | default future | required now | required future |
---|---|---|---|---|---|---|---|
ABMS | Provider Specialty (American Board of Medical Specialties) | S | 85 | X | X | ||
AMT | Australian Medicines Terminology (NEHTA) | S | 6839 | ||||
APC | Ambulatory Payment Classification (CMS) | S | 715 | ||||
ATC | WHO Anatomic Therapeutic Chemical Classification | C | 6509 | X | X | ||
BDPM | Public Database of Medications (Social-Sante) | S | 1106 | ||||
Cancer Modifier | Diagnostic Modifiers of Cancer (OMOP) | S | 3251 | ||||
CDM | OMOP Common DataModel | S | 1045 | X | X | X | |
CDT | Current Dental Terminology (ADA) | S | 869 | ||||
CMS Place of Service | Place of Service Codes for Professional Claims (CMS) | S | 51 | X | X | ||
Cohort | Legacy OMOP HOI or DOI cohort | C | 78 | ||||
Condition Status | OMOP Condition Status | S | 22 | X | X | X | |
Cost | OMOP Cost | S | 51 | X | X | ||
CPT4 | Current Procedural Terminology version 4 (AMA) | C | 3492 | X | X | ||
CPT4 | Current Procedural Terminology version 4 (AMA) | S | 12922 | X | X | ||
Currency | International Currency Symbol (ISO 4217) | S | 180 | X | X | ||
CVX | CDC Vaccine Administered CVX (NCIRD) | S | 217 | ||||
DA_France | Disease Analyzer France (IQVIA) | S | 6366 | ||||
dm+d | Dictionary of Medicines and Devices (NHS) | S | 21071 | ||||
DRG | Diagnosis-related group (CMS) | S | 752 | ||||
EphMRA ATC | Anatomical Classification of Pharmaceutical Products (EphMRA) | C | 895 | ||||
Episode | OMOP Episode | S | 14 | X | X | X | |
ETC | Enhanced Therapeutic Classification (FDB) | C | 2755 | ||||
Ethnicity | OMOP Ethnicity | S | 2 | X | X | ||
Gemscript | Gemscript (Resip) | S | 64761 | ||||
Gender | OMOP Gender | S | 2 | X | X | ||
GGR | Commented Drug Directory (BCFI) | S | 751 | ||||
GRR | Global Reference Repository (IQVIA) | S | 138739 | ||||
HCPCS | Healthcare Common Procedure Coding System (CMS) | S | 8427 | X | X | ||
HemOnc | HemOnc | C | 367 | ||||
HemOnc | HemOnc | S | 2015 | ||||
HES Specialty | Hospital Episode Statistics Specialty (NHS) | S | 57 | ||||
ICD10PCS | ICD-10 Procedure Coding System (CMS) | S | 194874 | ||||
ICD9Proc | International Classification of Diseases, Ninth Revision, Clinical Modification, Volume 3 (NCHS) | S | 2223 | X | X | ||
ICDO3 | International Classification of Diseases for Oncology, Third Edition (WHO) | S | 56972 | ||||
Indication | Indications and Contraindications (FDB) | C | 4739 | ||||
ISBT | Information Standard for Blood and Transplant 128 Product (ICCBBA) | S | 17336 | ||||
ISBT Attribute | Information Standard for Blood and Transplant 128 Product Attribute (ICCBBA) | C | 1657 | ||||
JMDC | Japan Medical Data Center Drug Code (JMDC) | S | 1313 | ||||
KDC | Korean Drug Code (HIRA) | S | 112 | ||||
KNHIS | Korean National Health Information System | S | 3 | ||||
Korean Revenue Code | Korean Revenue Code | S | 7 | X | |||
LOINC | Logical Observation Identifiers Names and Codes (Regenstrief Institute) | C | 48305 | X | X | ||
LOINC | Logical Observation Identifiers Names and Codes (Regenstrief Institute) | S | 110702 | X | X | ||
LPD_Australia | Longitudinal Patient Data Australia (IQVIA) | S | 1620 | ||||
MDC | Major Diagnostic Categories (CMS) | S | 26 | ||||
MedDRA | Medical Dictionary for Regulatory Activities (MSSO) | C | 76939 | ||||
Medicare Specialty | Medicare provider/supplier specialty codes (CMS) | S | 112 | X | X | ||
Metadata | Metadata | S | 1 | X | X | X | |
MMI | Modernizing Medicine (MMI) | S | 4 | ||||
NAACCR | Data Standards & Data Dictionary Volume II (NAACCR) | S | 26105 | ||||
NCIt | NCI Thesaurus (National Cancer Institute) | S | 1899 | ||||
NDC | National Drug Code (FDA and manufacturers) | S | 11219 | X | X | ||
Nebraska Lexicon | Nebraska Lexicon | S | 4187 | ||||
NFC | New Form Code (EphMRA) | C | 692 | ||||
NUCC | National Uniform Claim Committee Health Care Provider Taxonomy Code Set (NUCC) | S | 674 | X | X | ||
OMOP Extension | OMOP Extension (OHDSI) | S | 553 | X | X | ||
OMOP Genomic | OMOP Genomic vocabulary | S | 79791 | ||||
OPCS4 | OPCS Classification of Interventions and Procedures version 4 (NHS) | S | 2373 | ||||
OSM | OpenStreetMap | S | 203339 | X | |||
PCORNet | National Patient-Centered Clinical Research Network (PCORI) | S | 2 | ||||
Plan | Health Plan - contract to administer healthcare transactions by the payer, facilitated by the sponsor | S | 11 | X | X | X | |
Plan Stop Reason | Plan Stop Reason - Reason for termination of the Health Plan | S | 13 | X | X | X | |
PPI | AllOfUs_PPI (Columbia) | S | 2120 | ||||
Provider | OMOP Provider | S | 6 | X | |||
Race | Race and Ethnicity Code Set (USBC) | S | 50 | X | X | ||
Relationship | OMOP Relationship | S | 14 | X | X | X | |
Revenue Code | UB04/CMS1450 Revenue Codes (CMS) | S | 538 | X | X | ||
RxNorm | RxNorm (NLM) | C | 35087 | X | X | ||
RxNorm | RxNorm (NLM) | S | 148139 | X | X | ||
RxNorm Extension | RxNorm Extension (OHDSI) | S | 1819247 | X | X | ||
SMQ | Standardised MedDRA Queries (MSSO) | C | 318 | ||||
SNOMED | Systematic Nomenclature of Medicine - Clinical Terms (IHTSDO) | S | 540590 | X | X | ||
SNOMED Veterinary | SNOMED Veterinary | S | 31994 | ||||
SOPT | Source of Payment Typology (PHDSC) | S | 162 | X | X | X | |
SPL | Structured Product Labeling (FDA) | C | 573209 | X | X | ||
Sponsor | Sponsor - institution or individual financing healthcare transactions | S | 6 | X | X | X | X |
Type Concept | OMOP Type Concept | S | 79 | X | X | X | |
UB04 Pri Typ of Adm | UB04 Claim Inpatient Admission Type Code (CMS) | S | 6 | X | X | ||
UB04 Typ bill | UB04 Type of Bill - Institutional (USHIK) | S | 4 | X | X | ||
UCUM | Unified Code for Units of Measure (Regenstrief Institute) | S | 922 | X | X | X | |
UK Biobank | UK Biobank | C | 292 | ||||
UK Biobank | UK Biobank | S | 3837 | ||||
US Census | United States Census Bureau | S | 13 | X | X | ||
Visit | OMOP Visit | S | 19 | X | X | X |
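Counts like the ones in the table can be reproduced from a loaded vocabulary. The following is a minimal sketch, assuming the OMOP `concept` table (with its `vocabulary_id` and `standard_concept` columns) sits in a SQLite database. The in-memory database, the sample rows and the helper name `count_standard_and_classification` are illustrative assumptions, not part of Athena.

```python
import sqlite3


def count_standard_and_classification(conn):
    """Return (vocabulary_id, standard_concept, row_count) for S and C concepts."""
    query = """
        SELECT vocabulary_id, standard_concept, COUNT(*) AS row_count
        FROM concept
        WHERE standard_concept IN ('S', 'C')
        GROUP BY vocabulary_id, standard_concept
        ORDER BY vocabulary_id, standard_concept
    """
    return conn.execute(query).fetchall()


# Illustrative rows only; a real run would point at a loaded vocabulary.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept (concept_id INTEGER, vocabulary_id TEXT, standard_concept TEXT)"
)
conn.executemany(
    "INSERT INTO concept VALUES (?, ?, ?)",
    [(1, "Gender", "S"), (2, "Gender", "S"), (3, "ATC", "C"), (4, "ICD10CM", None)],
)
counts = count_standard_and_classification(conn)
print(counts)
```

Vocabularies that appear twice in the table (CPT4, LOINC, RxNorm and others) would come back as two rows here as well, one per `standard_concept` value.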
Please review @cgreich and @fdefalco !
Thanks - mik
I agree with most of the entries in the table. Ones I would question if they should be required in the future:
- US Census
- UB04 Typ bill
- UB04 Pri Typ of Adm
- Sponsor
In the interest of transparency, we could also remove the idea of "Required". A note on the page would mark a vocabulary as "Highly Recommended" when it is what we currently consider "Required", while still affording the user the opportunity to deselect it.
Then we would only have a boolean for "Default" for each vocabulary that can be edited by the user when creating their vocabulary download.
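A rough sketch of what this could look like as a data model, assuming one `default` boolean per vocabulary plus a purely informational "Highly Recommended" note. The class, the function and the flag values are hypothetical and do not reflect Athena's actual code or the flags proposed in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabularyOption:
    vocabulary_id: str
    default: bool                     # pre-selected when the download form opens
    highly_recommended: bool = False  # shown as a note only, never enforced


def build_selection(options, deselected=(), added=()):
    """Apply the user's changes to the defaults; return (selected ids, warnings)."""
    known = {o.vocabulary_id for o in options}
    selected = {o.vocabulary_id for o in options if o.default}
    selected |= set(added) & known
    selected -= set(deselected)
    warnings = [
        f"{o.vocabulary_id} is highly recommended but was deselected"
        for o in options
        if o.highly_recommended and o.vocabulary_id not in selected
    ]
    return selected, warnings


# Flag values are illustrative only.
options = [
    VocabularyOption("CDM", default=True, highly_recommended=True),
    VocabularyOption("SNOMED", default=True),
    VocabularyOption("OSM", default=False),
]
selected, warnings = build_selection(options, deselected={"CDM"}, added={"OSM"})
print(sorted(selected))
print(warnings)
```

Nothing is ever forced into the bundle in this sketch; deselecting a highly recommended vocabulary only produces a warning that the page could display.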
Hi, thanks for the input, @fdefalco. I think we have to keep "required" for very foundational data that a CDM needs in order to function (CDM, Metadata, a couple of others). The ones you listed we would keep in for ease of use (they are small and in most cases needed), as it would not really make sense NOT to load them. @cgreich, can you provide more input? Thanks - Mik
Is there a timeline for implementation of this particular feature?
I had hoped @cgreich would give us his final "placet". I would then hand over the above list for processing by the vocab team, and it should go to Athena with the next release.
bumping up this issue, @cgreich and @fdefalco
What is the verdict?
I would also add the CVX vocabulary to default. And we have that funny OMOP supplier vocabulary with one non-standard concept in it... Do we need that?
@ssuvorov-fls - could you check whether the above new settings would break anything once they end up in Athena? Can we test run this in any QA instance?
> Korean Revenue Code, OSM and SOPT seemed a little off to be indispensable for an OMOP CDM

I think the unspoken convention was to include everything that belongs to a Domain lacking its own table, so that you don't miss the concepts for such "service" things as gender_concept_id, unit_concept_id, modifier_concept_id, route_concept_id, etc. After all, it's not really obvious which vocabularies to pick if you want to add one more table/domain to your CDM. Region_concept_id somehow didn't materialize into a field, but it explains why OSM and US Census are there.
> I guess we can remove all the ones with a "Type" in their name except for the new Type Concepts as they have replaced them

I wouldn't do it, because users who are updating their ETLs from some old vocabulary versions will just lose the concepts that appear in their mappings. I would never do it for the small "service" vocabularies.
> Here are the ones with standard concepts or classifications and their respective count together with a proposal how to set the default and required flags

I didn't get the logic behind this. How is gender more important than race? And why is Sponsor better than a geography? We need to come up with clear rules.
> There was also the notion to mark more vocabularies as default that have standard concepts

I don't think it's a great choice before we have cleaned up the EAV data. Otherwise, people will start mapping to UKB, PPI and NAACCR. And that's already the case.
> I think the unspoken convention was to include everything that belongs to a Domain lacking its own table, so that you don't miss the concepts for such "service" things as gender_concept_id, unit_concept_id, modifier_concept_id, route_concept_id, etc. After all, it's not really obvious which vocabularies to pick if you want to add one more table/domain to your CDM. Region_concept_id somehow didn't materialize into a field, but it explains why OSM and US Census are there.

OSM is, however, one of the reasons this whole discussion started... I guess I would still take it out of "required".
> I wouldn't do it, because users who are updating their ETLs from some old vocabulary versions will just lose the concepts that appear in their mappings. I would never do it for the small "service" vocabularies.

Hmm... have we mapped the old type concepts over to the new ones? If so, it would make sense to keep them. But otherwise, aren't they simply useless now and all non-standard?
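The mapping question could be checked directly against the vocabulary tables. Here is a minimal sketch, assuming the OMOP `concept` and `concept_relationship` tables are loaded in SQLite and that a mapping is expressed as a 'Maps to' relationship pointing at a concept in the Type Concept vocabulary. The helper name, the subset of legacy vocabularies and the concept IDs in the sample data are made up for illustration.

```python
import sqlite3

# Subset of the legacy "Type" vocabularies, for illustration.
LEGACY_TYPE_VOCABULARIES = ("Condition Type", "Drug Type", "Visit Type")


def unmapped_legacy_type_concepts(conn, vocabularies=LEGACY_TYPE_VOCABULARIES):
    """Return (concept_id, vocabulary_id) for legacy type concepts with no
    'Maps to' relationship pointing into the Type Concept vocabulary."""
    placeholders = ", ".join("?" for _ in vocabularies)
    query = f"""
        SELECT c.concept_id, c.vocabulary_id
        FROM concept AS c
        WHERE c.vocabulary_id IN ({placeholders})
          AND NOT EXISTS (
              SELECT 1
              FROM concept_relationship AS cr
              JOIN concept AS target ON target.concept_id = cr.concept_id_2
              WHERE cr.concept_id_1 = c.concept_id
                AND cr.relationship_id = 'Maps to'
                AND target.vocabulary_id = 'Type Concept'
          )
        ORDER BY c.concept_id
    """
    return conn.execute(query, tuple(vocabularies)).fetchall()


# Made-up concept IDs; a real check would run against a loaded vocabulary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concept (concept_id INTEGER, vocabulary_id TEXT)")
conn.execute(
    "CREATE TABLE concept_relationship "
    "(concept_id_1 INTEGER, concept_id_2 INTEGER, relationship_id TEXT)"
)
conn.executemany(
    "INSERT INTO concept VALUES (?, ?)",
    [(1, "Condition Type"), (2, "Drug Type"), (100, "Type Concept")],
)
conn.execute("INSERT INTO concept_relationship VALUES (1, 100, 'Maps to')")
unmapped = unmapped_legacy_type_concepts(conn)
print(unmapped)
```

An empty result would suggest the old type concepts are all mapped and worth keeping for users who upgrade older ETLs; a long list would support retiring them.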
> I didn't get the logic behind this. How is gender more important than race? And why is Sponsor better than a geography? We need to come up with clear rules.

Well, this is derived a little from how it was before. Gender is really indispensable, whereas Race & Ethnicity is, as we know, US-centric... and they are still marked as default, so most people will keep them in their download. They just have a choice to deselect.
> > There was also the notion to mark more vocabularies as default that have standard concepts
>
> I don't think it's a great choice before we have cleaned up the EAV data. Otherwise, people will start mapping to UKB, PPI and NAACCR. And that's already the case.

Of course we would not follow that notion blindly, and hence the above are not marked as default. But you cannot prevent people from selecting them for download, unless we made them something like license-restricted (only with some restriction other than a license).
The original intent of the discussion was to promote transparency and flexibility in vocabulary download. As it stands, vocabularies that are not listed or selected are included in the download, so for transparency, they should be listed and selected by default. For flexibility, the user can have the option to unselect vocabularies. I'm not sure what benefit preventing a user from unselecting a vocabulary would provide. If you reject defaults, you should be doing so for a well-understood reason. Perhaps a warning on the page that says 'Default vocabularies are selected to provide important concepts to most ETL processes, remove them from the selected vocabularies at your own risk.' :)
@cgreich has an even stricter view on this. I think he used the word "dogmatic". Let's hear him out. (Christian, one exception to the rule should be vocabularies that have standard items but are also license restricted such as CDT or ISBT).
I think Patrick echoed my concern on transparency here: https://forums.ohdsi.org/t/osm-vocabulary/16303/11
Are we debating here or there?
We are discussing the changes to be made as part of this issue here, informed by the conversation there. I don't think there is any debate regarding the need for transparency of vocabularies that are included in a download. I imagine the remaining debate is whether or not to give the user the ability to control whether 'default' vocabularies are included. My vote is that the user is provided control, with a stern warning about why defaults should be left as is.
@fdefalco:
Hang on a sec. Right now, the thinking is we have three categories (not two):
- Default. This is what everybody always has to have, since it is part of the OMOP CDM. These are the standard and classification concepts. Not vocabularies.
- Recommended. These are vocabularies (beyond their standard concepts) which are pre-clicked, but can be unclicked. The problem is what they are. It strongly depends on the geography of the data source what should be recommended. In the US, NDC would be recommended (only the devices are standard, until we figure that out), but, say, in France NDC is a huge corpus of useless concepts.
- Rest. These are the vocabularies that are not recommended. They are not checked, but could be clicked.
The proprietary vocabularies are in the Rest category, since they need to be individually clicked and processed anyway.
We will have to change Athena to always include all standard concepts (easy), and create different sets of recommended vocabularies (North America, Europe, Rest of World maybe). Not a big deal, but will require some work.
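The three categories above can be sketched as a simple inclusion rule. This is a sketch under stated assumptions, not Athena's implementation: the regional sets are hypothetical (only the NDC contrast between the US and France comes from this thread), and the function names are made up.

```python
# Hypothetical regional sets; only the NDC contrast comes from the discussion.
RECOMMENDED_BY_REGION = {
    "North America": {"NDC"},
    "Europe": set(),
}


def preselected_vocabularies(region):
    """Vocabularies that start out checked for a region (a copy the user may edit)."""
    return set(RECOMMENDED_BY_REGION.get(region, set()))


def is_in_download(concept, selected_vocabularies):
    # Default category: standard ('S') and classification ('C') concepts always ship.
    if concept["standard_concept"] in ("S", "C"):
        return True
    # Recommended and Rest: other concepts ship only if their vocabulary is checked.
    return concept["vocabulary_id"] in selected_vocabularies


ndc_drug = {"vocabulary_id": "NDC", "standard_concept": None}
ndc_device = {"vocabulary_id": "NDC", "standard_concept": "S"}

print(is_in_download(ndc_drug, preselected_vocabularies("Europe")))
print(is_in_download(ndc_drug, preselected_vocabularies("North America")))
print(is_in_download(ndc_device, preselected_vocabularies("Europe")))
```

With this rule the standard NDC device concepts reach every download, while the non-standard bulk of NDC is only pulled in where the vocabulary is checked.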