
compound synonyms and stereochemistry

Open petermr opened this issue 5 years ago • 2 comments

The compound names in table columns are frequently ambiguous. The first table is https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/thyme.tsv

Compound	Compound_dictionary_lookpup	E2.0_compound_identifiers	notes	wikidata_identifier
alpha-Thujene	(-)-alpha-thujene ; (+)-alpha-thujene	C764 ; C786	stereo-isomers of the compounds are there.	Q27121815 ; Q27121804
alpha-Pinene	alpha-Pinene	C2849	Also, stereo-isomers of the compounds are there.	Q27104380
beta-Pinene	beta-Pinene	C349	Also, stereo-isomers of the compounds are there.	
beta-Myrcene	beta-Myrcene	C345		Q424577
alpha-Phellandrene	alpha-Phellandrene	C2848		Q19606345
Carene<δ-2->	2-carene	C1720	Lookup is of '2-carene'	
D-Limonene	(+)-limonene	C792		Q27888324
beta-Phellandrene	beta-Phellandrene	C3426		Q19606727
para-Cymene	cymene	C4118	Other cymene are present as 'm-cymenene', 'dehydro-p-cymene', 'o-cymene',	Q284072
gamma-Terpinene	beta-terpinene	C355	Present as beta-terpinene	Q23057921
Terpineol	1-terpineol	C1482		Q27276701
Terpinen-4-ol	(+)-terpinen-4-ol	C795		Q27280168
Thymol			not present.	
Caryophyllene	(z)-caryophyllene ; 9-epi-(E)-caryophyllene ; alpha-caryophyllene	C1255 ; C2705 ; C2915	Stereo-isomers are present	NA ; Q27137093 ; Q1995108

petermr, Nov 07 '19 11:11

implementing

  • add <synonym> child elements to dictionary <entry> elements
  • lookup unknowns in wikidata and identify synonyms of existing entries
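As a rough illustration of the first bullet, the sketch below appends a `<synonym>` child to a matching `<entry>`. The dictionary fragment, the `term` and `wikidata` attribute names, and the helper `add_synonym` are assumptions made for the example, not the actual CEVOpen dictionary schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary fragment; attribute names are assumed for illustration.
DICTIONARY_XML = """<dictionary title="compounds">
  <entry term="cuminaldehyde" wikidata="Q419952"/>
  <entry term="alpha-pinene" wikidata="Q27104380"/>
</dictionary>"""


def add_synonym(root, term, synonym):
    """Add a <synonym> child to the <entry> whose term matches (case-insensitive).

    Returns True if a matching entry was found, False otherwise.
    A synonym that is already present is not added twice.
    """
    for entry in root.iter("entry"):
        if entry.get("term", "").lower() == term.lower():
            existing = {s.text for s in entry.findall("synonym")}
            if synonym not in existing:
                ET.SubElement(entry, "synonym").text = synonym
            return True
    return False


root = ET.fromstring(DICTIONARY_XML)
add_synonym(root, "cuminaldehyde", "cuminal")
print(ET.tostring(root, encoding="unicode"))
```

Matching on the term case-insensitively is a design choice for the sketch; matching on an identifier such as the Wikidata ID would probably be more robust once identifiers are in place.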

Will start by creating a bag of unknown terms.
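A minimal sketch of such a bag, assuming "unknown" means a table term that does not match any dictionary term after trimming and lower-casing. The function name `bag_of_unknowns` and the sample lists are hypothetical.

```python
from collections import Counter


def bag_of_unknowns(terms, known_terms):
    """Count terms that do not appear in known_terms (case-insensitive, trimmed)."""
    known = {t.strip().lower() for t in known_terms}
    bag = Counter()
    for term in terms:
        key = term.strip().lower()
        if key and key not in known:  # skip blank cells
            bag[key] += 1
    return bag


# Illustrative input: names as they might appear in table columns.
known = ["alpha-Pinene", "beta-Pinene", "cymene"]
seen = ["alpha-Pinene", "Thymol", "thymol", "para-Cymene", ""]
bag = bag_of_unknowns(seen, known)
print(bag.most_common())
```

Exact matching means that "para-Cymene" is reported as unknown even though "cymene" is in the dictionary, which is the kind of case the synonym lookup is meant to resolve.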

petermr, Nov 24 '19 18:11

analysing isomerism and synonyms

We need to sort compounds by WikidataID and PubchemCID to determine synonyms. Example:

para-cymen-7-ol				325	4-Isopropylbenzyl alcohol	
p-cymen-7-ol	p-cymen-7-ol				325	4-Isopropylbenzyl alcohol	

These two entries share the same CID (325), so they should be grouped together. PMR will then decide which name is best to keep.

cuminaldehyde	cuminaldehyde	cuminaldehyde	Q419952		326	4-Isopropylbenzaldehyde	
cuminal	cuminal	cuminaldehyde	Q419952		326	4-Isopropylbenzaldehyde	
octanal	

The cuminaldehyde entries have both a Wikidata ID (Q419952) and a PubChem CID (326).
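The grouping in these examples can be sketched as follows: collect names under a shared identifier and report only identifiers that carry more than one name. The row layout (`name`, `wikidata`, `cid` keys) and the function `group_by_id` are assumptions for illustration; the sample values are taken from the rows above.

```python
from collections import defaultdict


def group_by_id(rows, id_key):
    """Map each non-blank identifier to its names, keeping only groups of two or more."""
    groups = defaultdict(list)
    for row in rows:
        identifier = (row.get(id_key) or "").strip()
        if identifier:  # rows without this identifier cannot be grouped
            groups[identifier].append(row["name"])
    return {k: v for k, v in groups.items() if len(v) > 1}


rows = [
    {"name": "para-cymen-7-ol", "wikidata": "", "cid": "325"},
    {"name": "p-cymen-7-ol", "wikidata": "", "cid": "325"},
    {"name": "cuminaldehyde", "wikidata": "Q419952", "cid": "326"},
    {"name": "cuminal", "wikidata": "Q419952", "cid": "326"},
    {"name": "octanal", "wikidata": "", "cid": ""},
]

print(group_by_id(rows, "cid"))
print(group_by_id(rows, "wikidata"))
```

Running the grouping once per identifier type keeps the two passes independent, so entries that lack a Wikidata ID can still be grouped by CID.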

sort TSV file by WikidataID and remove synonyms

@ambarishK will sort the table in a spreadsheet on the WikidataID column: notFoundWIKIDATASortedPubChem.tsv. PMR will then edit this manually.

sort TSV file by PubchemCID and remove synonyms

@ambarishK will sort the table in a spreadsheet on the PubChemID column: notFoundWIKIDATAPubChemSorted.tsv. PMR will then edit this manually.

The recommitted files will normalize to a single reference for Wikidata and a single reference for PubChem. PMR will then resolve any remaining conflicts and fuzzy matches.
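The sorting is planned in a spreadsheet; a scripted alternative might look like the sketch below. It sorts TSV text on one column and moves rows with a blank identifier to the end. The column names `WikidataID` and `PubChemCID`, the function `sort_tsv`, and the sample rows are assumptions for illustration, not the layout of the actual files.

```python
import csv
import io


def sort_tsv(text, column):
    """Return TSV text sorted on one column, with blank values placed last.

    Values are compared as strings, which is enough to bring equal
    identifiers next to each other (numeric order is not guaranteed).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = list(reader)
    # Sort key: (is_blank, value); list.sort is stable, so ties keep input order.
    rows.sort(key=lambda r: (not (r[column] or "").strip(), (r[column] or "").strip()))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Illustrative rows only; blank cells stand for identifiers not yet looked up.
SAMPLE = (
    "name\tWikidataID\tPubChemCID\n"
    "cuminal\tQ419952\t326\n"
    "octanal\t\t\n"
    "p-cymen-7-ol\t\t325\n"
    "cuminaldehyde\tQ419952\t326\n"
)

print(sort_tsv(SAMPLE, "WikidataID"))
print(sort_tsv(SAMPLE, "PubChemCID"))
```

After sorting, rows that share an identifier sit next to each other, which makes the manual pass for choosing the preferred name easier.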

petermr, Dec 05 '19 11:12