kg-covid-19
Some edges in the SciBite co-occurrence data contain nodes not present in the nodes TSV file
Describe the bug
Some nodes that are referenced in the edge file for the co-occurrence data are missing from the nodes TSV file. (Note that co-occurrence data are not included in the merged graph we are producing and distributing, so this might be a lower priority ticket.)
To Reproduce
For example, these two CURIEs appear in the edges TSV but not in the nodes TSV:
CORD:8a06bfad2d9c3569f75cfc56d26ba24d21b9a51f
CORD:PMC3841265
(venv) $ python run.py transform -s ScibiteCordTransform
(venv) $ grep -c CORD:8a06bfad2d9c3569f75cfc56d26ba24d21b9a51f data/transformed/SciBite-CORD-19/entity_cooccurrence_*.tsv
./entity_cooccurrence_edges.tsv:767
./entity_cooccurrence_nodes.tsv:0
(venv) $ grep -c CORD:PMC3841265 data/transformed/SciBite-CORD-19/entity_cooccurrence*tsv
data/transformed/SciBite-CORD-19/entity_cooccurrence_edges.tsv:119
data/transformed/SciBite-CORD-19/entity_cooccurrence_nodes.tsv:0
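The grep commands above check one CURIE at a time. A more general check could be sketched as follows. This is only an illustration, not part of the kg-covid-19 code base: the function name `find_dangling_curies` is hypothetical, and the column names `id`, `subject`, and `object` are assumptions about the TSV headers that should be adjusted if the files use different ones.

```python
import csv


def find_dangling_curies(nodes_tsv, edges_tsv, id_col="id",
                         endpoint_cols=("subject", "object")):
    """Return {edge column: set of CURIEs in that column with no row in the nodes file}.

    Column names are assumptions about the TSV headers, not taken from the issue.
    """
    # Collect every node identifier declared in the nodes file.
    with open(nodes_tsv, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        node_ids = {row[id_col] for row in reader}

    # Scan the edges file and record endpoints that are not declared nodes.
    dangling = {col: set() for col in endpoint_cols}
    with open(edges_tsv, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            for col in endpoint_cols:
                curie = row[col]
                if curie not in node_ids:
                    dangling[col].add(curie)
    return dangling
```

Calling `find_dangling_curies` with the paths to `entity_cooccurrence_nodes.tsv` and `entity_cooccurrence_edges.tsv` returns one set per edge column, which makes it easy to see whether the dangling references are subjects, objects, or both.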
Expected behavior
These CURIEs should have entries in the co-occurrence nodes TSV file:
CORD:8a06bfad2d9c3569f75cfc56d26ba24d21b9a51f
CORD:PMC3841265
Version
145c7bba5b567ca838c4dc1fd1fb2d3342e429e8
Additional context
Co-occurrence data are not included in the merged graph we are producing and distributing, so this might be a lower priority ticket.
These are the missing nodes:
missing subject ComplexPortal:CPX-5682
missing subject ComplexPortal:CPX-5683
missing subject ComplexPortal:CPX-5685
missing subject ComplexPortal:CPX-5686
missing subject ComplexPortal:CPX-5687
missing subject ComplexPortal:CPX-5688
missing subject ComplexPortal:CPX-5689
missing subject ComplexPortal:CPX-5690
missing subject ComplexPortal:CPX-5691
missing subject ComplexPortal:CPX-5692
missing subject ComplexPortal:CPX-5742
missing subject UniProtKB:A0A679G4B7
missing subject UniProtKB:A0A679G4C7
missing subject UniProtKB:A0A679G4D8
missing subject UniProtKB:A0A679GC99
missing subject UniProtKB:A0A679H0U6
missing subject UniProtKB:A0A679HAG2
missing subject UniProtKB:A0A679HE24
missing subject UniProtKB:A0A6B9UY63
missing subject UniProtKB:A0A6B9VLF3
missing subject UniProtKB:A0A6B9VLF5
missing subject UniProtKB:A0A6B9VNL0
missing subject UniProtKB:A0A6B9VNN9
missing subject UniProtKB:A0A6B9VP19
missing subject UniProtKB:A0A6B9VSU5
missing subject UniProtKB:A0A6B9VSV5
missing subject UniProtKB:A0A6B9W0R7
missing subject UniProtKB:A0A6B9WFC7
missing subject UniProtKB:A0A6B9WIH5
missing subject UniProtKB:A0A6B9WIK4
missing subject UniProtKB:A0A6B9WIM9
missing subject UniProtKB:A0A6B9XUA0
missing subject UniProtKB:A0A6B9XX45
missing subject UniProtKB:A0A6C0M7K8
missing subject UniProtKB:A0A6C0M8P6
missing subject UniProtKB:A0A6C0N5E8
missing subject UniProtKB:A0A6C0N6C5
missing subject UniProtKB:A0A6C0N6E9
missing subject UniProtKB:A0A6C0NA72
missing subject UniProtKB:A0A6C0QEL8
missing subject UniProtKB:A0A6C0QEM7
missing subject UniProtKB:A0A6C0QEN3
missing subject UniProtKB:A0A6C0QFP9
missing subject UniProtKB:A0A6C0QFQ7
missing subject UniProtKB:A0A6C0R287
missing subject UniProtKB:A0A6C0R294
missing subject UniProtKB:A0A6C0RS15
missing subject UniProtKB:A0A6C0RSH1
missing subject UniProtKB:A0A6C0T6V0
missing subject UniProtKB:A0A6C0T6Z7
missing subject UniProtKB:A0A6C0VCY4
missing subject UniProtKB:A0A6C0WXA2
missing subject UniProtKB:A0A6C0X332
missing subject UniProtKB:A0A6C1BAC9
missing subject UniProtKB:A0A6C1EJY3
missing subject UniProtKB:A0A6C1F1G5
missing subject UniProtKB:P0DTD1-PRO_0000449644
thanks @namin
Just a heads-up that there were some errors on our side in reading the data. So you can feel free to ignore the list above. Things seem to be working now.
@justaddcoffee The edges now refer to a number of CHEMBL CURIEs that don't appear in the nodes:
CHEMBL.TARGET:CHEMBL391
CHEMBL.TARGET:CHEMBL4303835
CHEMBL.TARGET:CHEMBL614058
CHEMBL.TARGET:CHEMBL612558
CHEMBL.TARGET:CHEMBL4303840
Thanks for spotting this @gregr - I'm going to break this out into a new ticket