
Some edges in the SciBite co-occurrence data contain nodes not present in the nodes TSV file

Open justaddcoffee opened this issue 4 years ago • 5 comments

Describe the bug

Some nodes that are present in the edge file for the co-occurrence data are not present in the nodes TSV file. (Note that co-occurrence data are not included in the merged graph we are producing and distributing, so this might be a lower-priority ticket.)

To Reproduce

For example, these two CURIEs appear in the edges file but not in the nodes TSV: CORD:8a06bfad2d9c3569f75cfc56d26ba24d21b9a51f and CORD:PMC3841265

(venv) $ python run.py transform -s ScibiteCordTransform
(venv) $ grep -c CORD:8a06bfad2d9c3569f75cfc56d26ba24d21b9a51f data/transformed/SciBite-CORD-19/entity_cooccurrence_*.tsv 
./entity_cooccurrence_edges.tsv:767
./entity_cooccurrence_nodes.tsv:0
(venv) $ grep -c CORD:PMC3841265 data/transformed/SciBite-CORD-19/entity_cooccurrence*tsv
data/transformed/SciBite-CORD-19/entity_cooccurrence_edges.tsv:119
data/transformed/SciBite-CORD-19/entity_cooccurrence_nodes.tsv:0
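
To list every dangling CURIE at once instead of grepping for them one by one, something like the following might help. It is only a rough sketch: the function names are made up for illustration, and it assumes both TSV files have a header row, that the nodes file has an `id` column, and that the edges file has `subject` and `object` columns. Adjust the column names if the files differ.

```python
import csv
from collections import Counter


def read_column_values(path, columns):
    """Yield non-empty values of the given columns from a TSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            for column in columns:
                value = row.get(column)
                if value:
                    yield value


def find_dangling_ids(nodes_path, edges_path, id_column="id",
                      endpoint_columns=("subject", "object")):
    """Count edge endpoints whose ID has no row in the nodes file.

    The column names are assumptions, not taken from the transform code.
    """
    node_ids = set(read_column_values(nodes_path, [id_column]))
    dangling = Counter()
    for endpoint in read_column_values(edges_path, endpoint_columns):
        if endpoint not in node_ids:
            dangling[endpoint] += 1
    return dangling
```

Calling `find_dangling_ids` with the paths to `entity_cooccurrence_nodes.tsv` and `entity_cooccurrence_edges.tsv` returns a `Counter` keyed by CURIE. The counts are endpoint occurrences, so they can differ from `grep -c`, which counts matching lines.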

Expected behavior

These CURIEs should have entries in the co-occurrence nodes TSV file: CORD:8a06bfad2d9c3569f75cfc56d26ba24d21b9a51f and CORD:PMC3841265

Version

145c7bba5b567ca838c4dc1fd1fb2d3342e429e8

Additional context

Co-occurrence data are not included in the merged graph we are producing and distributing, so this might be a lower priority ticket.

justaddcoffee, Jun 10 '20 17:06

These are the missing nodes:

missing subject ComplexPortal:CPX-5682
missing subject ComplexPortal:CPX-5683
missing subject ComplexPortal:CPX-5685
missing subject ComplexPortal:CPX-5686
missing subject ComplexPortal:CPX-5687
missing subject ComplexPortal:CPX-5688
missing subject ComplexPortal:CPX-5689
missing subject ComplexPortal:CPX-5690
missing subject ComplexPortal:CPX-5691
missing subject ComplexPortal:CPX-5692
missing subject ComplexPortal:CPX-5742
missing subject UniProtKB:A0A679G4B7
missing subject UniProtKB:A0A679G4C7
missing subject UniProtKB:A0A679G4D8
missing subject UniProtKB:A0A679GC99
missing subject UniProtKB:A0A679H0U6
missing subject UniProtKB:A0A679HAG2
missing subject UniProtKB:A0A679HE24
missing subject UniProtKB:A0A6B9UY63
missing subject UniProtKB:A0A6B9VLF3
missing subject UniProtKB:A0A6B9VLF5
missing subject UniProtKB:A0A6B9VNL0
missing subject UniProtKB:A0A6B9VNN9
missing subject UniProtKB:A0A6B9VP19
missing subject UniProtKB:A0A6B9VSU5
missing subject UniProtKB:A0A6B9VSV5
missing subject UniProtKB:A0A6B9W0R7
missing subject UniProtKB:A0A6B9WFC7
missing subject UniProtKB:A0A6B9WIH5
missing subject UniProtKB:A0A6B9WIK4
missing subject UniProtKB:A0A6B9WIM9
missing subject UniProtKB:A0A6B9XUA0
missing subject UniProtKB:A0A6B9XX45
missing subject UniProtKB:A0A6C0M7K8
missing subject UniProtKB:A0A6C0M8P6
missing subject UniProtKB:A0A6C0N5E8
missing subject UniProtKB:A0A6C0N6C5
missing subject UniProtKB:A0A6C0N6E9
missing subject UniProtKB:A0A6C0NA72
missing subject UniProtKB:A0A6C0QEL8
missing subject UniProtKB:A0A6C0QEM7
missing subject UniProtKB:A0A6C0QEN3
missing subject UniProtKB:A0A6C0QFP9
missing subject UniProtKB:A0A6C0QFQ7
missing subject UniProtKB:A0A6C0R287
missing subject UniProtKB:A0A6C0R294
missing subject UniProtKB:A0A6C0RS15
missing subject UniProtKB:A0A6C0RSH1
missing subject UniProtKB:A0A6C0T6V0
missing subject UniProtKB:A0A6C0T6Z7
missing subject UniProtKB:A0A6C0VCY4
missing subject UniProtKB:A0A6C0WXA2
missing subject UniProtKB:A0A6C0X332
missing subject UniProtKB:A0A6C1BAC9
missing subject UniProtKB:A0A6C1EJY3
missing subject UniProtKB:A0A6C1F1G5
missing subject UniProtKB:P0DTD1-PRO_0000449644

namin, Jun 26 '20 21:06

thanks @namin

justaddcoffee, Jun 26 '20 22:06

Just a heads-up that there were some errors on our side in reading the data. So you can feel free to ignore the list above. Things seem to be working now.

namin, Jul 03 '20 02:07

@justaddcoffee The edges now refer to a bunch of CHEMBL CURIEs that don't appear in the nodes file:

CHEMBL.TARGET:CHEMBL391
CHEMBL.TARGET:CHEMBL4303835
CHEMBL.TARGET:CHEMBL614058
CHEMBL.TARGET:CHEMBL612558
CHEMBL.TARGET:CHEMBL4303840

gregr, Jul 03 '20 02:07

Thanks for spotting this @gregr - I'm going to break this out into a new ticket

justaddcoffee, Jul 04 '20 14:07