deft_corpus

Missing relations

Open davletov-aa opened this issue 5 years ago • 4 comments

I found 266 examples (context windows) in the train and dev sets that contain tokens with root_id "0" and some tag_id, say TXXX, but no token in the same example has root_id TXXX.

For example, see the T105 tokens here:

data/source_txt/t3_physics_2_101.deft

    TOKEN        ROOT_ID  TAG_ID  RELATION
    3161         -1       -1      0
    .            -1       -1      0
    Another      -1       -1      0
    is           -1       -1      0
    what         -1       -1      0
    Democritus   -1       -1      0
    in           -1       -1      0
    particular   -1       -1      0
    believed     -1       -1      0
    —            -1       -1      0
    that         -1       -1      0
    there        0        T106    0
    is           0        T106    0
    a            0        T106    0
    smallest     0        T106    0
    unit         0        T106    0
    that         0        T106    0
    can          0        T106    0
    not          0        T106    0
    be           0        T106    0
    further      0        T106    0
    subdivided   0        T106    0
    .            -1       -1      0
    Democritus   -1       -1      0
    called       -1       -1      0
    this         T106     T194    Refers-To
    the          0        T105    0
    atom         0        T105    0
    .            -1       -1      0
    We           -1       -1      0
    now          -1       -1      0
    know         -1       -1      0
    that         -1       -1      0
    atoms        -1       -1      0
    themselves   -1       -1      0
    can          -1       -1      0
    be           -1       -1      0
    subdivided   -1       -1      0
    ,            -1       -1      0
    but          -1       -1      0
    their        -1       -1      0
    identity     -1       -1      0
    is           -1       -1      0
    destroyed    -1       -1      0
    in           -1       -1      0
    the          -1       -1      0
    process      -1       -1      0
    ,            -1       -1      0
    so           -1       -1      0
    the          -1       -1      0
    Greeks       -1       -1      0
    were         -1       -1      0
    correct      -1       -1      0
    in           -1       -1      0
    a            -1       -1      0
    respect      -1       -1      0
    .            -1       -1      0
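The check described in this report can be sketched in a few lines of Python. This is only a minimal sketch, not the script that produced the 266 count: it assumes the rows have already been parsed into `(token, root_id, tag_id, relation)` tuples in the column order of the example, and the name `find_unreferenced_roots` is made up for illustration.

```python
def find_unreferenced_roots(rows):
    """Return tag_ids whose root_id is "0" but that no token points to.

    `rows` is an iterable of (token, root_id, tag_id, relation) tuples,
    following the column order of the example (an assumption about how
    the .deft rows were parsed).
    """
    root_tags = set()    # tag_ids marked as roots (root_id == "0")
    referenced = set()   # tag_ids that appear as some token's root_id
    for _token, root_id, tag_id, _relation in rows:
        if tag_id == "-1":
            continue     # token carries no annotation
        if root_id == "0":
            root_tags.add(tag_id)
        else:
            referenced.add(root_id)
    return root_tags - referenced


# Annotated rows from the Democritus example (most tokens omitted).
example = [
    ("there", "0", "T106", "0"),
    ("subdivided", "0", "T106", "0"),
    ("this", "T106", "T194", "Refers-To"),
    ("the", "0", "T105", "0"),
    ("atom", "0", "T105", "0"),
]
print(find_unreferenced_roots(example))  # {'T105'}
```

T106 is not reported because the token "this" (T194) points to it; T105 is reported because nothing does.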

davletov-aa, Feb 19 '20 13:02

Thanks for reporting. I'm looking into this now. It has to do with the fix we settled on for long-distance relationships (i.e. Secondary Def --> Definition --> Term), which was to mark only the final tag in the relationship as the root. That gives relationships in the .deft files like the following, where the Term is the root: (Secondary Def, T1, T2, Supplements), (Definition, T2, T3, Direct Defines), (Term, T3, 0, 0)
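To make the chain concrete, here is a rough sketch that follows `root_id` links from a tag until it reaches the tag marked as root. It assumes the tuples are `(tag, tag_id, root_id, relation)` as in the example in this comment; `resolve_root` is a hypothetical helper, not part of the corpus tooling.

```python
def resolve_root(tag_id, rows):
    """Follow root_id links from tag_id to the final root of its chain.

    `rows` holds (tag, tag_id, root_id, relation) tuples; a root_id of
    "0" marks the end of the chain.
    """
    root_of = {tid: root for _tag, tid, root, _rel in rows}
    seen = set()
    while root_of[tag_id] != "0":
        if tag_id in seen:
            raise ValueError(f"cycle detected at {tag_id}")
        seen.add(tag_id)
        tag_id = root_of[tag_id]
    return tag_id


chain = [
    ("Secondary Def", "T1", "T2", "Supplements"),
    ("Definition", "T2", "T3", "Direct Defines"),
    ("Term", "T3", "0", "0"),
]
print(resolve_root("T1", chain))  # T3
```

A `KeyError` here would mean some `root_id` points at a tag that is not present in the rows at all.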

sashaspala, Feb 19 '20 16:02

I take it back - on inspection this is actually a problem with overlapping relationships. In this case, there was a referential-definition (this) that "refers-to" the definition (there is a smallest unit that cannot be further subdivided) and also "indirect-defines" the term (the atom). Someone brought this up in the forums yesterday and we're aware of the problem. I'm working on finding a fix right now that handles this scenario without undermining our existing data format.
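One way to picture the overlap: if each tag can carry only one root_id and one relation, as the four-column rows in the report suggest, then a tag with two outgoing relations has to lose one of them. The sketch below illustrates that under this assumption only. The edge list, the spelling of the `Indirect-Defines` label, and the helper `flatten_to_single_root` are made up for illustration and say nothing about how the corpus files are actually generated.

```python
def flatten_to_single_root(edges):
    """Keep the first (root_id, relation) per tag_id; collect the rest.

    `edges` is a list of (tag_id, root_id, relation) tuples. This mimics
    a format with a single root_id column per tag (an assumption).
    """
    kept = {}
    dropped = []
    for tag_id, root_id, relation in edges:
        if tag_id in kept:
            dropped.append((tag_id, root_id, relation))
        else:
            kept[tag_id] = (root_id, relation)
    return kept, dropped


# Hypothetical edges for "this" (T194) in the Democritus example.
edges = [
    ("T194", "T106", "Refers-To"),
    ("T194", "T105", "Indirect-Defines"),
]
kept, dropped = flatten_to_single_root(edges)
print(kept)     # the relation that fits in the single column
print(dropped)  # the relation that has nowhere to go
```

With the second edge dropped, T105 ends up as a root that nothing points to, which matches what the report shows.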

sashaspala, Feb 19 '20 17:02

Hi, there are still problems with missing relations in the train and dev sets (I believe I have the current version of the data; please check):

    {'data/source_txt/t3_physics_2_101.deft': {'T105', 'T109', 'T134', 'T145', 'T31'},
     'data/source_txt/t6_sociology_1_101.deft': {'T125', 'T142', 'T58'},
     'data/source_txt/t1_biology_1_505.deft': {'T189', 'T195', 'T241', 'T246', 'T282', 'T283', 'T72', 'T74', 'T86'},
     'data/source_txt/t2_history_0_0.deft': {'T151', 'T162', 'T47', 'T81', 'T95'},
     'data/source_txt/t6_sociology_0_101.deft': {'T76', 'T98'},
     'data/source_txt/t2_history_2_101.deft': {'T111', 'T131'},
     'data/source_txt/t7_government_1_101.deft': {'T103', 'T116'},
     'data/source_txt/t7_government_1_404.deft': {'T13'},
     'data/source_txt/t1_biology_0_303.deft': {'T129', 'T131', 'T176', 'T26', 'T296', 'T79', 'T82', 'T9', 'T94'},
     'data/source_txt/t1_biology_1_404.deft': {'T113', 'T173', 'T194', 'T195', 'T223', 'T231', 'T36', 'T7'},
     'data/source_txt/t5_economic_1_0.deft': {'T103', 'T140', 'T154', 'T50', 'T73', 'T89', 'T95'},
     'data/source_txt/t1_biology_2_404.deft': {'T113', 'T150', 'T167', 'T205', 'T228', 'T295', 'T299', 'T42'},
     'data/source_txt/t4_psychology_2_0.deft': {'T127', 'T204', 'T209', 'T232', 'T38'},
     'data/source_txt/t3_physics_0_101.deft': {'T157', 'T174', 'T39'},
     'data/source_txt/t7_government_0_303.deft': {'T20'},
     'data/source_txt/t5_economic_0_202.deft': {'T137'},
     'data/source_txt/t5_economic_1_202.deft': {'T47'},
     'data/source_txt/t4_psychology_0_303.deft': {'T17'},
     'data/source_txt/t7_government_1_0.deft': {'T16'},
     'data/source_txt/t1_biology_2_606.deft': {'T207', 'T259', 'T28', 'T37', 'T59', 'T83'},
     'data/source_txt/t4_psychology_1_0.deft': {'T123', 'T165', 'T200', 'T216', 'T221', 'T32'},
     'data/source_txt/t2_history_2_0.deft': {'T146', 'T151', 'T179', 'T25', 'T53', 'T76'},
     'data/source_txt/t7_government_1_303.deft': {'T13'},
     'data/source_txt/t1_biology_1_303.deft': {'T105', 'T15', 'T86'},
     'data/source_txt/t7_government_0_202.deft': {'T31', 'T35'},
     'data/source_txt/t1_biology_0_101.deft': {'T131', 'T261', 'T82'},
     'data/source_txt/t4_psychology_2_101.deft': {'T198', 'T31', 'T7'},
     'data/source_txt/t4_psychology_0_202.deft': {'T102', 'T21', 'T35', 'T36', 'T83'},
     'data/source_txt/t5_economic_0_101.deft': {'T1', 'T180', 'T7', 'T86'},
     'data/source_txt/t2_history_1_0.deft': {'T110', 'T158', 'T23', 'T51', 'T69', 'T7'},
     'data/source_txt/t1_biology_2_505.deft': {'T204', 'T229', 'T36'},
     'data/source_txt/t6_sociology_0_0.deft': {'T147', 'T40', 'T54', 'T82'},
     'data/source_txt/t1_biology_2_303.deft': {'T227', 'T36', 'T61'},
     'data/source_txt/t1_biology_1_0.deft': {'T143', 'T177', 'T238', 'T27', 'T47', 'T80'},
     'data/source_txt/t1_biology_0_0.deft': {'T103', 'T105', 'T109', 'T139', 'T151', 'T193', 'T211'},
     'data/source_txt/t7_government_1_202.deft': {'T88', 'T97'},
     'data/source_txt/t1_biology_2_101.deft': {'T127', 'T236', 'T243', 'T257', 'T261'},
     'data/source_txt/t2_history_0_101.deft': {'T9', 'T95'},
     'data/source_txt/t4_psychology_0_101.deft': {'T228', 'T248', 'T272', 'T28'},
     'data/source_txt/t3_physics_1_101.deft': {'T113', 'T143', 'T212', 'T31', 'T74', 'T98'},
     'data/source_txt/t3_physics_1_0.deft': {'T123', 'T126', 'T135', 'T152', 'T34', 'T43'},
     'data/source_txt/t1_biology_0_202.deft': {'T101', 'T120', 'T151', 'T159', 'T169', 'T281', 'T292', 'T298', 'T314', 'T51', 'T52', 'T56', 'T6', 'T64', 'T70', 'T85'},
     'data/source_txt/t5_economic_2_0.deft': {'T105', 'T168', 'T171', 'T63', 'T77', 'T89'},
     'data/source_txt/t7_government_2_0.deft': {'T20', 'T31', 'T36', 'T6'},
     'data/source_txt/t1_biology_1_606.deft': {'T127', 'T136', 'T18', 'T213', 'T230', 'T28', 'T89', 'T94', 'T99'},
     'data/source_txt/t4_psychology_2_202.deft': {'T38'},
     'data/source_txt/t7_government_2_202.deft': {'T31'},
     'data/source_txt/t5_economic_2_101.deft': {'T65'},
     'data/source_txt/t7_government_0_404.deft': {'T32', 'T36', 'T43'},
     'data/source_txt/t1_biology_1_101.deft': {'T100', 'T180', 'T188', 'T254', 'T54', 'T55'},
     'data/source_txt/t6_sociology_2_101.deft': {'T31'},
     'data/source_txt/t3_physics_2_0.deft': {'T135', 'T182', 'T19', 'T8', 'T96'},
     'data/source_txt/t2_history_1_101.deft': {'T72', 'T81'},
     'data/source_txt/t1_biology_0_606.deft': {'T253', 'T3', 'T85'},
     'data/source_txt/t1_biology_0_404.deft': {'T15', 'T159', 'T232', 'T246', 'T288', 'T346', 'T38', 'T62', 'T77', 'T9'},
     'data/source_txt/t5_economic_0_0.deft': {'T145'},
     'data/source_txt/t5_economic_2_202.deft': {'T140', 'T2', 'T93'},
     'data/source_txt/t4_psychology_0_0.deft': {'T212', 'T4', 'T72', 'T78', 'T82'},
     'data/source_txt/t1_biology_2_0.deft': {'T39', 'T59', 'T72', 'T98'},
     'data/source_txt/t4_psychology_1_101.deft': {'T157', 'T178', 'T179', 'T189', 'T210'},
     'data/source_txt/t1_biology_1_202.deft': {'T116', 'T16', 'T163', 'T172', 'T271', 'T30', 'T40', 'T57'},
     'data/source_txt/t4_psychology_1_202.deft': {'T113', 'T155', 'T28', 'T4', 'T44'},
     'data/source_txt/t7_government_0_101.deft': {'T72'},
     'data/source_txt/t1_biology_2_202.deft': {'T194', 'T203', 'T230', 'T263', 'T77'},
     'data/source_txt/t3_physics_0_0.deft': {'T29'},
     'data/source_txt/t7_government_2_101.deft': {'T31'},
     'data/source_txt/t7_government_2_303.deft': {'T7', 'T9'}}

davletov-aa, Mar 11 '20 13:03

And here are the few examples that remain:

    {'data/source_txt/t1_biology_1_505.deft': {'T190', 'T195', 'T243', 'T246', 'T282', 'T283'},
     'data/source_txt/t1_biology_0_303.deft': {'T129', 'T131', 'T176', 'T296', 'T78', 'T94'},
     'data/source_txt/t1_biology_0_101.deft': {'T261'},
     'data/source_txt/t4_psychology_0_101.deft': {'T228', 'T248'},
     'data/source_txt/t5_economic_2_0.deft': {'T107', 'T78'}}
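Two reports like these can be compared with plain set operations to see which tags were resolved and which are new. A minimal sketch, using only two files copied from the reports in this thread; `diff_reports` is a hypothetical helper.

```python
def diff_reports(before, after):
    """Per file, return (resolved, remaining, new) tag_id sets.

    Both arguments map a .deft path to a set of tag_ids, in the same
    shape as the dicts posted in this thread.
    """
    result = {}
    for path in sorted(set(before) | set(after)):
        old = before.get(path, set())
        new = after.get(path, set())
        result[path] = (old - new, old & new, new - old)
    return result


before = {
    "data/source_txt/t1_biology_0_101.deft": {"T131", "T261", "T82"},
    "data/source_txt/t5_economic_2_0.deft": {"T105", "T168", "T171", "T63", "T77", "T89"},
}
after = {
    "data/source_txt/t1_biology_0_101.deft": {"T261"},
    "data/source_txt/t5_economic_2_0.deft": {"T107", "T78"},
}
for path, (resolved, remaining, new) in diff_reports(before, after).items():
    print(path, sorted(resolved), sorted(remaining), sorted(new))
```

Tags that show up only in the later report (T107 and T78 here) could mean the IDs shifted between data versions, though nothing in the thread confirms that.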

davletov-aa, Mar 11 '20 17:03