EMNLP2018-JMEE

Evaluation function is not right

Open airkid opened this issue 6 years ago • 8 comments

https://github.com/lx865712528/JMEE/blob/494451d5852ba724d273ee6f97602c60a5517446/enet/testing.py#L72
If I add the following check before this line:

```python
assert len(arguments) == len(arguments_)
```

there will be an assertion error. I believe this is because `arguments` holds the gold arguments while `arguments_` holds only the predicted arguments, whose length changes dynamically during training.
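A minimal sketch of the mismatch, with hypothetical `(start, end, role)` tuples (not values from the actual data):

```python
arguments = [(3, 5, 11), (7, 9, 9)]              # gold arguments
arguments_ = [(0, 2, 2), (3, 5, 11), (7, 9, 9)]  # predicted arguments

# zip() silently truncates to the shorter list, so the original loop
# compares misaligned pairs; the added assert exposes the mismatch.
assert len(arguments) == len(arguments_)  # raises AssertionError (2 != 3)
```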

airkid avatar Mar 01 '19 09:03 airkid

This computes the score incorrectly: if the model predicts a spurious entity before all the correct ones, the predictions are no longer aligned with the gold list and the score is 0, as shown in this example:

- gold roles: `[(3,5,11), (7,9,9)]`
- predicted roles: `[(0,2,2), (3,5,11), (7,9,9)]`
- first iteration: compare `(3,5,11)` with `(0,2,2)` -> fail
- second iteration: compare `(7,9,9)` with `(3,5,11)` -> fail, even though `(3,5,11)` was in the gold annotations

Here is a working version that also generates a per-class report (it requires tabulate):

calculate_sets_1.txt
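For readers who cannot open the attachment, a minimal sketch of the failure mode and the set-based fix, using the role tuples from the example above (my own illustration, not the attached file):

```python
gold = [(3, 5, 11), (7, 9, 9)]
preds = [(0, 2, 2), (3, 5, 11), (7, 9, 9)]

# Positional comparison, as in the repo's zip-based loop (item[2] is the
# role): every pair is misaligned, so nothing matches.
positional = sum(1 for g, p in zip(gold, preds) if g[2] == p[2])
print(positional)  # 0

# Matching each prediction against *any* gold annotation instead.
set_based = len(set(gold) & set(preds))
print(set_based)  # 2
```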

DorianKodelja avatar Mar 01 '19 16:03 DorianKodelja

Hi @airkid @DorianKodelja, I came to the same conclusion as you. According to the DMCNN paper:

> An argument is correctly classified if its event subtype, offsets and argument role match those of any of the reference argument mentions

```python
for item, item_ in zip(arguments, arguments_):
```

The code above from this repo does not match that idea, so I replaced that line with:

```python
ct += len(set(arguments) & set(arguments_))  # count any prediction that matches a gold argument
# for item, item_ in zip(arguments, arguments_):
#     if item[2] == item_[2]:
#         ct += 1
```
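For completeness, here is a sketch of how that set-based count could feed micro-averaged precision/recall/F1. The function name and the per-sentence list-of-tuples structure are my assumptions for illustration, not the repo's exact API:

```python
def argument_prf(gold_args, pred_args):
    """Micro-averaged P/R/F1 where a prediction counts as correct if it
    matches any gold (offsets, role) mention. gold_args and pred_args are
    lists of per-sentence lists of hashable tuples. Hypothetical helper."""
    ct = num_gold = num_pred = 0
    for gold, pred in zip(gold_args, pred_args):
        ct += len(set(gold) & set(pred))  # matches against any gold mention
        num_gold += len(gold)
        num_pred += len(pred)
    p = ct / num_pred if num_pred else 0.0
    r = ct / num_gold if num_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Note that converting to sets collapses duplicate mentions within a sentence, which fits the "any of the reference argument mentions" reading of the criterion.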

mikelkl avatar Mar 07 '19 10:03 mikelkl

Hi @mikelkl, I believe this is a correct implementation of the F1 score for this task.
Have you reproduced the experiment? I can only reach an F1 score below 0.4 on the test data.

airkid avatar Mar 07 '19 10:03 airkid

Hi @airkid, I got a slightly higher result, but it is on my own randomly split test set, so I have no idea whether it faithfully reflects the paper's result.

mikelkl avatar Mar 07 '19 11:03 mikelkl

Hi @mikelkl, can you try it on the data split updated by the author?
My result is still far from the paper's.

airkid avatar Mar 07 '19 13:03 airkid

Hi @airkid, I'm afraid I cannot do that because I don't have the ACE 2005 English data.

mikelkl avatar Mar 11 '19 09:03 mikelkl

Hi @airkid, would you please tell me the result you got? I only reached F1 = 0.64 on trigger classification.

carrie0307 avatar Sep 05 '19 11:09 carrie0307

> https://github.com/lx865712528/JMEE/blob/494451d5852ba724d273ee6f97602c60a5517446/enet/testing.py#L72
> If I add `assert len(arguments) == len(arguments_)` before this line, there will be an assertion error. I believe this is because `arguments` holds the gold arguments while `arguments_` holds only the predicted arguments, whose length changes dynamically during training.

Hi,

If you've tried their code, would you tell me your reproduced results on trigger detection and argument detection?

rhythmswing avatar Jul 15 '20 05:07 rhythmswing