
md5sum mismatch

Open senwu opened this issue 4 years ago • 3 comments

Thanks for sharing this awesome work. Really appreciate!!!

I want to use this revised TACRED dataset for my study, but my MD5 checksums don't match the ones listed in the README.

Here are my MD5 checksums:

1c090c0e3861d6ecccfd199fdf439bed  train.json 
393e7200a63ffd10a16072cbbee464dd  dev.json
d287fb2377747b74e6feae2e2bcd9264  dev_rev.json
aba500ef2f60c32bc41e366383e8cda8  test.json
4c9dfcb4c8d523420dbf0f34858362f3  test_rev.json

Also, from the patch files, I found that there are 1590 samples in the dev file and 936 samples in the test file. (It seems those numbers don't match the numbers reported in the paper?)

Please let me know if I am doing anything wrong. Thanks!

senwu avatar Aug 07 '20 22:08 senwu

@SenWu Thanks for your interest in our work.

Let's see if we can narrow down the issue.

The checksums of the original TACRED files (train.json, dev.json, and test.json) match, so those are fine. Could you tell me a little more about your setup, e.g., operating system and Python version? It could be that writing a JSON file behaves differently on different platforms, e.g., in its line endings.
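To illustrate why line endings alone could explain a mismatch, here is a minimal sketch. The toy record is made up and is not taken from TACRED; it only shows that the same JSON content serialized with different line endings yields different MD5 checksums.

```python
import hashlib
import json


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical toy data, only for illustration (not real TACRED content).
records = [{"id": "example-1", "relation": "no_relation"}]

# Serialize once with Unix line endings, then derive a Windows-style copy.
unix_bytes = json.dumps(records, indent=2).encode("utf-8")
windows_bytes = unix_bytes.replace(b"\n", b"\r\n")

# The parsed content is identical, but the raw bytes (and so the MD5) differ.
same_content = json.loads(unix_bytes) == json.loads(windows_bytes)
same_checksum = md5_hex(unix_bytes) == md5_hex(windows_bytes)

print("same content: ", same_content)
print("same checksum:", same_checksum)
```

If this is what happens here, the files would still contain the same samples even though the checksums differ.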

Your observation is correct: there are fewer samples in the patch files than reported in the paper. The reason is that the number of "revised" samples also includes those that were assigned a second label by our annotators. As the TACRED format does not support multiple labels per sample, we chose not to patch those instances.

ChristophAlt avatar Aug 11 '20 08:08 ChristophAlt

Just wanted to report, in case it's helpful, that I originally had the same MD5 checksum problem as @SenWu when using Python 2. When running with Python 3, the MD5 checksums were consistent with the ones published by @ChristophAlt.

I have not spent time identifying if the problem is just JSON formatting differences or whether there are other potentially important content differences.
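One way to tell the two cases apart would be to compare the files after parsing rather than byte by byte. The following is a rough sketch, not something I have run on the actual dataset; the helper names (`file_md5`, `canonical_md5`, `diagnose`) are made up for this example.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path):
    """MD5 of the raw bytes on disk (what md5sum reports)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def canonical_md5(path):
    """MD5 of a re-serialized form that ignores whitespace and key order."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def diagnose(path_a, path_b):
    """Classify the difference between two JSON files."""
    if file_md5(path_a) == file_md5(path_b):
        return "identical bytes"
    if canonical_md5(path_a) == canonical_md5(path_b):
        return "formatting only"
    return "content differs"
```

Calling `diagnose` on the same output file generated once under Python 2 and once under Python 3 should return "formatting only" if the samples themselves are unchanged, and "content differs" otherwise.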

liviosoares avatar Sep 29 '20 13:09 liviosoares

@liviosoares Thank you! Your feedback is very much appreciated. I'll try to identify the root cause of the problem.

ChristophAlt avatar Oct 02 '20 15:10 ChristophAlt