Questions about data formatting

Open id4thomas opened this issue 3 years ago • 2 comments

Hi, I am currently trying to reproduce your work (specifically COINS GR) and have a few questions about the training data.

From your paper, it seems the training data for the Knowledge Model would be

  • [SOS] S1 S2 [SEP] S5 [EOS] S2 # EFFECT # Effect2
  • [SOS] S1 S2 [SEP] S5 [EOS] S5 # CAUSE # Cause2
  • [SOS] S1 S2 S3 [SEP] S5 [EOS] S3 # EFFECT # Effect3
  • [SOS] S1 S2 S3 [SEP] S5 [EOS] S5 # CAUSE # Cause3

and for the Story Model it would be

  • Cause2 [SEP] Effect2 [EOK] [SOS] S1 S2 [SEP] S5 [EOS] S3
  • Cause3 [SEP] Effect3 [EOK] [SOS] S1 S2 S3 [SEP] S5 [EOS] S3
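
A minimal sketch of the two formats as read above, to make the question concrete. The helper names `knowledge_example` and `story_example` and the sentence placeholders are hypothetical, not taken from the COINS code; the special tokens are written exactly as in the lists above, and the layout is only this reading of the paper, not a confirmed format.

```python
def format_story(context, last):
    """Incomplete story: context sentences, then the final sentence after [SEP]."""
    return "[SOS] " + " ".join(context) + " [SEP] " + last + " [EOS]"


def knowledge_example(context, last, anchor, relation, inference):
    """Assumed Knowledge Model line: story, anchor sentence, relation, target inference."""
    return f"{format_story(context, last)} {anchor} # {relation} # {inference}"


def story_example(context, last, cause, effect, target):
    """Assumed Story Model line: inferences, [EOK], story, target sentence."""
    return f"{cause} [SEP] {effect} [EOK] {format_story(context, last)} {target}"


k = knowledge_example(["S1", "S2"], "S5", "S2", "EFFECT", "Effect2")
s = story_example(["S1", "S2"], "S5", "Cause2", "Effect2", "S3")
print(k)
print(s)
```

With these placeholder inputs the two printed lines reproduce the first bullet of each list above.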

But looking at the part where you load the data (https://github.com/Heidelberg-NLP/COINS/blob/main/model/src/data/conceptnet.py), it is unclear which field corresponds to which. Also, the data downloaded with the given script doesn't match the format used in the rest of the code.

It would be helpful if you could provide a data sample for each of the Knowledge and Story Models, or the model weights if possible.

Thank you

id4thomas, Jun 16 '22 09:06

Hi Song,

Sorry for the delayed reply. You are looking at the file for the Story Model. Line 93 reads the input, which is in the following format:

self.masks[split]["total"] = [(len(i[0]), len(i[1]), len(i[2]), len(i[3]), len(i[4]), len(i[5]), len(i[6]), len(i[7]), len(i[8]), len(i[9]), len([10])) for i in sequences[split]]

During training, i is a tab-separated (\t) record of the following form:
Incomplete Story(i.e, S1, S2 [SEP] S5) #Effect# S2 \t Ouput_Effect_S2 \t Incomplete Story(i.e, S1, S2 [SEP] S5) #Cause# S5 \t Ouput_Cause_S5 \t Incomplete Story(i.e, S1, S2 [SEP] S5) \t Incomplete Story(i.e, S1, S2 [SEP] S5) [SEP] Ouput_Effect_S2 [SEP] Ouput_Cause_S5 \t Output_S3 \t Incomplete Story(i.e, S1, S2 S3 [SEP] S5) #Effect# S3 \t Ouput_Effect_S3 \t Incomplete Story(i.e, S1, S2 S3 [SEP] S5) #Cause# S5 \t Ouput_Cause_S5 \t Incomplete Story(i.e, S1, S2 S3 [SEP] S5) \t Incomplete Story(i.e, S1, S2 S3 [SEP] S5) [SEP] Ouput_Effect_S3 [SEP] Ouput_Cause_S5 \t Output_S4 \t S2 +'\t'+ S1 +' '+ S2 +'\t'+ S5+ '\n'
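
A small sketch of what the quoted masks line computes, using a toy stand-in for sequences[split]. The contents of `example` are hypothetical. One detail worth noting: as quoted, the last term is len([10]), the length of a one-element list, so it is 1 for every example. Whether len(i[10]) was intended is not clear from this thread, so the second variant below is only an assumption.

```python
# Hypothetical stand-in for sequences[split]: one example made of 11 token lists.
example = [list(range(n)) for n in (3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5)]
sequences = [example]

# As quoted above: ten real lengths, then len([10]), which is always 1.
masks_as_quoted = [
    tuple([len(i[k]) for k in range(10)] + [len([10])]) for i in sequences
]

# Assumed alternative: the length of the eleventh sequence, i[10].
masks_indexed = [tuple(len(i[k]) for k in range(11)) for i in sequences]

print(masks_as_quoted[0])
print(masks_indexed[0])
```

The two tuples differ only in their last entry, which is where the quoted expression stops indexing into i.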

I hope this answers your question. Feel free to ask me any questions.

debjitpaul, Jun 27 '22 10:06

Thank you for the feedback!

However, the given example is still hard to understand.

Considering both files below

https://github.com/Heidelberg-NLP/COINS/blob/main/model/src/data/conceptnet.py

https://github.com/Heidelberg-NLP/COINS/blob/main/model/src/train/batch.py

In the for loop of the batch_conceptnet_generate function (line 92):

  • the names i1, o1, ... are taken from line 257 of conceptnet.py (do_example)
  • seq[] is taken from line 127 of conceptnet.py onwards (make_tensors)

when i==0

  • [:,0,0,:]: input_knowledge → seq[0] + seq[1] → i1 + o1
  • [:,1,0,:]: input_story_completion → seq[2] + seq[3] → i2 + o2

when i==1

  • [:,0,1,:]: input_knowledge → seq[4] + seq[5] → i3 + o3
  • [:,1,1,:]: input_story_completion → seq[6] + seq[7] → i4 + o4
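
The index pattern in the two cases above can be sketched as follows. This reflects only the reading of the loop given here, with plain Python lists standing in for tensors; `step_inputs` is a hypothetical helper, not a function from batch.py.

```python
def step_inputs(seq, i):
    """Return (input_knowledge, input_story_completion) for loop step i.

    Assumes each step consumes four consecutive entries of seq:
    knowledge input/output, then story-completion input/output.
    """
    base = 4 * i
    input_knowledge = seq[base] + seq[base + 1]
    input_story_completion = seq[base + 2] + seq[base + 3]
    return input_knowledge, input_story_completion


# Placeholder sequences named after the variables in do_example.
seq = [["i1"], ["o1"], ["i2"], ["o2"], ["i3"], ["o3"], ["i4"], ["o4"]]
for i in range(2):
    print(i, step_inputs(seq, i))
```

Under this reading, step 0 pairs i1 + o1 and i2 + o2, and step 1 pairs i3 + o3 and i4 + o4, which only covers seq[0] through seq[7].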

So does it mean that i1, o1, i3, o3 correspond to

Incomplete Story, Ouput_Effect_S2/Ouput_Cause_S5, Incomplete Story, Ouput_Effect_S3/Ouput_Cause_S5

and i2, o2, i4, o4 to

Incomplete Story, Output_S3, Incomplete Story, Output_S4?

Also, when splitting the example given at line 94 of conceptnet.py (make_tensors) on tabs, the list would be

  1. Incomplete Story(i.e, S1, S2 [SEP] S5) #Effect# S2
  2. Ouput_Effect_S2
  3. Incomplete Story(i.e, S1, S2 [SEP] S5) #Cause# S5
  4. Ouput_Cause_S5
  5. Incomplete Story(i.e, S1, S2 [SEP] S5)
  6. Incomplete Story(i.e, S1, S2 [SEP] S5) [SEP] Ouput_Effect_S2 [SEP] Ouput_Cause_S5
  7. Output_S3
  8. Incomplete Story(i.e, S1, S2 S3 [SEP] S5) #Effect# S3
  9. Ouput_Effect_S3
  10. Incomplete Story(i.e, S1, S2 S3 [SEP] S5) #Cause# S5
  11. Ouput_Cause_S5
  12. Incomplete Story(i.e, S1, S2 S3 [SEP] S5)
  13. Incomplete Story(i.e, S1, S2 S3 [SEP] S5) [SEP] Ouput_Effect_S3 [SEP] Ouput_Cause_S5
  14. Output_S4
  15. S2 +'\t'+ S1 +' '+ S2 +'\t'+ S5+ '\n'

It doesn’t seem to match the 11 sequences the code expects.
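
The mismatch can be checked with a short sketch that rebuilds a line from the items listed above and splits it on tabs. The placeholder strings are hypothetical stand-ins for real sentences. Note that the last item itself contains two tab characters, so if those are real separators a literal split yields 17 fields rather than 15; either count differs from 11.

```python
EXPECTED = 11  # sequences indexed in the quoted masks line (i[0] .. i[10])

# Items 1-14 from the list above, with placeholder sentences.
head = [
    "S1 S2 [SEP] S5 #Effect# S2",
    "Ouput_Effect_S2",
    "S1 S2 [SEP] S5 #Cause# S5",
    "Ouput_Cause_S5",
    "S1 S2 [SEP] S5",
    "S1 S2 [SEP] S5 [SEP] Ouput_Effect_S2 [SEP] Ouput_Cause_S5",
    "Output_S3",
    "S1 S2 S3 [SEP] S5 #Effect# S3",
    "Ouput_Effect_S3",
    "S1 S2 S3 [SEP] S5 #Cause# S5",
    "Ouput_Cause_S5",
    "S1 S2 S3 [SEP] S5",
    "S1 S2 S3 [SEP] S5 [SEP] Ouput_Effect_S3 [SEP] Ouput_Cause_S5",
    "Output_S4",
]
# Item 15: S2 + '\t' + S1 + ' ' + S2 + '\t' + S5 + '\n'
tail = ["S2", "S1 S2", "S5"]

line = "\t".join(head + tail) + "\n"
fields = line.rstrip("\n").split("\t")

print(len(head) + 1)            # item 15 counted as a single entry
print(len(fields))              # literal split on tabs
print(len(fields) == EXPECTED)  # does it match what the code indexes?
```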

id4thomas, Jun 29 '22 14:06