Why is a2-split harder than a1-split?
Hi,
Looking at the a2-split data and verify_split_tests.ipynb, I noticed that 'red' or 'square' must appear in the command.
1. The paper claims the a2-split aims to "test model performance on a novel combination of the target referent’s visual features". What is a novel combination of the target referent's visual features? And what does "the target referent" mean? Do you mean combining OBJ1 and OBJ2?
2. Why is the a2-split harder than the a1-split? The commands in a2 look very similar to the commands in a1. For example,
example in A1-split="pull,the,yellow,square,that,is,in,the,same,row,as,a,big,yellow,circle,and,in,the,same,column,as,a,small,blue,cylinder,while,spinning"
example in A2-split="pull,the,small,green,object,that,is,in,the,same,column,as,the,big,red,square,while,spinning".
In the A1 example, "yellow square" doesn't appear in any training command, but yellow squares do appear in the training set as background objects. Likewise, in the A2 example, "red square" doesn't appear in any training command, but red squares appear in the training set as background objects. The two cases seem the same to me.
Hi,
Thanks for your questions. These are really testing my memory here :). I may recall some details incorrectly, but here are some thoughts on how a1 and a2 differ, and why we hypothesized that a2 is harder than a1 for a reason related to the referent's visual features.
-
Re: 1. a1 tests a novel modifier, which specifically means a novel combination within the modifier NP (i.e., a new color-and-shape combination). As a result, we construct a1 by filtering out all training examples whose command contains the phrase "yellow square" (see the dataset generation code attached; you can find it here). a2, however, does not only test a novel modifier (i.e., we also filter based on the phrase); we add an extra filter: in the training set, none of the actual objects referred to by a command may be a red square. Consider the command "pull the small red object ...": the small red object here can still refer to a red square without the phrase "red square" appearing in the command. We therefore call a2 the novel attribute split, since it tests on never-before-seen visual objects, not just on novel commands. That's why we argue this split is more about visual features. As for the phrase "target referent", I think it simply means the object referred to in the command, in any position.
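If it helps, here is a minimal sketch of how I remember the two training-set filters differing. This is not the actual generation code, and the example structure and field names ("command", "referent") are hypothetical; it just illustrates that a1 only checks the command string, while a2 also checks the referred object itself:

```python
# Hypothetical example structure: a tokenized command plus the actual
# object that the command refers to in the world.

def a1_keep(example):
    """a1 (novel modifier): keep a training example only if its command
    does NOT contain the held-out phrase 'yellow square'."""
    return "yellow square" not in " ".join(example["command"])

def a2_keep(example):
    """a2 (novel attribute): additionally drop any example whose actual
    referent is a red square, even when the command never names it
    directly (e.g. 'the small red object')."""
    phrase_ok = "red square" not in " ".join(example["command"])
    referent = example["referent"]  # e.g. {"color": "red", "shape": "square"}
    referent_ok = not (referent["color"] == "red"
                       and referent["shape"] == "square")
    return phrase_ok and referent_ok

# This command passes an a1-style phrase check (it never says
# "red square"), but the referent IS a red square, so a2 drops it:
ex = {
    "command": ["pull", "the", "small", "red", "object"],
    "referent": {"color": "red", "shape": "square", "size": "small"},
}
```

So at test time, a1 models have still seen red squares as referents under other names during training, while a2 models have never produced an action targeting a red square at all.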
-
Re: 2. I hope the explanation above clarifies the difference between the two splits a little. And again, we didn't know a priori that a2 would be harder than a1; we ran experiments and observed that a2 has lower performance. Maybe this is clearer by now, but let me explain more with the examples you provided:
example in A1-split="pull,the,yellow,square,that,is,in,the,same,row,as,a,big,yellow,circle,and,in,the,same,column,as,a,small,blue,cylinder,while,spinning"
example in A2-split="pull,the,small,green,object,that,is,in,the,same,column,as,the,big,red,square,while,spinning".
These two examples from the test sets are indeed very similar! That is expected, because the difference between the two splits lies in how the training sets are created (i.e., which examples we allow during training). For A1, for example, we probably still allow a yellow square to appear as the referred object in training, as long as it is referred to with a phrase like "yellow object"; for A2, a red square can never be the referent at all. So the two test splits may look similar, but the training sets are constructed such that the splits test different aspects.
Hope this helps! If so, please close this issue. Thanks!