Overlap in dataset splits

Open jjcmoon opened this issue 3 years ago • 0 comments

When looking at the results of make data in a clean clone of the repo, there seems to be a small overlap in the NL descriptions of the train and test datasets (same for train and dev). After investigating, it turns out that an NL description can have multiple corresponding bash commands, and these can get placed in different splits. The code in data/scripts/split_data.py seems to address this in the wrong way: the script checks whether identical bash commands are placed in different splits. This would be appropriate when performing Bash2NL, but not the other way round (NL2Bash), where the NL description is the model input.
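To make the point concrete, here is a minimal sketch of how the overlap could be checked and how a split could be grouped by description instead. It is not the code in split_data.py; the function names, the (nl, cmd) pair representation, and the 80/10/10 ratios are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def nl_overlap(split_a, split_b):
    """Return the NL descriptions that occur in both splits.

    Each split is a list of (nl, cmd) pairs (hypothetical representation).
    """
    return {nl for nl, _ in split_a} & {nl for nl, _ in split_b}


def split_by_description(pairs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (nl, cmd) pairs so that all commands sharing an NL
    description end up in the same split.

    The ratios are an assumption, not the repository's actual settings.
    """
    # Group commands under their description.
    groups = defaultdict(list)
    for nl, cmd in pairs:
        groups[nl].append(cmd)

    # Shuffle the unique descriptions, not the individual pairs.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_train = int(len(keys) * ratios[0])
    n_dev = int(len(keys) * ratios[1])
    buckets = (
        keys[:n_train],
        keys[n_train:n_train + n_dev],
        keys[n_train + n_dev:],
    )
    # Expand each bucket of descriptions back into (nl, cmd) pairs.
    return tuple(
        [(nl, cmd) for nl in bucket for cmd in groups[nl]]
        for bucket in buckets
    )
```

Running nl_overlap on the existing train/test (and train/dev) pairs would show how many descriptions are shared. Splitting on unique descriptions, as in split_by_description, rules out that overlap by construction.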

As the number of descriptions with multiple commands is not that large, the overlap is small, so fixing it should only slightly decrease the reported performance (I guesstimate around 1%, but have not tried). But I figured you might still want to be aware of this.

jjcmoon, Jul 20 '20 13:07