BIFI
bug-report: data collection questions
Hello,
Thank you for your work! I have several questions about the training and data preprocessing.
In the paper, you say that in each round the fixer and the breaker are trained only on the newly generated data. Please see the screenshot and the equations. None of the equations mentions the round-0 dataset on which the initial breaker is trained.
This seemed a bit odd to me, because I would expect the newly generated data to be merged with the existing round-0 data and the models to be trained on the combined dataset. Since the newly generated data is synthetic and therefore biased, it is very likely that the models forget what they learned from the initial data, which contains real-world bugs and fixes.
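To make the expectation concrete, here is a minimal sketch of the kind of merge I had in mind. The function name `merge_rounds` and the representation of a data point as a `(bad_code, good_code)` tuple are my own assumptions for illustration; they are not taken from the paper or the repository.

```python
def merge_rounds(round0_pairs, synthetic_pairs):
    """Combine round-0 pairs with newly generated pairs.

    Hypothetical helper: each pair is assumed to be a hashable
    (bad_code, good_code) tuple. Exact duplicates are dropped and
    the first-seen order is preserved.
    """
    seen = set()
    merged = []
    for pair in list(round0_pairs) + list(synthetic_pairs):
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    return merged
```

With this kind of merge, every round-0 example stays in the training set exactly once, regardless of how much synthetic data is added.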
At first, I thought the equations were wrong, but the text in the paper backs them up as well.
To double-check, I started looking into your code and might have found several inconsistencies.
Indeed, contrary to what the paper describes, you do merge the synthetically generated dataset with the initial dataset. See here: https://github.com/michiyasunaga/BIFI/blob/main/src/c006__generate_paired_data_from_fixer.py#L47-L52
But you do it in a very strange way. You build the dataset for the next step so that 1/3 of it comes from the initial data and the other 2/3 come from the synthetically generated data. See here: https://github.com/michiyasunaga/BIFI/blob/main/src/c006__generate_paired_data_from_fixer.py#L59-L62
What is even stranger is that you duplicate data points. The total size is set to 30,000,000, and the same samples are simply repeated in the dataset to reach it. See here: https://github.com/michiyasunaga/BIFI/blob/main/src/c006__generate_paired_data_from_fixer.py#L58-L62
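As a rough illustration of how I read this behavior, the sketch below builds a fixed-size dataset with a 1/3 to 2/3 split and repeats samples until each share is filled. This is not the code from the repository: the function name, the parameters, and the use of simple cycling (rather than, say, random sampling with replacement) are my assumptions.

```python
from itertools import cycle, islice


def build_mixed_dataset(initial_pairs, synthetic_pairs, total_size,
                        initial_fraction=1 / 3):
    """Fill a dataset of `total_size` items from two sources.

    Hypothetical sketch: `initial_fraction` of the slots are filled from
    `initial_pairs` and the rest from `synthetic_pairs`. When a source has
    fewer items than its share, its items are repeated in order.
    """
    n_initial = round(total_size * initial_fraction)
    n_synthetic = total_size - n_initial
    # cycle() restarts the source when it is exhausted, so items repeat.
    initial_part = list(islice(cycle(initial_pairs), n_initial))
    synthetic_part = list(islice(cycle(synthetic_pairs), n_synthetic))
    return initial_part + synthetic_part
```

For example, with 2 initial pairs, 3 synthetic pairs, and `total_size=9`, the initial share holds 3 items (so one initial pair appears twice) and the synthetic share holds 6 items (so each synthetic pair appears twice). With a target of 30,000,000, the number of copies per sample depends entirely on how small the two sources are relative to their shares.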
Unless I missed something, the paper does not mention any of this, and I do not understand the motivation behind these choices.
Could you please clarify these issues?
Thanks in advance!
Best, Berkay Berabi