[QUESTION]: training dataloader is currently defined with shuffle=False. Is this intentional?

Open coparks2012 opened this issue 3 years ago • 1 comments

Hello,

I have really enjoyed going through the molpal codebase.

While studying the code, I noticed that the default value for shuffle in the MoleculeDataLoader class is False. Currently, train_dataloader used for training the MPN model in the train function of the MPN class in script mpnmodels.py (lines 177-179) is being created with shuffle=False.

train_dataloader = MoleculeDataLoader(
            train_data, self.batch_size, self.ncpu, pin_memory=False
        )

I am wondering if this is intentional? This could potentially lead to overfitting during training. The code below would create the dataloader with shuffle=True.

train_dataloader = MoleculeDataLoader(
         train_data, self.batch_size, self.ncpu, shuffle=True, pin_memory=False
     )

Jun 21 '22 20:06 coparks2012

Hey @coparks2012 - thanks for poring through the code. You're right that it's weird that we don't shuffle the data and that it might cause training instability based on the ordering of the dataset. However, I can tell you that the MPNN code was adapted from the legacy chemprop code and there were quite a few odd design choices in there. When I wrote this, my goal was to simply interface the two codebases with as little changes to chemprop as possible. Of course, it resulted in instances like this, but it seems to work fairly well across a variety of tasks. If we were to change it, we'd have to redo a bunch of the original studies to assess the impact (positive, negative, or neutral) of shuffling on overall optimization performance, and that's just something I've never gotten around to.

Jun 24 '22 15:06 davidegraff