
Trouble reproducing results shown in the paper

Open brfi3983 opened this issue 1 year ago • 6 comments

Overview

For whatever reason, I cannot seem to reproduce the results shown in the paper. I suspect I am doing something specific that is skewing the results, since I cannot match the reported accuracy for UNet, FNO, or CNO. I am using my own setup rather than a clone of the repo, but since I use the same models and the same configuration, I would expect the same results. Unfortunately, my error for every model on the NS case is at minimum 10%, not the ~3% shown in the paper. Below I outline my exact setup; if there is a glaring mistake, I would appreciate any feedback.

Data

Dataset: Navier Stokes 64x64 dataset. (Assume this is the case for everything below)

Data splits: I split the data into training, validation, and testing the same way you do:

import h5py
import numpy as np
import torch
from torchvision import transforms

# Import data
f = h5py.File('../NavierStokes_64x64_IN.h5', 'r')
x = []
y = []
for key in f.keys():
    x.append(f[key]['input'])
    y.append(f[key]['output'])
X = np.array(x)
y = np.array(y)

# Add a channel axis
X = X[:, np.newaxis, ...]
y = y[:, np.newaxis, ...]

# Split data up into train, val, test
X_train, y_train = X[:768], y[:768]
X_valid, y_valid = X[768:768 + 128], y[768:768 + 128]
X_test, y_test = X[768 + 128:768 + 128*2], y[768 + 128:768 + 128*2]

# Transform data (it seems you use this normalization, based on your code)
min_data, max_data = np.min(X_train), np.max(X_train)
#min_model, max_model = np.min(y_train), np.max(y_train) (this is not used - I only apply the transformation on X)

class NormalizeMinMax(torch.nn.Module):
    def __init__(self, img_min, img_max):
        super().__init__()
        self.img_min = img_min
        self.img_max = img_max

    def __call__(self, img):
        # Rescale to [0, 1] using the training-set statistics
        return (img - self.img_min) / (self.img_max - self.img_min)
    
transform = transforms.Compose(
    [
        NormalizeMinMax(min_data, max_data),
    ]
)

# Convert to tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32)

X_valid = torch.tensor(X_valid, dtype=torch.float32)
y_valid = torch.tensor(y_valid, dtype=torch.float32)

X_train = transform(X_train)
X_valid = transform(X_valid)
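
For completeness, the training and validation loaders are then built roughly like this (a sketch, not my exact code; the batch size is whichever per-model value is listed under Hyperparameters below):

from torch.utils.data import TensorDataset, DataLoader

# Pair inputs with targets; shuffle only the training set
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)
valid_loader = DataLoader(TensorDataset(X_valid, y_valid), batch_size=32, shuffle=False)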

Hyperparameters

Here, I use the models supplied in the repository within a PyTorch Lightning setup. Below is some pseudocode explaining the hyperparameter/optimization setup. I apply the LR scheduler per epoch, so after each epoch the LR becomes LR * 0.98.

loss_function = nn.L1Loss() # CNO and UNet
loss_function = nn.SmoothL1Loss() # FNO

batch_size = 32 # CNO and FNO
batch_size = 10 # UNet

optimizer = torch.optim.AdamW(self.parameters(), lr=1e-3, weight_decay=1e-6) # FNO
optimizer = torch.optim.AdamW(self.parameters(), lr=1e-3, weight_decay=1e-10) # CNO
optimizer = torch.optim.AdamW(self.parameters(), lr=5e-4, weight_decay=1e-6) # UNet

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.98) #unet, cno, and fno
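
In Lightning terms, the optimizer and scheduler are wired up roughly like this (a simplified sketch of my setup, not the exact module; the "interval": "epoch" entry is what makes the LR decay by 0.98 once per epoch):

# Inside the LightningModule (sketch; CNO values shown)
def configure_optimizers(self):
    optimizer = torch.optim.AdamW(self.parameters(), lr=1e-3, weight_decay=1e-10)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.98)
    return {
        "optimizer": optimizer,
        "lr_scheduler": {"scheduler": scheduler, "interval": "epoch", "frequency": 1},
    }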

Model Setup

Here are some code snippets showing how I initialize each of the models. Each of them uses the best-performing hyperparameters mentioned in the paper.

# CNO
class CNO2d(nn.Module):
    def __init__(self,
                in_dim = 1,                    # Number of input channels.
                out_dim = 1,                   # Number of output channels.
                size = 64,                     # Input and output spatial size (required)
                N_layers = 3,                  # Number of (D) or (U) blocks in the network
                N_res = 1,                     # Number of (R) blocks per level (except the neck)
                N_res_neck = 8,                # Number of (R) blocks in the neck
                channel_multiplier = 32,       # How the number of channels evolves
                use_bn = False,                # Add BN? We do not add BN in the lifting/projection layers
                ):
# FNO
class FNO2d(nn.Module):
    def __init__(self, in_channels = 1, out_channels = 1, device=None):
        super(FNO2d, self).__init__()

        """
        The overall network. It contains 4 layers of the Fourier layer.
        1. Lift the input to the desired channel dimension by self.fc0 .
        2. 4 layers of the integral operators u' = (W + K)(u).
            W defined by self.w; K defined by self.conv .
        3. Project from the channel space to the output space by self.fc1 and self.fc2 .

        input: the solution of the coefficient function and locations (a(x, y), x, y)
        input shape: (batchsize, x=s, y=s, c=3)
        output: the solution
        output shape: (batchsize, x=s, y=s, c=1)
        """
        self.modes1 = 16 #16
        self.modes2 = 16 #16
        self.width = 128 #64
        self.n_layers = 5 #5
        self.retrain_fno = 4 #4
        self.padding = 0 #0
        self.include_grid = 1 #1
        self.input_dim = in_channels
        self.act  = nn.LeakyReLU() 
        self.device = device
# UNet
class UNet(nn.Module):
    def __init__(self, n_channels = 1, n_classes = 1, bilinear=False):
        super(UNet, self).__init__()
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.bilinear = bilinear

        self.inc = (DoubleConv(n_channels, 64))
        self.down1 = (Down(64, 128))
        self.down2 = (Down(128, 256))
        self.down3 = (Down(256, 512))
        factor = 2 if bilinear else 1
        self.down4 = (Down(512, 1024 // factor))
        self.up1 = (Up(1024, 512 // factor, bilinear))
        self.up2 = (Up(512, 256 // factor, bilinear))
        self.up3 = (Up(256, 128 // factor, bilinear))
        self.up4 = (Up(128, 64, bilinear))
        self.outc = (OutConv(64, n_classes))
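
For reference, I instantiate the models roughly as follows (a sketch, using the arguments shown above):

cno  = CNO2d(in_dim=1, out_dim=1, size=64, N_layers=3, N_res=1, N_res_neck=8, channel_multiplier=32, use_bn=False)
fno  = FNO2d(in_channels=1, out_channels=1)
unet = UNet(n_channels=1, n_classes=1, bilinear=False)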

Error Calculations

Finally, I compute the relative L1 error as described in your paper. I apply the function below to each sample in my test set via a simple enumerated for loop over y_predicted.

X_test = torch.tensor(X_test, dtype=torch.float32)
X_test = transform(X_test)
test_loader = DataLoader(list(X_test), shuffle=False, batch_size=1)
y_predicted = trainer.predict(network, test_loader)

def relative_l1_error(y_pred, y_true):
    # Relative L1 error of a single sample
    return torch.mean(torch.abs(y_pred - y_true)) / torch.mean(torch.abs(y_true))
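
The loop over the test set then looks roughly like this (a sketch; y_test stays on the original scale, matching how the targets were used during training):

y_test = torch.tensor(y_test, dtype=torch.float32)

errors = []
for i, y_pred in enumerate(y_predicted):           # one batch of size 1 per test sample
    errors.append(relative_l1_error(y_pred.squeeze(0), y_test[i]))
mean_rel_l1 = torch.mean(torch.stack(errors))      # average relative L1 error over the test set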

Results

Now come the questions. As explained above, I cannot get below 10% error with any of the models, whereas I am trying to reach the 2.5-3.5% relative L1 errors reported in the paper for CNO, FNO, and UNet.

With the SmoothL1Loss, my training loss goes down to 1e-5 and the validation loss down to 0.005; however, with the L1Loss I cannot seem to break 10%, and the loss just seems to stagnate, even with the given hyperparameters.

All models were trained without early stopping for up to 1000 epochs, but in most cases the loss plateaued well before that, so I stopped training much sooner.

Conclusion

If there is anything in this setup that is clearly an issue, I would appreciate the feedback. If my understanding of any of the results is also incorrect, that would be good to know as well. My main questions concern the model setups, since the code in the repo and the paper were quite different and I was not sure which setup was intended. In this case, I assumed the paper was the correct setup and followed each table regardless of the code in the repo. Thank you very much for taking the time to read this far - I hope the fix is quite simple and that it was just a mistake on my part - but regardless, I am interested to hear what you all say. Have a great rest of your day :)

brfi3983 avatar Jul 24 '24 15:07 brfi3983

Hi @brfi3983, I'm not an expert but I can try to help you find the bug...

  • The first thing I notice is the scheduler... Personally, I would try increasing the step_size (maybe 10 is a better value than 1).

  • In the test loader it can be helpful to increase the batch_size (maybe 32, depending on the specs of your computer), but this does not affect your results, only the computation time.

Let me know if this can be helpful for you.

MaxGhi8 avatar Jul 26 '24 15:07 MaxGhi8

Hi @MaxGhi8 ,

Thanks for the response.

  1. So, I am following the paper, where they mention that the learning rate is multiplied by a factor $\gamma$ every epoch. In that sense, I use $\gamma=0.98$, and since I am using PyTorch Lightning, the scheduler is configured to step once per epoch. I can configure it to look more like their code and try a step_size of 10 or 15, but I am just trying to follow their best-hyperparameter table to reproduce errors of ~3% on NS.
  2. The batch size of 1 is just for convenience, as my script iterates through the loader - but as you said, it does not affect the results - so I am more concerned about my training setup.

Are you able to get 3% on any of the models with the NS dataset (1024 samples)? If so, would you mind sharing your hyperparameter setup?

Thanks again!

brfi3983 avatar Jul 30 '24 08:07 brfi3983

Hi @brfi3983, I agree with your hyperparameter setup... The only thing that seems to be different is the normalization of the output functions. It looks like you do not perform any normalization on the output, whereas the authors apply min-max scaling to the output as well. Let me know if this resolves your issue...

MaxGhi8 avatar Jul 30 '24 10:07 MaxGhi8

I agree with what @MaxGhi8 mentioned and have nothing more to add. Perhaps you could try using the exact loaders and evaluation metrics that we use. I don't see any issues with your code. Please let me know if @MaxGhi8's suggestions were helpful.

bogdanraonic3 avatar Aug 06 '24 15:08 bogdanraonic3

Hi @MaxGhi8 and @bogdanraonic3,

So I was initially confused by the suggestion about the scaling, as I had indeed scaled my test set using the parameters from the training set. However, while drafting a reply to this suggestion, I realized what you meant, and I think that might have been the issue.

More specifically, I believe you were referring to this line: https://github.com/camlab-ethz/ConvolutionalNeuralOperator/blob/acb7c7c3970d0d738350f04b5313540d7576e10e/CNO2d_original_version/Problems/FNOBenchmarks.py#L146

I then reran my model and am now getting about a 4% relative-L1 test error (after 200 epochs and some slight overfitting), which seems to be close enough to the 3.57% error shown in the paper (for the FNO example, at least).
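
Concretely, the change amounted to something like the following (a rough sketch, not my exact code; it mirrors the input transform above, uses the min_model/max_model line I had originally commented out, and is applied before the tensor conversion):

# Min-max statistics of the training *outputs*, computed on the train split only
min_model, max_model = np.min(y_train), np.max(y_train)
target_transform = NormalizeMinMax(min_model, max_model)

# Scale the targets the same way the inputs are scaled
y_train = target_transform(y_train)
y_valid = target_transform(y_valid)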

To ask a follow-up question: I have always been used to scaling my input variables for better optimization, but I was under the impression that it was not really necessary to also scale the labels, since, if the model learns properly, it is just mapping to a different scale of outputs, regardless of whether the labels are in one unit/scale or another. Is there an intuitive explanation for this? Is there a reason why this makes such a massive difference? My intuition says that by scaling the labels you change the underlying manifold the network has to map onto, which might make it easier to find optima, but this is just a guess. Is scaling the labels a common practice in CV/SciML, or is it a new paradigm with NOs because of some specific model assumption?

Thanks again for all the support!

brfi3983 avatar Aug 09 '24 15:08 brfi3983

Hi, I'm thinking of exactly the line that you marked! I have the same intuition as you about scaling the data. Indeed, if you think of linear regression and suppose you have two inputs, with one feature ranging over [0, 1000] and the other over [0, 0.001], then learning can be hard because the data are strongly compressed in one direction, and small changes in the network's parameters can drastically change the results. By similar reasoning, you have to scale the output as well. The network's parameters are typically initialized with zero mean, so if your labels take values in [100, 1000] the gradients can be huge, which is why you have to normalize the labels too. Also, if you plot the loss during training, I expect that without label normalization it has a very irregular shape, whereas with normalization it is more regular and without large jumps. I have always seen this practice in the context of NOs, but even in standard ML or deep learning I think it is good practice.

Of course, the normalization parameters (min-max or mean-std) have to be computed on the train set, and the same values used for the test set (as the authors do in this repository).
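
For example (a small sketch, reusing the min-max statistics computed on the training outputs), you can map the network's prediction back to the original scale before computing the error against the unnormalized test targets:

# Invert the output min-max scaling on a prediction
y_pred_phys = y_pred * (max_model - min_model) + min_model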

As a final rule of thumb, I encourage you to give a lot of importance to the data and not only to the model; this can be a real time-saver.

Let me know if this answers your question, and feel free to ask if you have any other questions or curiosities. Have a nice day!

MaxGhi8 avatar Aug 12 '24 12:08 MaxGhi8

Hi,

Yes, I was able to play around with the normalization on the input as well as the output and see its effects. The gradient-propagation argument also makes sense - I think I was previously used to classification problems, which is why I had that preconceived notion about not normalizing the labels.

In any case, I was able to test different normalization combinations (channel-wise, assuming the min is 0 or the mean is 0, and switching to standardization), and all seem to be working as intended. As such, I will be closing this issue, as I feel satisfied and appreciate all the feedback and help received :)

Have a good day too!

brfi3983 avatar Aug 20 '24 11:08 brfi3983