skillful_nowcasting
Trying to execute run.py in the train folder raises an error
Hi, really great and helpful code! I was trying to run train.py on the nimrod-uk-1km-test data and encountered the following error: "RuntimeError: Serialization of parametrized modules is only supported through state_dict()." I found a related report on PyTorch's issue tracker and downgraded torch to v1.12.0, but the error did not go away. PyTorch issue: https://github.com/pytorch/pytorch/issues/69413
Could you help debug this issue? I am planning to use this on another dataset.

Hi, are you using multiple GPUs? By default, run.py tries to use 6 GPUs, although it should be changed to 1. The spectrally normalized layers in PyTorch don't seem to work in a multi-GPU setting, as far as I have been able to get them to. So if you change it to 1 GPU, it should start training.
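For reference, a minimal sketch of what the single-GPU change might look like, assuming run.py builds a PyTorch Lightning Trainer; the exact arguments in the script may differ:

```python
# Sketch only: assumes run.py uses a PyTorch Lightning Trainer.
# Restricting training to a single GPU avoids the spectral-norm
# multi-GPU issue described above.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",   # use "cpu" if no GPU is available
    devices=1,           # single GPU instead of the default 6
)
# trainer.fit(model, datamodule)  # model/datamodule as defined in run.py
```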
I was earlier using CPUs; to sort out the issue I started using 1 GPU, but training fills up to 200 GB of virtual memory (my system's limit) and the dataloader worker is killed. Can you suggest a way to get around this?
I met the same issue: memory keeps increasing up to 256 GB during data loading until the process is killed by the system. Any solution for this?
Update: my problem was solved by setting streaming=True in TFDataset for my own dataset, as follows. This way the data is not loaded into memory up front.
```python
class TFDataset(torch.utils.data.dataset.Dataset):
    def __init__(self, data_path, split):
        super().__init__()
        # self.reader = load_dataset(
        #     "openclimatefix/nimrod-uk-1km", "sample", split=split, streaming=True
        # )
        self.reader = load_dataset(data_path, split=split, streaming=True)
```
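To show how such a streaming-backed dataset could be consumed, here is a hedged sketch of the remaining methods and a DataLoader; the "radar_frames" field name, the 4-frame input split, and the epoch length are assumptions about the dataset layout, not the repository's exact code:

```python
# Sketch only: field name "radar_frames" and the input/target split below
# are assumptions, not the repository's exact layout.
import numpy as np
import torch
from datasets import load_dataset


class TFDataset(torch.utils.data.dataset.Dataset):
    def __init__(self, data_path, split):
        super().__init__()
        # streaming=True: rows are fetched lazily instead of loading the
        # whole split into memory up front.
        self.reader = load_dataset(data_path, split=split, streaming=True)
        self.iter_reader = iter(self.reader)

    def __len__(self):
        # The streaming reader has no cheap length; pick a nominal epoch size.
        return 1000

    def __getitem__(self, item):
        try:
            row = next(self.iter_reader)
        except StopIteration:
            # Restart the stream once it is exhausted.
            self.iter_reader = iter(self.reader)
            row = next(self.iter_reader)
        frames = np.asarray(row["radar_frames"], dtype=np.float32)
        return frames[:4], frames[4:]  # (conditioning frames, target frames)


# Usage: keep num_workers low, since each worker opens its own stream.
loader = torch.utils.data.DataLoader(
    TFDataset("path/to/dataset", "train"), batch_size=1, num_workers=1
)
```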