skillful_nowcasting

Trying to execute run.py in the train folder raises an error

Open bhardwaj-garvit opened this issue 2 years ago • 4 comments

Hi, really great and helpful code! I was trying to run train.py on the nimrod-uk-1km test data and encountered the following error: "RuntimeError: Serialization of parametrized modules is only supported through state_dict()." I searched and found a related PyTorch issue, so I downgraded torch to v1.12.0, but the error did not go away. PyTorch issue: https://github.com/pytorch/pytorch/issues/69413

Could you help debug this issue? I am planning to use this on another dataset.

[Screenshot: error traceback, 2023-02-01]

**To Reproduce**
Steps to reproduce the behavior:
1. Install dependencies
2. Execute train/run.py; the above error appears in the terminal
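For context, the error comes from a PyTorch limitation: modules wrapped with parametrizations (such as spectral norm) cannot be pickled directly. A minimal sketch, not tied to this repo, that illustrates the limitation (layer shapes and file names are illustrative):

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

# A spectrally normalized layer is a "parametrized" module.
layer = spectral_norm(nn.Conv2d(3, 16, kernel_size=3))

# Pickling the module object directly raises:
# RuntimeError: Serialization of parametrized modules is only supported through state_dict()
# torch.save(layer, "layer.pt")

# Saving the state_dict is the supported path:
torch.save(layer.state_dict(), "layer_state.pt")
```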

bhardwaj-garvit avatar Feb 01 '23 11:02 bhardwaj-garvit

Hi, are you using multiple GPUs? By default run.py tries to use 6 GPUs, although it should be changed to 1. The spectrally normalized layers in PyTorch don't seem to work in a multi-GPU setting, as far as I have been able to get them to. So if you change it to 1 GPU, it should start training.
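A minimal sketch of what that change might look like, assuming run.py builds a PyTorch Lightning Trainer (the exact argument names depend on your Lightning version):

```python
import pytorch_lightning as pl

# Hypothetical: restrict training to a single GPU instead of 6.
# Older Lightning versions use gpus=1; newer ones use accelerator/devices.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,        # was 6 by default in run.py
    max_epochs=1000,  # illustrative value
)
```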

jacobbieker avatar Feb 01 '23 11:02 jacobbieker

I was earlier using CPUs; to work around the issue I started using 1 GPU, but training fills virtual memory up to 200 GB (my system's limit) and the dataloader worker is killed. Can you suggest a way to get around this?

bhardwaj-garvit avatar Feb 05 '23 18:02 bhardwaj-garvit

I met the same issue: memory keeps increasing to 256 GB during data loading until the process gets killed by the system. Is there any solution for this?

Chevolier avatar May 02 '24 14:05 Chevolier

> I met the same issue: memory keeps increasing to 256 GB during data loading until the process gets killed by the system. Is there any solution for this?

Update: my problem was solved by setting streaming=True in TFDataset for my own dataset, as follows. This way the data is not loaded into memory up front.

```python
import torch
from datasets import load_dataset

class TFDataset(torch.utils.data.dataset.Dataset):
    def __init__(self, data_path, split):
        super().__init__()
        # self.reader = load_dataset(
        #     "openclimatefix/nimrod-uk-1km", "sample", split=split, streaming=True
        # )
        self.reader = load_dataset(data_path, split=split, streaming=True)
```
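Note that with streaming=True, load_dataset returns an iterable dataset, so samples have to be consumed by iteration rather than random access. A sketch of how the class might adapt (the class name and method bodies are illustrative, not the repo's actual code):

```python
import torch
from datasets import load_dataset

class StreamingTFDataset(torch.utils.data.IterableDataset):
    """Illustrative variant: streaming HF datasets don't support indexing,
    so expose samples through __iter__ instead of __getitem__."""

    def __init__(self, data_path, split):
        super().__init__()
        self.reader = load_dataset(data_path, split=split, streaming=True)

    def __iter__(self):
        # Rows are fetched lazily from disk/network as the loader iterates.
        for row in self.reader:
            yield row
```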

Chevolier avatar May 06 '24 07:05 Chevolier