
Questions about CorrDiff example

Open GexLoong opened this issue 1 year ago • 2 comments

My machine got an "out of memory" error (see attached image).

I have changed "total_batch_size" to 2, used "amp-fp16", and used ten days of data ('2018-01-01' to '2018-01-10').

Are there any other settings I can modify to reduce the model's memory usage? And what does "training_duration: 200000000" mean?

GexLoong avatar Nov 06 '24 08:11 GexLoong

I have solved the problem. My machine only supports a "batch_size_per_gpu" of 1. However, what does "training_duration: 200000000" mean? What does it primarily affect?

GexLoong avatar Nov 06 '24 09:11 GexLoong
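For anyone hitting the same OOM: in typical data-parallel training the global batch is split across GPUs, so the per-GPU value is what determines peak memory on each device. A minimal sketch of that relationship using the numbers from this thread (the GPU count and the exact way the CorrDiff example splits the batch are assumptions for illustration):

```python
# Minimal sketch, not CorrDiff code: illustrates how the two batch-size
# values discussed above typically relate in data-parallel training.
total_batch_size = 2   # global batch size set in the thread
num_gpus = 2           # hypothetical GPU count (assumption for illustration)

# The per-GPU batch is what actually has to fit in each device's memory.
batch_size_per_gpu = total_batch_size // num_gpus  # = 1, the largest that fit here

print(f"Each GPU processes {batch_size_per_gpu} sample(s) per optimizer step.")
```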

@MyGitHub-G "training_duration" is the total number of (repeated) samples/images the model sees during training. Dividing it by the number of unique samples in the dataset gives you the number of epochs.

mnabian avatar Nov 08 '24 00:11 mnabian
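
To make the epoch relationship concrete, a small arithmetic sketch (the number of unique samples below is a made-up value, not the actual CorrDiff dataset size):

```python
# Illustrative arithmetic only; unique_samples is a hypothetical dataset size.
training_duration = 200_000_000  # total (repeated) samples seen during training
unique_samples = 3_650           # made-up number of unique samples in the dataset

epochs = training_duration / unique_samples
print(f"Approximate number of epochs: {epochs:,.0f}")  # ~54,795 with these numbers
```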