
Initial (and throughout) training accuracy depends on number of Epochs

mz11235 opened this issue · 2 comments

Hi,

I am trying TTv2 training for LeNet. I only modified example 6 (/examples/06_lenet5_hardware_aware.py) to use a different RPU configuration, changing RPU_CONFIG = InferenceRPUConfig() to RPU_CONFIG = TTv2ReRamESPreset().

For simplicity, I also commented out the following noise settings:

#RPU_CONFIG.forward.w_noise_type = WeightNoiseType.ADDITIVE_CONSTANT
#RPU_CONFIG.forward.w_noise = 0.02
#RPU_CONFIG.noise_model = PCMLikeNoiseModel(g_max=25.0)
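Put together, the modified configuration section looks roughly like this (the import path for the preset is my assumption based on the aihwkit module layout; the rest of example 06 is unchanged):

```python
# Sketch of the modified configuration in examples/06_lenet5_hardware_aware.py.
# The import path follows the aihwkit layout to the best of my knowledge.
from aihwkit.simulator.presets import TTv2ReRamESPreset

# Previously: RPU_CONFIG = InferenceRPUConfig()
RPU_CONFIG = TTv2ReRamESPreset()

# Inference-time noise settings from the original example, commented out:
# RPU_CONFIG.forward.w_noise_type = WeightNoiseType.ADDITIVE_CONSTANT
# RPU_CONFIG.forward.w_noise = 0.02
# RPU_CONFIG.noise_model = PCMLikeNoiseModel(g_max=25.0)
```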

Keeping everything else the same, I ran N_EPOCHS = 30 and N_EPOCHS = 100. With N_EPOCHS = 30, the accuracy increases each epoch. With N_EPOCHS = 100, the network does not improve; the accuracy stays stuck between 8% and 12% from epoch 0 to 100.

Here are the training results for N_EPOCHS = 30; you can see that the accuracy increases (from 22% to 30% within 4 epochs). [training log screenshot]

However, for N_EPOCHS = 100 the training accuracy seems stuck at ~10%. [training log screenshot]

It's strangely dependent on the number of epochs.

mz11235 · Aug 13 '22

Hi @mz11235, thanks for raising the issue. I have tried to reproduce it, but for me it also trains with N_EPOCHS = 100.

09:40:30 --- Started LeNet5 Training
09:40:55 --- Epoch: 0	Train loss: 2.5871	Valid loss: 2.4854	Test error: 82.81%	Accuracy: 17.19%	
09:41:17 --- Epoch: 1	Train loss: 2.6665	Valid loss: 2.2184	Test error: 78.84%	Accuracy: 21.16%	
09:41:40 --- Epoch: 2	Train loss: 2.5845	Valid loss: 2.7020	Test error: 78.29%	Accuracy: 21.71%	
09:42:04 --- Epoch: 3	Train loss: 2.6430	Valid loss: 2.7539	Test error: 78.36%	Accuracy: 21.64%	
09:42:28 --- Epoch: 4	Train loss: 2.5966	Valid loss: 2.3580	Test error: 73.53%	Accuracy: 26.47%	
09:42:52 --- Epoch: 5	Train loss: 2.6571	Valid loss: 2.7524	Test error: 75.69%	Accuracy: 24.31%	
09:43:16 --- Epoch: 6	Train loss: 2.5773	Valid loss: 2.7421	Test error: 76.18%	Accuracy: 23.82%	
09:43:40 --- Epoch: 7	Train loss: 2.5027	Valid loss: 2.5139	Test error: 74.41%	Accuracy: 25.59%	
09:44:05 --- Epoch: 8	Train loss: 2.5770	Valid loss: 2.9077	Test error: 75.58%	Accuracy: 24.42%	
09:44:29 --- Epoch: 9	Train loss: 2.5542	Valid loss: 2.4327	Test error: 74.67%	Accuracy: 25.33%	
09:44:53 --- Epoch: 10	Train loss: 2.4849	Valid loss: 2.2281	Test error: 70.37%	Accuracy: 29.63%	
09:45:17 --- Epoch: 11	Train loss: 2.4737	Valid loss: 2.6781	Test error: 74.14%	Accuracy: 25.86%	
09:45:42 --- Epoch: 12	Train loss: 2.4017	Valid loss: 2.2977	Test error: 69.21%	Accuracy: 30.79%	

However, notice that each time you run the experiment, the analog devices are drawn anew (you can fix the draw by specifying the construction_seed). If some devices are too extreme, TTv2 might fail and get stuck at chance level; I have seen that before, in particular when the hyper-parameters are not optimized. So what you are seeing is likely an effect of random initialization together with hyper-parameters that are not optimized for that particular device. You could run it a couple of times for 100 epochs and for 30 epochs, look at the distribution, and check whether it always fails in one setting or only occasionally. Note that the number of epochs is never passed to the internal device computations, so I cannot see how the training could depend on that value.
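For reference, a minimal sketch of fixing the seed (the attribute path below is an assumption: the TTv2 preset wraps a compound device built from several unit-cell sub-devices, so the seed may need to be set on each of them):

```python
# Sketch of pinning the random draw of the analog device parameters so that
# repeated runs construct the same devices. The attribute layout is an
# assumption based on how the TTv2 presets compose their compound device.
from aihwkit.simulator.presets import TTv2ReRamESPreset

RPU_CONFIG = TTv2ReRamESPreset()

for sub_device in RPU_CONFIG.device.unit_cell_devices:
    sub_device.construction_seed = 2022  # same device draw on every run
```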

maljoras · Aug 15 '22

I see, that makes sense; it could be the construction_seed or the hyper-parameters. Thanks for the tip. I had missed that the hyper-parameters need to be optimized for a different device / weight-update behavior. Let me run a few more experiments and update here.
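Roughly along these lines (run_training is a placeholder for the training routine in examples/06_lenet5_hardware_aware.py, not an aihwkit function):

```python
# Hypothetical driver around the example's training code: run a few trials per
# epoch setting and compare the resulting accuracy distributions.
# run_training(n_epochs, seed) is a placeholder that trains the LeNet5 model
# for n_epochs with the given construction seed and returns final test accuracy.
results = {}
for n_epochs in (30, 100):
    results[n_epochs] = [run_training(n_epochs=n_epochs, seed=trial) for trial in range(5)]

print(results)
```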

mz11235 · Aug 15 '22

I am closing this issue for now. Please re-open if you find that the problem persists.

maljoras · Oct 06 '22