Initial (and throughout) training accuracy depends on number of Epochs
Hi,
I am trying TTv2 training for LeNet5. I only modified example 6 (/examples/06_lenet5_hardware_aware.py) with a different RPU configuration, changing
RPU_CONFIG = InferenceRPUConfig()
to
RPU_CONFIG = TTv2ReRamESPreset()
and for simplicity commented out
#RPU_CONFIG.forward.w_noise_type = WeightNoiseType.ADDITIVE_CONSTANT
#RPU_CONFIG.forward.w_noise = 0.02
#RPU_CONFIG.noise_model = PCMLikeNoiseModel(g_max=25.0)
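For reference, a minimal sketch of the modified configuration (the import path follows the aihwkit presets module and may differ slightly between versions):
from aihwkit.simulator.presets import TTv2ReRamESPreset

# Replace the InferenceRPUConfig() used in example 6 with the TTv2 ReRAM preset.
RPU_CONFIG = TTv2ReRamESPreset()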
Keeping everything else the same, I ran with N_EPOCHS = 30 and N_EPOCHS = 100. With N_EPOCHS = 30 the training accuracy increases every epoch, but with N_EPOCHS = 100 the network does not improve at all; the accuracy stays stuck between 8-12% from epoch 0 to 100.
Here are the training results for N_EPOCHS = 30, where you can see the accuracy increasing (from 22% to 30% within 4 epochs). For N_EPOCHS = 100, however, the training accuracy seems stuck at ~10%.
It is strangely dependent on the number of epochs.
Hi @mz11235,
thanks for raising the issue. I have tried to reproduce it, but for me it also trains for n_epoch=100.
09:40:30 --- Started LeNet5 Training
09:40:55 --- Epoch: 0 Train loss: 2.5871 Valid loss: 2.4854 Test error: 82.81% Accuracy: 17.19%
09:41:17 --- Epoch: 1 Train loss: 2.6665 Valid loss: 2.2184 Test error: 78.84% Accuracy: 21.16%
09:41:40 --- Epoch: 2 Train loss: 2.5845 Valid loss: 2.7020 Test error: 78.29% Accuracy: 21.71%
09:42:04 --- Epoch: 3 Train loss: 2.6430 Valid loss: 2.7539 Test error: 78.36% Accuracy: 21.64%
09:42:28 --- Epoch: 4 Train loss: 2.5966 Valid loss: 2.3580 Test error: 73.53% Accuracy: 26.47%
09:42:52 --- Epoch: 5 Train loss: 2.6571 Valid loss: 2.7524 Test error: 75.69% Accuracy: 24.31%
09:43:16 --- Epoch: 6 Train loss: 2.5773 Valid loss: 2.7421 Test error: 76.18% Accuracy: 23.82%
09:43:40 --- Epoch: 7 Train loss: 2.5027 Valid loss: 2.5139 Test error: 74.41% Accuracy: 25.59%
09:44:05 --- Epoch: 8 Train loss: 2.5770 Valid loss: 2.9077 Test error: 75.58% Accuracy: 24.42%
09:44:29 --- Epoch: 9 Train loss: 2.5542 Valid loss: 2.4327 Test error: 74.67% Accuracy: 25.33%
09:44:53 --- Epoch: 10 Train loss: 2.4849 Valid loss: 2.2281 Test error: 70.37% Accuracy: 29.63%
09:45:17 --- Epoch: 11 Train loss: 2.4737 Valid loss: 2.6781 Test error: 74.14% Accuracy: 25.86%
09:45:42 --- Epoch: 12 Train loss: 2.4017 Valid loss: 2.2977 Test error: 69.21% Accuracy: 30.79%
However, notice that each time you run the experiment, the analog devices are drawn anew (you can fix them by specifying the construction_seed). If some devices are too extreme, TTv2 might fail and get stuck at chance level; I have seen that before, in particular when the hyper-parameters are not optimized. So what you are seeing is likely an effect of the random device initialization combined with hyper-parameters that are not optimized for that particular device, rather than of the number of epochs. You could try running both the 30-epoch and the 100-epoch setting a couple of times and look at the distribution of results, and at whether it always fails in one setting or only occasionally. Note, however, that the epoch count is never passed to the internal device computations, so I cannot see how the training could depend on that value.
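For example, to pin the device realization you could set the construction_seed before the analog layers are built. A rough sketch (whether the TTv2 preset exposes the seed on the compound device or on its unit-cell devices is an assumption here and may depend on the aihwkit version):
RPU_CONFIG = TTv2ReRamESPreset()

# Fix the random draw of the analog device parameters so every run starts from
# the same device realization (assumes the preset's compound device exposes its
# unit-cell devices via device.unit_cell_devices, each with a construction_seed).
for dev in RPU_CONFIG.device.unit_cell_devices:
    dev.construction_seed = 2023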
I see, yeah that makes sense; it could be the construction_seed or the hyper-parameters. Thanks for the tip; I had missed that the hyper-parameters need to be optimized for a different device / weight-update behavior. Let me run a few more experiments and update here.
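Roughly, my plan is to repeat the run for a few construction_seed values and both epoch settings and compare the accuracy distributions; train_and_evaluate below is just a placeholder for the training loop of example 6, not an aihwkit function:
results = {}
for seed in (1, 2, 3, 4, 5):
    for n_epochs in (30, 100):
        rpu_config = TTv2ReRamESPreset()
        # Assumption: the preset's unit-cell devices expose construction_seed.
        for dev in rpu_config.device.unit_cell_devices:
            dev.construction_seed = seed
        # Placeholder for example 6's training/evaluation loop.
        accuracy = train_and_evaluate(rpu_config, n_epochs=n_epochs)
        results[(seed, n_epochs)] = accuracy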
I am closing this issue for now. Please re-open if you find that the problem persists.