Error When Shifting Twice

Open mostafaelhoushi opened this issue 2 years ago • 7 comments

Copying the question by @mengjingyouling from this issue to create a new issue:

We would also like to discuss a question with you. In your paper, the shift network is applied to classification networks, not to object detection. What do you think about using it for detection? Would the accuracy decline?

Because using a single shift leads to some accuracy loss, we want to shift twice to address it. For example: 10 = 8 + 2 (shift by 3 bits + shift by 1 bit). We therefore modified the code as follows:

def get_shift_and_sign(x, rounding='deterministic'):
  sign = torch.sign(x)
  x_abs = torch.abs(x)
  shift1 = round(torch.log(x_abs) / np.log(2), rounding)
  wr1 = 2 ** shift1
  w1 = x_abs-wr1
  shift2 = round(torch.log(w1) / np.log(2), rounding)
  return shift1,shift2, sign

def round_power_of_2(x, rounding='deterministic'):

  shift1,shift2,sign = get_shift_and_sign(x, rounding)
  x_rounded = (2.0 ** shift1+2.0 ** shift2) * sign
  return x_rounded

However, the input to the forward function of class Conv2dShiftQ(_ConvNdShiftQ) then becomes NaN, which we suspect is caused by data overflow:

class Conv2dShiftQ(_ConvNdShiftQ):
... ....
... ...

  #@weak_script_method
  def forward(self, input):
    print("--------------------------------------forward---------------------------------------------------")
    print("input======",input)

Can you give some suggestions to solve it? Thank you very much.
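
One likely source of the NaN, separate from overflow: with deterministic rounding, 2 ** shift1 can round above x_abs (3 rounds to 4, for example), so the residual w1 = x_abs - wr1 is negative, and torch.log of a negative number is NaN. The sketch below is one possible way around that, assuming a signed second term is acceptable. The names two_shift_decompose, sign2 and eps are illustrative and not part of DeepShift, and torch.round stands in for the repository's round(x, rounding) helper in deterministic mode only.

```python
import torch

def two_shift_decompose(x, eps=1e-12):
    """Approximate x by sign * (2**shift1 + sign2 * 2**shift2). Hypothetical helper."""
    sign = torch.sign(x)
    x_abs = torch.abs(x)
    # First term: nearest power of two in the log domain (the clamp avoids log2(0)).
    shift1 = torch.round(torch.log2(x_abs.clamp(min=eps)))
    residual = x_abs - 2.0 ** shift1
    # The residual is negative whenever the first term rounded up, so its sign
    # is kept separately and only its magnitude goes through the logarithm.
    sign2 = torch.sign(residual)
    shift2 = torch.round(torch.log2(residual.abs().clamp(min=eps)))
    approx = sign * (2.0 ** shift1 + sign2 * 2.0 ** shift2)
    return shift1, shift2, sign, sign2, approx

x = torch.tensor([10.0, 3.0, -6.0, 8.0])

# Residual as computed in the snippet from the question, for comparison.
naive_residual = x.abs() - 2.0 ** torch.round(torch.log2(x.abs()))
print(torch.log(naive_residual))   # NaN wherever the residual is negative

shift1, shift2, sign, sign2, approx = two_shift_decompose(x)
print(approx)
```

Under these assumptions 10 decomposes as 8 + 2 and 3 as 4 - 1. Whether a subtractive second term fits the intended shift hardware is a separate design question.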

mostafaelhoushi avatar Apr 21 '22 12:04 mostafaelhoushi

There is an ICLR paper named APoT (Additive Powers of Two) that did something similar (sum of 2 shifts). Do you want to try their code? https://github.com/yhhhli/APoT_Quantization

Hopefully, their code should work because they did some other things (like adding a clip and weight norm) that may avoid this NaN that you got.

Also, note that APoT quantizes both weights and activations, while we only quantize weights.

mostafaelhoushi avatar Apr 21 '22 12:04 mostafaelhoushi

Also, if you are interested, I think adding weight normalization (before calling round_power_of_2(...)) to my DeepShift code might solve the problem of NaN. You can simply do weight normalization by:

        # weight normalization
        mean = self.weight.mean()
        std = self.weight.std()
        weight = self.weight.add(-mean).div(std)

        # call round_power_of_2(...) on weight

mostafaelhoushi avatar Apr 21 '22 12:04 mostafaelhoushi

I'll try. Thank you very much.

mengjingyouling avatar Apr 21 '22 12:04 mengjingyouling

I have some questions about your code and look forward to your answer:

class Conv2dShiftQ(_ConvNdShiftQ): .... ..... .... ....

def forward(self, input):
    self.weight.data = ste.clampabs(self.weight.data, 2**self.shift_range[0], 2**self.shift_range[1])     
    weight_q = ste.round_power_of_2(self.weight, self.rounding)
    input_fixed_point = ste.round_fixed_point(input, self.act_integer_bits, self.act_fraction_bits)

1. Why apply clampabs to restrict the weight to a certain range? The weight should be able to take any real value. Why is the range of weights (-1 * (2**(weight_bits - 1) - 1), 0)?

2. Why do the activations need to be processed, and what do self.act_integer_bits and self.act_fraction_bits mean?

Thank you very much!
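
As one reading of the snippet above, and not a statement of DeepShift's actual implementation: the tuple in question looks like a range of shift exponents rather than of weights, so a weight magnitude would be limited to between 2**shift_range[0] and 2**shift_range[1] before its logarithm is rounded. Likewise, act_integer_bits and act_fraction_bits read like the integer and fractional bit counts of a signed fixed-point format for the activations. A plain-Python sketch under those assumptions, where all three helper names are hypothetical:

```python
import math

def shift_range(weight_bits):
    # Assumed form of the range quoted in the question: exponents from
    # -(2**(weight_bits - 1) - 1) up to 0.
    return (-(2 ** (weight_bits - 1) - 1), 0)

def clamp_abs(w, lo, hi):
    # Clamp the magnitude of w to [lo, hi] while keeping its sign.
    # Zero is left at zero here; a real implementation may choose differently.
    if w == 0:
        return 0.0
    return math.copysign(min(max(abs(w), lo), hi), w)

def round_fixed_point(a, integer_bits, fraction_bits):
    # Signed fixed point with step 2**-fraction_bits, clipped to
    # [-2**(integer_bits - 1), 2**(integer_bits - 1) - step] (an assumed range).
    step = 2.0 ** -fraction_bits
    bound = 2.0 ** (integer_bits - 1)
    return min(max(round(a / step) * step, -bound), bound - step)

lo_exp, hi_exp = shift_range(5)
print(lo_exp, hi_exp)                                    # -15 0
print(clamp_abs(3.7, 2.0 ** lo_exp, 2.0 ** hi_exp))      # 1.0
print(clamp_abs(-1e-9, 2.0 ** lo_exp, 2.0 ** hi_exp))    # smallest allowed magnitude, negative
print(round_fixed_point(0.3, 4, 2))                      # 0.25
```

Read this way, the clamp keeps every exponent inside what weight_bits can encode, and the activation rounding emulates fixed-point inputs to the shift operations.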

mengjingyouling avatar Apr 24 '22 03:04 mengjingyouling

Let me add the details of our experiment.

1. We tried adding the weight normalization code to the forward function of Conv2dShiftQ.

class Conv2dShiftQ(_ConvNdShiftQ): .... .... ... ....

#@weak_script_method
def forward(self, input):


    # added: weight normalization
    mean = self.weight.data.mean()
    std = self.weight.data.std()
    self.weight.data = self.weight.data.add(-mean).div(std)
    self.weight.data = ste.clampabs(self.weight.data, 2**self.shift_range[0], 2**self.shift_range[1])

    weight_q = ste.round_power_of_2(self.weight, self.rounding)

.....

An error occurred:

Traceback (most recent call last):
  File "train.py", line 667, in <module>
    main(opt)
  File "train.py", line 564, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 385, in train
    callbacks.run('on_train_batch_end', ni, model, imgs, targets, paths, plots, opt.sync_bn)
  File "/home/ubuntu/zj/yolov3/utils/callbacks.py", line 76, in run
    logger['callback'](*args, **kwargs)
  File "/home/ubuntu/zj/yolov3/utils/loggers/__init__.py", line 89, in on_train_batch_end
    self.tb.add_graph(torch.jit.trace(de_parallel(model), imgs[0:1], strict=False), [])
  File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py", line 750, in trace
    _module_class,
  File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py", line 991, in trace_module
    _module_class,
  File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py", line 526, in _check_trace
    raise TracingCheckError(*diag_info)
torch.jit._trace.TracingCheckError: Tracing failed sanity checks!
ERROR: Tensor-valued Constant nodes differed in value across invocations. This often indicates that the tracer has encountered untraceable code.
	Node:
		%input.1 : Tensor = prim::Constant[value=<Tensor>](), scope: __module.model.0/__module.model.0.conv # /home/ubuntu/zj/yolov3/deepshift/ste.py:86:0
	Source Location:
		/home/ubuntu/zj/yolov3/deepshift/ste.py(86): clampabs
		/home/ubuntu/zj/yolov3/deepshift/modules_q.py(294): forward
		/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
		/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
		/home/ubuntu/zj/yolov3/models/common.py(47): forward
		/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
		/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
		/home/ubuntu/zj/yolov3/models/yolo.py(150): _forward_once
		/home/ubuntu/zj/yolov3/models/yolo.py(127): forward
		/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
		/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
		/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py(965): trace_module
		/home/ubuntu/anaconda3/envs/yolov3/lib/python3.6/site-packages/torch/jit/_trace.py(750): trace
		/home/ubuntu/zj/yolov3/utils/loggers/__init__.py(89): on_train_batch_end
		/home/ubuntu/zj/yolov3/utils/callbacks.py(76): run
		train.py(385): train
		train.py(564): main
		train.py(667): <module>
	Comparison exception: Tensor-likes are not close!

	Mismatched elements: 190 / 432 (44.0%)
	Greatest absolute difference: 0.1587936282157898 at index (3, 1, 1, 0) (up to 1e-05 allowed)
	Greatest relative difference: 0.26716628984998353 at index (0, 0, 0, 0) (up to 1.3e-06 allowed)

2. If we comment out the line

    #self.weight.data = ste.clampabs(self.weight.data, 2**self.shift_range[0], 2**self.shift_range[1])

the NaN occurs.

Can you give me some suggestions? Thank you very much
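
A possible reading of the TracingCheckError: torch.jit.trace runs the module more than once and compares the results, and a forward that overwrites self.weight.data with a normalized, clamped copy changes the weights on every call, so two runs no longer agree. The toy below reproduces that effect without any tracing. The function renormalize_in_place and the clamp bound of 1.0 are stand-ins chosen for illustration, not DeepShift code.

```python
import torch

def renormalize_in_place(weight, max_abs=1.0):
    # Mimics a forward() that writes the normalized, clamped weights back
    # into the parameter each time it is called.
    w = weight.data
    w = w.add(-w.mean()).div(w.std())
    weight.data = w.clamp(-max_abs, max_abs)
    return weight.data.clone()

weight = torch.nn.Parameter(torch.tensor([-3.0, -1.0, 1.0, 3.0]))

first = renormalize_in_place(weight)    # weights after the first "forward"
second = renormalize_in_place(weight)   # weights after the second "forward"

print(first)
print(second)
print(torch.allclose(first, second))    # False: the two calls disagree
```

Clamping alone gives the same result when applied twice, which may be why the unmodified forward does not trip the same check. Normalizing a local copy instead of the stored parameter avoids the drift.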

mengjingyouling avatar Apr 24 '22 07:04 mengjingyouling

I think we should not modify self.weight.data. Can you change the code to:

#@weak_script_method
def forward(self, input):

    mean = self.weight.data.mean()
    std = self.weight.data.std()
    weight_norm = self.weight.data.add(-mean).div(std)
    weight_norm = ste.clampabs(weight_norm, 2**self.shift_range[0], 2**self.shift_range[1])

    weight_q = ste.round_power_of_2(weight_norm, self.rounding)

I will try to reply to your other questions later today

mostafaelhoushi avatar Apr 24 '22 15:04 mostafaelhoushi

We trained for 1000 epochs with this code and found that the model's mAP grew slowly; the final accuracy was only half that of the fp32 model. There seems to be an error in this code that prevents the Roundpowerof2 function from back-propagating. To verify this, we added a print statement to the backward part of Roundpowerof2. Sure enough, with the code above the backward code is never executed. The correct way to write the code should be:

    mean = self.weight.mean()
    std = self.weight.std()
    weight_norm = self.weight.add(-mean).div(std)
    # For now, we do not limit the bit width of the weight.
    #weight_norm = ste.clampabs(weight_norm, 2**self.shift_range[0], 2**self.shift_range[1])

However, the problem of NaN still cannot be solved. Do you have any suggestions? Thank you very much.
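
The observation about back-propagation matches how .data behaves in PyTorch: a tensor obtained through .data is detached from the autograd graph, so nothing computed from it sends a gradient back to the parameter. A minimal check, using a hypothetical normalize helper and a weighted sum in place of the real loss:

```python
import torch

def normalize(w):
    # Same normalization as in the thread: subtract the mean, divide by the std.
    return w.add(-w.mean()).div(w.std())

weight = torch.nn.Parameter(torch.tensor([0.5, -1.5, 2.0, 0.25]))

# Through .data: the result is cut off from the graph.
detached = normalize(weight.data)
print(detached.requires_grad)      # False

# Through the parameter itself: gradients reach weight.
attached = normalize(weight)
(attached * torch.arange(4.0)).sum().backward()
print(weight.grad is not None)     # True
```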

mengjingyouling avatar Apr 28 '22 12:04 mengjingyouling