
BNN taking longer time than full-precision network

Open kumarmanas opened this issue 5 years ago • 6 comments

I was trying to compare a Larq BNN with a full-precision network (by making Integer and kernel_quantizer=None). I found that the time taken to run the program is longer for the BNN compared to full precision. Is this expected? Time to train is an important parameter for an efficient network.

kumarmanas avatar Nov 23 '19 23:11 kumarmanas

by making Integer and kernel_quantizer=None

Could you elaborate a bit on what you are doing? If possible, it would be good to post a minimal code sample that reproduces the issue.

I found that the time taken to run the program is longer for the BNN compared to full precision.

Are you referring to time per epoch or step, or total training time? Could you elaborate on the time difference?

lgeiger avatar Nov 24 '19 09:11 lgeiger

For the code segment below from your BNN example, I made Integer and kernel_quantizer=None instead of ste_sign:

larq.layers.QuantDense(512,
                       kernel_quantizer="ste_sign",
                       kernel_constraint="weight_clip"),
larq.layers.QuantDense(10,
                       input_quantizer="ste_sign",
                       kernel_quantizer="ste_sign",
                       kernel_constraint="weight_clip",
                       activation="softmax")])
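Roughly, the full-precision variant looks like this (a sketch, not the exact notebook code; the model name and input shape are only illustrative, and without quantizers a QuantDense layer should behave like a plain Dense layer):

import larq
import tensorflow as tf

# Full-precision counterpart of the two quantized layers above (illustrative).
fp_model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),  # illustrative input shape
    larq.layers.QuantDense(512, kernel_quantizer=None, kernel_constraint=None),
    larq.layers.QuantDense(10, input_quantizer=None, kernel_quantizer=None,
                           kernel_constraint=None, activation="softmax"),
])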

I used time.clock() and time.time() to measure the total training time of the code and found that the BNN time is greater than full precision. I just put time.clock() at the start and end of the program to get the total running time of the BNN and full-precision programs.

The code I used for testing: https://github.com/larq/larq/blob/master/docs/examples/mnist.ipynb

kumarmanas avatar Nov 25 '19 10:11 kumarmanas

I made Integer and kernel_quantizer=None instead of ste_sign

What do you mean by "Integer" in this context?

I used time.clock() and time.time() to measure the total training time of the code and found that the BNN time is greater than full precision

What's the time difference?

Larq (and TensorFlow) use fake quantization during training and thus run the calculations in float32 or float16. When using a latent-weight-based training method, this means that during training we add extra calculations for the kernels and inputs (i.e. ste_sign) to compute the binarization, which may result in slightly slower training times. We are thinking about ways to make this significantly faster by implementing a truly binary forward pass, but we currently have no immediate plans for this.
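For intuition, here is a minimal sketch of the extra work a fake quantizer like ste_sign adds to the forward and backward pass (not Larq's actual implementation; the function name is made up):

import tensorflow as tf

@tf.custom_gradient
def fake_sign(x):
    # Forward pass: binarize to -1.0 / +1.0, but keep float dtype ("fake" quantization).
    y = tf.sign(tf.sign(x) + 0.1)  # treat exactly-zero inputs as +1

    def grad(dy):
        # Straight-through estimator: identity gradient, clipped to the region |x| <= 1.
        return dy * tf.cast(tf.abs(x) <= 1.0, dy.dtype)

    return y, grad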

Time to train is an important parameter for an efficient network.

I agree training time is important, but the main goal is to train networks that can be run efficiently during inference, so an increase in training time is often unavoidable.

lgeiger avatar Nov 26 '19 11:11 lgeiger

What's the time difference?

For the BNN, the total time from the start of the program (starting from the dataset load) until model.fit finishes is 184.05 seconds, and evaluation (model.evaluate) took 2.41 seconds. For full precision, the times are 169.52 seconds and 0.000094 seconds respectively. The number of epochs is 6. Code structure:

start_time = time.clock()
tf.keras.datasets.mnist.load_data()
# code lines as shown in the larq example
...
model.compile(...)
model.fit(...)
print(time.clock() - start_time, "train seconds")  # 184 s for BNN, 169 s for full precision
Eval_time = time.clock()
test_loss, test_acc = model.evaluate(....)
print(time.clock() - Eval_time)  # 2.41 s for BNN, 0.000094 s for full precision
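For per-epoch rather than whole-program timing, a small Keras callback could be used instead; a sketch (time.perf_counter replaces the deprecated time.clock, and the class name is just illustrative):

import time
import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    # Collects the wall-clock duration of each epoch during model.fit.
    def on_train_begin(self, logs=None):
        self.epoch_times = []

    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.perf_counter()

    def on_epoch_end(self, epoch, logs=None):
        self.epoch_times.append(time.perf_counter() - self._start)

# usage: model.fit(images, labels, epochs=6, callbacks=[EpochTimer()])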

What do you mean by "Integer" in this context?

Sorry for the typo, it should have been input_quantizer.

Note: the times can vary slightly, but the pattern is always the same (the BNN takes longer than full precision), both for train and evaluate.

kumarmanas avatar Nov 26 '19 20:11 kumarmanas

I'm facing the same issue. I tried the simple models below just to see the change in speed and file size, putting accuracy aside for the moment.

import larq as lq
from tensorflow.keras import layers, models

# full-precision model
simplemodel = models.Sequential()
simplemodel.add(layers.Conv2D(32, (3, 3), padding='same', input_shape=(32, 32, 3)))
simplemodel.add(layers.Flatten())
simplemodel.add(layers.Dense(10, activation='sigmoid'))

# binarized model
kwargs = dict(input_quantizer="ste_sign",
              kernel_quantizer="ste_sign",
              kernel_constraint="weight_clip",
              use_bias=False)

simplemodelbnn = models.Sequential()
simplemodelbnn.add(lq.layers.QuantConv2D(32, 3,
                                         kernel_quantizer="ste_sign",
                                         kernel_constraint="weight_clip",
                                         use_bias=False,
                                         input_shape=(32, 32, 3)))
simplemodelbnn.add(layers.Flatten())
simplemodelbnn.add(lq.layers.QuantDense(10, **kwargs, activation='sigmoid'))

I ran both models on the CIFAR-10 dataset, normalized to (0, 1) and (-1, 1), with the same compile settings and 2 epochs as an example. The full-precision model has Total params: 328,586, and the binarized model has Total params: 289k. But for both training and inference, the full-precision model ran faster than the binarized model, and the full-precision model has a smaller file size.

From what @lgeiger said I can now understand the slower training for the binarized model, but why is inference also slower?

Larq (and TensorFlow) use fake quantization during training and thus run the calculations in float32 or float16. When using a latent-weight-based training method, this means that during training we add extra calculations for the kernels and inputs (i.e. ste_sign) to compute the binarization, which may result in slightly slower training times. We are thinking about ways to make this significantly faster by implementing a truly binary forward pass, but we currently have no immediate plans for this.

The difference is small in absolute value since it's a relatively small dataset, but I tried several times and the binarized model always ran slower. The running time is read from the model.fit and model.evaluate output, both per epoch and per step.

susuhu avatar Sep 03 '21 08:09 susuhu

@susuhu Larq BNN inference is slower than full-precision inference because TensorFlow does not actually support binarized operations. To make it possible to train and evaluate BNNs, Larq therefore adds "fake" quantizers before the activations and weights that need to be binarized, mapping them from their original float values to -1.0 or 1.0. Note that even these binary values are floats: again, TensorFlow does not support non-float computations. This is also the reason the binary model may not be any smaller than the full-precision model in Keras: technically the weights are still floats.
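If you want to see the size the binarized model would have with truly 1-bit weights, Larq's model summary reports that theoretical memory footprint. A quick sketch, using the simplemodelbnn from your snippet above:

import larq as lq

# Prints per-layer precision and the memory the model would need
# if binary weights were actually stored as 1 bit each.
lq.models.summary(simplemodelbnn)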

The speedup you're looking for can be obtained with the Larq Compute Engine, an inference engine based on TensorFlow Lite that does support binary operations and is therefore much faster than running a "fake" BNN in the Python TensorFlow library. Hope that clears up some confusion!
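A minimal conversion sketch, assuming the larq-compute-engine package is installed and reusing the simplemodelbnn from above (the output filename is arbitrary):

import larq_compute_engine as lce

# Convert the trained Keras model to a TFLite flatbuffer that uses binary ops.
tflite_model = lce.convert_keras_model(simplemodelbnn)
with open("simplemodelbnn.tflite", "wb") as f:
    f.write(tflite_model)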

jneeven avatar Sep 03 '21 08:09 jneeven