hls4ml
Number of filters limitation
Is there a limit on the number of filters in a CNN? A layer with 32 filters tends to be the bottleneck during synthesis: the synthesis is unable to complete and gets stuck at the Conv2D layer that has 32 filters.
Depends on your config. Assuming you use io_stream, the limit will be related to the strategy used, since that affects the algorithm used for the CNN kernel. If you use the Latency strategy (the default), then filt_height x filt_width x n_channels x n_filters < 4096. If you use io_parallel, well, you shouldn't be using it with large models.
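To make that limit concrete, here is a small back-of-the-envelope check (the 4096 figure is the unroll limit quoted above; the layer sizes are just illustrative):

```python
# Check whether a conv layer fits under the Latency-strategy unroll limit
def fits_latency_limit(filt_height, filt_width, n_channels, n_filters, limit=4096):
    product = filt_height * filt_width * n_channels * n_filters
    return product, product < limit

print(fits_latency_limit(3, 3, 16, 16))  # (2304, True)  -> 16 filters is fine
print(fits_latency_limit(3, 3, 16, 32))  # (4608, False) -> 32 filters exceeds the limit
```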
I have used io_stream and the strategy used is Resource. Another issue that I am facing is that when I use the config file to generate the hls4ml model from the quantized model, it results in an accuracy drop from ~75% to ~10%. I checked that the baseline quantized model predicts with an accuracy of around ~71%, but the hls4ml model drops in accuracy after synthesising the model.
Hi @wilfredkisku, can you share your model? You may look into the tracing / profiling functionality.
You can make 1D plots of the expected output vs hls4ml output like so: https://github.com/hls4ml-finn-mlperftiny/CIFAR10/blob/main/hls4ml/convert.py#L222-L233
This can help you pinpoint which layers are causing a mismatch. Then you can increase the precision of those layers. Usually it's required to either increase the precision of the outputs or the accumulators (or both).
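For reference, a minimal sketch of how that tracing could look with the hls4ml trace/profiling utilities (assuming a QKeras model `qmodel`, a test array `X_test`, and a `granularity='name'` config `hls_config_q`; adapt the names and converter arguments to your setup):

```python
import numpy as np
import hls4ml
from hls4ml.model.profiling import get_ymodel_keras

# Enable tracing for every layer so the hls4ml model records intermediate outputs
for layer in hls_config_q['LayerName'].keys():
    hls_config_q['LayerName'][layer]['Trace'] = True

hls_model = hls4ml.converters.convert_from_keras_model(
    qmodel, hls_config=hls_config_q, io_type='io_stream',
    backend='Vivado', output_dir='trace_model')
hls_model.compile()

# Layer-by-layer outputs from the hls4ml model and from the (Q)Keras model
hls_pred, hls_trace = hls_model.trace(np.ascontiguousarray(X_test[:100]))
keras_trace = get_ymodel_keras(qmodel, X_test[:100])

# Find the first layer where the two start to disagree
for name, hls_out in hls_trace.items():
    if name in keras_trace:
        diff = np.max(np.abs(np.asarray(hls_out).ravel() - np.asarray(keras_trace[name]).ravel()))
        print(f'{name:20s} max abs difference: {diff:.4f}')
```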
Thanks @jmduarte for the help. I am including the model to give a better picture of what I am trying to synthesize using hls4ml. I have been testing with an IFM bit precision of 4 and a weight precision of 12, and also 16 and 16, but the accuracies of the Keras model and the hls4ml model still differ a lot.
from tensorflow.keras.layers import Input, Activation, Add, Dense, Flatten, MaxPooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l1
from qkeras import QActivation
from qkeras import QDense, QConv2DBatchnorm

IFM = 16
WGT = 16

def QResNet9(input_shape=(32, 32, 3), classes=10):
    img_input = Input(shape=input_shape)
    x = QActivation('quantized_relu(' + str(IFM) + ',0)', name='relu_in')(img_input)
    x = QConv2DBatchnorm(16, kernel_size=(3, 3), strides=(1, 1),
                         kernel_quantizer="quantized_bits(" + str(WGT) + ",0,alpha=1)",
                         kernel_initializer='lecun_uniform',
                         kernel_regularizer=l1(0.0001), name='conv1', use_bias=True)(x)
    x = QActivation('quantized_relu(' + str(IFM) + ',0)', name='relu_conv1')(x)
    x = QConv2DBatchnorm(16, kernel_size=(3, 3), strides=(1, 1),
                         kernel_quantizer="quantized_bits(" + str(WGT) + ",0,alpha=1)",
                         kernel_initializer='lecun_uniform',
                         kernel_regularizer=l1(0.0001), name='conv2', use_bias=True)(x)
    x = QActivation('quantized_relu(' + str(IFM) + ',0)', name='relu_conv2')(x)
    x = MaxPooling2D(pool_size=(2, 2), name='pool1')(x)

    x_skip = x
    x_skip = QConv2DBatchnorm(16, kernel_size=(3, 3), strides=(1, 1),
                              kernel_quantizer="quantized_bits(" + str(WGT) + ",0,alpha=1)",
                              kernel_initializer='lecun_uniform',
                              kernel_regularizer=l1(0.0001), name='conv3', use_bias=True)(x_skip)
    x_skip = QActivation('quantized_relu(' + str(IFM) + ',0)', name='relu_conv3_skip')(x_skip)
    x = QConv2DBatchnorm(16, kernel_size=(3, 3), strides=(1, 1),
                         kernel_quantizer="quantized_bits(" + str(WGT) + ",0,alpha=1)",
                         kernel_initializer='lecun_uniform',
                         kernel_regularizer=l1(0.0001), name='conv4', use_bias=True)(x)
    x = QActivation('quantized_relu(' + str(IFM) + ',0)', name='relu_conv4')(x)
    #x = QConv2DBatchnorm(16, kernel_size=(3, 3), strides=(1, 1),
    #                     kernel_quantizer="quantized_bits(" + str(WGT) + ",0,alpha=1)",
    #                     kernel_initializer='lecun_uniform',
    #                     kernel_regularizer=l1(0.0001), name='conv4', use_bias=True)(x)
    #x = QActivation('quantized_relu(' + str(IFM) + ',0)', name='relu_conv4')(x)
    x = Add()([x, x_skip])

    x = QConv2DBatchnorm(24, kernel_size=(3, 3), strides=(1, 1),
                         kernel_quantizer="quantized_bits(" + str(WGT) + ",0,alpha=1)",
                         kernel_initializer='lecun_uniform',
                         kernel_regularizer=l1(0.0001), name='conv5', use_bias=True)(x)
    x = QActivation('quantized_relu(' + str(IFM) + ',0)', name='relu_conv5')(x)
    x = MaxPooling2D(pool_size=(2, 2), name='pool2')(x)
    #x = QConv2DBatchnorm(32, kernel_size=(3, 3), strides=(1, 1),
    #                     kernel_quantizer="quantized_bits(" + str(WGT) + ",0,alpha=1)",
    #                     kernel_initializer='lecun_uniform',
    #                     kernel_regularizer=l1(0.0001), name='conv6', use_bias=True)(x)
    #x = QActivation('quantized_relu(' + str(IFM) + ',0)', name='relu_conv6')(x)
    #x = MaxPooling2D(pool_size=(2, 2), name='pool3')(x)

    x_skip = x
    x_skip = QConv2DBatchnorm(24, kernel_size=(3, 3), strides=(1, 1),
                              kernel_quantizer="quantized_bits(" + str(WGT) + ",0,alpha=1)",
                              kernel_initializer='lecun_uniform',
                              kernel_regularizer=l1(0.0001), name='conv6', use_bias=True)(x_skip)
    x_skip = QActivation('quantized_relu(' + str(IFM) + ',0)', name='relu_conv6_skip')(x_skip)
    x = QConv2DBatchnorm(24, kernel_size=(3, 3), strides=(1, 1),
                         kernel_quantizer="quantized_bits(" + str(WGT) + ",0,alpha=1)",
                         kernel_initializer='lecun_uniform',
                         kernel_regularizer=l1(0.0001), name='conv7', use_bias=True)(x)
    x = QActivation('quantized_relu(' + str(IFM) + ',0)', name='relu_conv7')(x)
    #x = MaxPooling2D(pool_size=(2, 2), name='pool4')(x)
    #x = QConv2DBatchnorm(32, kernel_size=(3, 3), strides=(1, 1),
    #                     kernel_quantizer="quantized_bits(" + str(WGT) + ",0,alpha=1)",
    #                     kernel_initializer='lecun_uniform',
    #                     kernel_regularizer=l1(0.0001), name='conv8', use_bias=True)(x)
    #x = QActivation('quantized_relu(' + str(IFM) + ',0)', name='relu_conv8')(x)
    x = Add()([x, x_skip])
    x = MaxPooling2D()(x)
    #x = MaxPooling2D(pool_size=(2, 2), name='pool5')(x)

    x = Flatten()(x)
    x = Dense(10, name='output_dense')(x)
    x_out = Activation('softmax', name='output_softmax')(x)

    qmodel = Model(inputs=[img_input], outputs=[x_out], name='qkeras')
    return qmodel
Accuracy Keras: 0.7033333333333334
Accuracy hls4ml: 0.11666666666666667
I am including the configuration for hls4ml that I have used.
import hls4ml

# Then the QKeras model
hls4ml.model.optimizer.OutputRoundingSaturationMode.layers = ['Activation']
hls4ml.model.optimizer.OutputRoundingSaturationMode.rounding_mode = 'AP_RND'
hls4ml.model.optimizer.OutputRoundingSaturationMode.saturation_mode = 'AP_SAT'
hls_config_q = hls4ml.utils.config_from_keras_model(qmodel, granularity='name')
hls_config_q['Model']['Strategy'] = 'Resource'
hls_config_q['Model']['ReuseFactor'] = 144
hls_config_q['Model']['Precision'] = 'ap_fixed<16,6>'
hls_config_q['LayerName']['conv1']['Strategy'] = 'Resource'
hls_config_q['LayerName']['conv1']['ReuseFactor'] = 108
hls_config_q['LayerName']['conv2']['Strategy'] = 'Resource'
hls_config_q['LayerName']['conv2']['ReuseFactor'] = 144
hls_config_q['LayerName']['conv3']['Strategy'] = 'Resource'
hls_config_q['LayerName']['conv3']['ReuseFactor'] = 144
hls_config_q['LayerName']['conv4']['Strategy'] = 'Resource'
hls_config_q['LayerName']['conv4']['ReuseFactor'] = 144
hls_config_q['LayerName']['conv5']['Strategy'] = 'Resource'
hls_config_q['LayerName']['conv5']['ReuseFactor'] = 144
hls_config_q['LayerName']['conv7']['Strategy'] = 'Resource'
hls_config_q['LayerName']['conv7']['ReuseFactor'] = 144
hls_config_q['LayerName']['conv6']['Strategy'] = 'Resource'
hls_config_q['LayerName']['conv6']['ReuseFactor'] = 144
hls_config_q['LayerName']['output_dense']['Strategy'] = 'Resource'
hls_config_q['LayerName']['output_dense']['ReuseFactor'] = 160
hls_config_q['LayerName']['output_softmax']['Strategy'] = 'Stable'
plotting.print_dict(hls_config_q)
cfg_q = hls4ml.converters.create_config(backend='Vivado')
cfg_q['IOType'] = 'io_stream' # Must set this if using CNNs!
cfg_q['HLSConfig'] = hls_config_q
cfg_q['KerasModel'] = qmodel
cfg_q['OutputDir'] = 'quantized_cnn_model_C/'
cfg_q['XilinxPart'] = 'xczu7ev-ffvc1156-2-e'
#cfg_q['XilinxPart'] = 'xcu250-figd2104-2L-e'
hls_model_q = hls4ml.converters.keras_to_hls(cfg_q)
hls_model_q.compile()
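For completeness, a small sketch of how the two accuracies quoted above were compared after `compile()` (assuming `X_test`/`y_test` are the preprocessed, one-hot CIFAR-10 test arrays; the names are placeholders):

```python
import numpy as np

y_keras = qmodel.predict(X_test)
y_hls = hls_model_q.predict(np.ascontiguousarray(X_test))

acc_keras = np.mean(np.argmax(y_keras, axis=1) == np.argmax(y_test, axis=1))
acc_hls = np.mean(np.argmax(y_hls, axis=1) == np.argmax(y_test, axis=1))
print(f'Accuracy Keras:  {acc_keras:.4f}')
print(f'Accuracy hls4ml: {acc_hls:.4f}')
```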
A few other details I want to include are:
- I am using Ubuntu installed in a VM with ~45 GB of RAM allocated to it.
- If I use more than 16 filters for any of the layers, the synthesis build gets stuck during loop unrolling for that convolutional layer. Is there a way I can increase the number of filters and still be able to synthesize without this issue?
- I am using a Xilinx UltraScale+ MPSoC ZCU104.
Hi, have you solved your problem? I also ran into the accuracy problem when I tested on a ResNet.
Hi, have you solved your problem? Maybe you can compare the output of the "output_softmax" layer between the hls_model and the Keras model; this is my problem: https://github.com/fastmachinelearning/hls4ml/issues/590
@liuhao-97 thank you for the reply. No, I could not get it corrected; it still has the accuracy drop. Is there anything else to rectify the issue that you have pointed out?
Hi, have you tried with a full-precision model (ap_fixed<32,16>)? I mean, don't quantize the model and set the hls config to ap_fixed<32,16>. Maybe you can check the output of the last softmax layer between the Keras model and the hls model.
For me, I found there might be some problem with the softmax layer. I printed the output of the dense layer and it works fine, but for the softmax layer the output is totally different. If you check this link https://github.com/hls4ml-finn-mlperftiny/CIFAR10/blob/main/hls4ml/convert.py you will find that it removes the softmax layer, so I assume there might be some problem with the softmax layer.
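If it helps, a rough sketch of the workaround used in the linked script: drop the final softmax and compare logits instead (softmax is monotonic, so the predicted class is unchanged). `logits_model` and `hls_model_logits` are hypothetical names; the layer names follow the model posted above:

```python
import numpy as np
import tensorflow as tf

# Rebuild the Keras model up to the dense layer, i.e. without the softmax
logits_model = tf.keras.Model(qmodel.input, qmodel.get_layer('output_dense').output)
y_logits_keras = logits_model.predict(X_test)

# Convert `logits_model` with hls4ml the same way as before, then compare:
# y_logits_hls = hls_model_logits.predict(np.ascontiguousarray(X_test))
# agreement = np.mean(np.argmax(y_logits_hls, axis=1) == np.argmax(y_logits_keras, axis=1))
# If the argmax agrees, the mismatch is coming from the softmax implementation.
```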
@liuhao-97 I tried to profile the layers but came up with a `graph disconnected` error:
ValueError: Graph disconnected: cannot obtain value for tensor KerasTensor(type_spec=TensorSpec(shape=(None, 32, 32, 1), dtype=tf.float32, name='input_1'), name='input_1', description="created by layer 'input_1'") at layer "prune_low_magnitude_conv1". The following previous layers were accessed without issue: []
The full-precision model works fine; for me the accuracy drops only for the quantized hls model.
Can you print your output? Does it consist of some repeated number and some zeros, like [0.25, 0.25, 0.25, 0, 0, 0]?
I am not able to print the output yet.
Did you prune the model? I think a quantized pruned model can't work fine with hls4ml. Maybe you can try a quantized model but don't prune it.
Besides, which hls4ml are you using? hls4ml 0.6.0 or the newest branch?
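On the pruning point: if the model was trained with the TensorFlow Model Optimization pruning wrappers (the `prune_low_magnitude_*` names in the traceback suggest it was), one option is to strip the wrappers before handing the model to hls4ml. A rough sketch, with `pruned_qmodel` as a placeholder name:

```python
import tensorflow_model_optimization as tfmot

# Remove the pruning wrappers; the weights keep the sparsity learned during training
stripped_model = tfmot.sparsity.keras.strip_pruning(pruned_qmodel)
stripped_model.summary()

# Then build the hls4ml config from `stripped_model` instead of the wrapped model:
# hls_config_q = hls4ml.utils.config_from_keras_model(stripped_model, granularity='name')
```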
I am using hls4ml 0.6.0. Has this issue been resolved in the new branch?
Not sure. You can have a try with the new branch. Besides, have you tried with "io_type='io_parallel'"? Maybe it can solve the problem.
https://github.com/fastmachinelearning/hls4ml/pull/448 Maybe you can check this.
@liuhao-97 I tried but it still did not work for me. Did you get a workaround to make sure that the accuracy does not drop?
I think it is because you pruned the model. When you prune the model, somehow the original layer-by-layer connections of the model go wrong, which can be seen in your error.
Can you try with a non-pruned model again to see if there is still an accuracy loss?
@liuhao-97 yes, the error has been removed after I removed pruning. Thank you.
@jmduarte models that have concatenate or add layers have a considerable accuracy drop. Might be a bug.