darknetz
Convolutional layer and maxpool layer are not working in TrustZone
Hi mofan,
I cannot train a model when convolutional and maxpool layers are partitioned into the TrustZone. I tested a vgg-7 model on cifar10 using the following command.
# darknetp classifier train -pp 4 cfg/cifar.data cfg/vgg-7_cifar10.cfg
I got the TEEC_InvokeCommand(FC) error.
Prepare session with the TA
Begin darknet
vgg-7_cifar10
1
layer filters size input output
0 conv 16 3 x 3 / 1 32 x 32 x 3 -> 32 x 32 x 16 0.001 BFLOPs
1 conv 16 3 x 3 / 1 32 x 32 x 16 -> 32 x 32 x 16 0.005 BFLOPs
2 max 2 x 2 / 2 32 x 32 x 16 -> 16 x 16 x 16
3 conv 32 3 x 3 / 1 16 x 16 x 16 -> 16 x 16 x 32 0.002 BFLOPs
4 conv_TA 32 3 x 3 / 1 16 x 16 x 32 -> 16 x 16 x 32 0.005 BFLOPs
5 max_TA 2 x 2 / 2 16 x 16 x 32 -> 8 x 8 x 32
6 conv_TA 32 3 x 3 / 1 8 x 8 x 32 -> 8 x 8 x 32 0.001 BFLOPs
7 conv_TA 32 3 x 3 / 1 8 x 8 x 32 -> 8 x 8 x 32 0.001 BFLOPs
8 max_TA 2 x 2 / 2 8 x 8 x 32 -> 4 x 4 x 32
9 connected_TA 512 -> 64
darknetp: TEEC_InvokeCommand(FC) failed 0xffff3024 origin 0x3
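As an aside (not part of the original report), the return code and origin in this message can be decoded using the standard GlobalPlatform TEE constants, which also appear in OP-TEE's `tee_api_defines.h` and `tee_client_api.h`. A minimal Python sketch (the dictionaries list only a few relevant codes, not the full set):

```python
# Partial tables of GlobalPlatform TEE return codes and origins,
# copied from tee_api_defines.h / tee_client_api.h.
TEEC_ERRORS = {
    0xFFFF0006: "TEE_ERROR_BAD_PARAMETERS",
    0xFFFF000C: "TEE_ERROR_OUT_OF_MEMORY",
    0xFFFF3024: "TEE_ERROR_TARGET_DEAD",
}
TEEC_ORIGINS = {
    0x1: "TEEC_ORIGIN_API",
    0x2: "TEEC_ORIGIN_COMMS",
    0x3: "TEEC_ORIGIN_TEE",
    0x4: "TEEC_ORIGIN_TRUSTED_APP",
}

def decode(ret: int, origin: int) -> str:
    """Map a TEEC_InvokeCommand return code and origin to their names."""
    err = TEEC_ERRORS.get(ret, hex(ret))
    org = TEEC_ORIGINS.get(origin, hex(origin))
    return f"{err} (origin: {org})"

print(decode(0xFFFF3024, 0x3))
# TEE_ERROR_TARGET_DEAD (origin: TEEC_ORIGIN_TEE)
```

`0xffff3024` is `TEE_ERROR_TARGET_DEAD`, i.e. the trusted application itself crashed or was killed, which is consistent with the TA running out of secure memory.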
I suspected that the error was related to model size, so I moved the partition point to layer 7 using the command below.
# darknetp classifier train -pp 7 cfg/cifar.data cfg/vgg-7_cifar10.cfg
But I still got the same error.
Prepare session with the TA
Begin darknet
vgg-7_cifar10
1
layer filters size input output
0 conv 16 3 x 3 / 1 32 x 32 x 3 -> 32 x 32 x 16 0.001 BFLOPs
1 conv 16 3 x 3 / 1 32 x 32 x 16 -> 32 x 32 x 16 0.005 BFLOPs
2 max 2 x 2 / 2 32 x 32 x 16 -> 16 x 16 x 16
3 conv 32 3 x 3 / 1 16 x 16 x 16 -> 16 x 16 x 32 0.002 BFLOPs
4 conv 32 3 x 3 / 1 16 x 16 x 32 -> 16 x 16 x 32 0.005 BFLOPs
5 max 2 x 2 / 2 16 x 16 x 32 -> 8 x 8 x 32
6 conv 32 3 x 3 / 1 8 x 8 x 32 -> 8 x 8 x 32 0.001 BFLOPs
7 conv_TA 32 3 x 3 / 1 8 x 8 x 32 -> 8 x 8 x 32 0.001 BFLOPs
8 max_TA 2 x 2 / 2 8 x 8 x 32 -> 4 x 4 x 32
9 connected_TA 512 -> 64
darknetp: TEEC_InvokeCommand(FC) failed 0xffff3024 origin 0x3
Then, I moved the partition point to layer 8 and got a TEEC_InvokeCommand(backward) error, shown below.
# darknetp classifier train -pp 8 cfg/cifar.data cfg/vgg-7_cifar10.cfg
Prepare session with the TA
Begin darknet
vgg-7_cifar10
1
layer filters size input output
0 conv 16 3 x 3 / 1 32 x 32 x 3 -> 32 x 32 x 16 0.001 BFLOPs
1 conv 16 3 x 3 / 1 32 x 32 x 16 -> 32 x 32 x 16 0.005 BFLOPs
2 max 2 x 2 / 2 32 x 32 x 16 -> 16 x 16 x 16
3 conv 32 3 x 3 / 1 16 x 16 x 16 -> 16 x 16 x 32 0.002 BFLOPs
4 conv 32 3 x 3 / 1 16 x 16 x 32 -> 16 x 16 x 32 0.005 BFLOPs
5 max 2 x 2 / 2 16 x 16 x 32 -> 8 x 8 x 32
6 conv 32 3 x 3 / 1 8 x 8 x 32 -> 8 x 8 x 32 0.001 BFLOPs
7 conv 32 3 x 3 / 1 8 x 8 x 32 -> 8 x 8 x 32 0.001 BFLOPs
8 max_TA 2 x 2 / 2 8 x 8 x 32 -> 4 x 4 x 32
9 connected_TA 512 -> 64
10 dropout_TA p = 0.80 64 -> 64
11 connected_TA 64 -> 10
12 softmax_TA 10
13 cost_TA 10
Learning Rate: 0.01, Momentum: 0.9, Decay: 5e-05
3000
32 28
output file: /media/results/train_vgg-7_cifar10_pp8.txt
current_batch=0
Loaded: 0.391526 seconds
darknetp: TEEC_InvokeCommand(backward) failed 0xffff3024 origin 0x3
It worked fine when the partition point was set at the fully connected layer, layer 9.
# darknetp classifier train -pp 9 cfg/cifar.data cfg/vgg-7_cifar10.cfg
Prepare session with the TA
Begin darknet
vgg-7_cifar10
1
layer filters size input output
0 conv 16 3 x 3 / 1 32 x 32 x 3 -> 32 x 32 x 16 0.001 BFLOPs
1 conv 16 3 x 3 / 1 32 x 32 x 16 -> 32 x 32 x 16 0.005 BFLOPs
2 max 2 x 2 / 2 32 x 32 x 16 -> 16 x 16 x 16
3 conv 32 3 x 3 / 1 16 x 16 x 16 -> 16 x 16 x 32 0.002 BFLOPs
4 conv 32 3 x 3 / 1 16 x 16 x 32 -> 16 x 16 x 32 0.005 BFLOPs
5 max 2 x 2 / 2 16 x 16 x 32 -> 8 x 8 x 32
6 conv 32 3 x 3 / 1 8 x 8 x 32 -> 8 x 8 x 32 0.001 BFLOPs
7 conv 32 3 x 3 / 1 8 x 8 x 32 -> 8 x 8 x 32 0.001 BFLOPs
8 max 2 x 2 / 2 8 x 8 x 32 -> 4 x 4 x 32
9 connected_TA 512 -> 64
10 dropout_TA p = 0.80 64 -> 64
11 connected_TA 64 -> 10
12 softmax_TA 10
13 cost_TA 10
Learning Rate: 0.01, Momentum: 0.9, Decay: 5e-05
3000
32 28
output file: /media/results/train_vgg-7_cifar10_pp9.txt
current_batch=0
Loaded: 0.431457 seconds
1, 0.017: 0.000000, 0.000000 avg, 0.009992 rate, 62.028127 seconds, 50 images
user CPU start: 0.210258; end: 62.005439
kernel CPU start: 7.809596; end: 9.176006
Max: 32172 kilobytes
vmsize:75192; vmrss:31768; vmdata:70864; vmstk:132; vmexe:268; vmlib:1872
Loaded: 0.000641 seconds
So it seems that fully connected layers work when put in the TrustZone, while the other layer types (maxpool and convolution) crash the program. Any help is appreciated!
Hi, for the TEEC_InvokeCommand(FC) error, it seems the FC layer cannot even be created, and I agree with your reasoning that the TEE memory cannot hold this layer.
One possible reason for the TEEC_InvokeCommand(backward) error is still that the TEE memory is not enough: the backward pass requires much more memory on top of creating the model and running the forward pass. This could be why the model works when you put fewer layers inside.
You may want to calculate the number of parameters for each layer (mainly the conv and FC layers); this helps estimate the memory usage per layer. The TEE is relatively very small when a single layer has many parameters, e.g., the 512 x 64 FC layer you used.
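The per-layer parameter estimate suggested above can be sketched as follows. This is a rough weights-only estimate (float32); layer shapes are copied from the vgg-7 log in this thread, and actual TEE usage during training will be a multiple of this because of activation, delta, and gradient buffers:

```python
def conv_params(k, c_in, c_out):
    """Parameters of a k x k convolution: weights plus biases."""
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    """Parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

BYTES = 4  # float32
layers = {
    "conv layer 7 (3x3, 32 -> 32)": conv_params(3, 32, 32),
    "fc layer 9 (512 -> 64)": fc_params(512, 64),
    "fc layer 11 (64 -> 10)": fc_params(64, 10),
}
for name, p in layers.items():
    print(f"{name}: {p} params, ~{p * BYTES / 1024:.1f} KiB of weights")
```

Even this rough count shows the 512 x 64 FC layer dominates the weight memory of the layers placed in the TA.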
Hope this helps!
I'm closing this issue. Feel free to reopen if you have more updates.
Hi @mofanv
I am facing a similar problem when trying to train a model, but this time I got a TEEC_InvokeCommand(backward) failed 0xffff0006 origin 0x4 error.
Here's the log:
# darknetp classifier train -pp_start 4 -pp_end 10 cfg/mnist.dataset cfg/mnist_lenet.cfg
Prepare session with the TA
Begin darknet
mnist_lenet
1
layer filters size input output
0 conv 6 5 x 5 / 1 28 x 28 x 3 -> 28 x 28 x 6 0.001 BFLOPs
1 max 2 x 2 / 2 28 x 28 x 6 -> 14 x 14 x 6
2 conv 6 5 x 5 / 1 14 x 14 x 6 -> 14 x 14 x 6 0.000 BFLOPs
3 max 2 x 2 / 2 14 x 14 x 6 -> 7 x 7 x 6
4 connected_TA 294 -> 120
5 dropout_TA p = 0.80 120 -> 120
6 connected_TA 120 -> 84
7 dropout_TA p = 0.80 84 -> 84
8 connected_TA 84 -> 10
9 softmax_TA 10
10 cost_TA 10
workspace_size=235200
Learning Rate: 0.01, Momentum: 0.9, Decay: 5e-05
3000
28 28
output file: /media/results/train_mnist_lenet_pps4_ppe10.txt
current_batch=0
Loaded: 0.284791 seconds
darknetp: TEEC_InvokeCommand(backward) failed 0xffff0006 origin 0x4
I found in optee_os/lib/libutee/include/tee_api_defines.h that error code 0xffff0006 refers to TEE_ERROR_BAD_PARAMETERS.
I have tried moving the partition point, but got the same error:
# darknetp classifier train -pp_start 9 -pp_end 10 cfg/mnist.dataset cfg/mnist_lenet.cfg
Prepare session with the TA
Begin darknet
mnist_lenet
1
layer filters size input output
0 conv 6 5 x 5 / 1 28 x 28 x 3 -> 28 x 28 x 6 0.001 BFLOPs
1 max 2 x 2 / 2 28 x 28 x 6 -> 14 x 14 x 6
2 conv 6 5 x 5 / 1 14 x 14 x 6 -> 14 x 14 x 6 0.000 BFLOPs
3 max 2 x 2 / 2 14 x 14 x 6 -> 7 x 7 x 6
4 connected 294 -> 120
5 dropout p = 0.80 120 -> 120
6 connected 120 -> 84
7 dropout p = 0.80 84 -> 84
8 connected 84 -> 10
9 softmax_TA 10
10 cost_TA 10
workspace_size=235200
Learning Rate: 0.01, Momentum: 0.9, Decay: 5e-05
3000
28 28
output file: /media/results/train_mnist_lenet_pps9_ppe10.txt
current_batch=0
Loaded: 0.303591 seconds
darknetp: TEEC_InvokeCommand(backward) failed 0xffff0006 origin 0x4
Do you have any idea what's happening here?
Thanks.
Hi @shibz-islam, I just tried the master branch, and it ran as expected without errors. Can you please give more details, such as whether you are using a real device or simulation, and whether it is the latest version of OP-TEE?
Hi @mofanv, thanks for the quick response. I am using simulation with QEMUv8, and the version of OP-TEE is 3.8.0. Do you think the problem could be with the version of OP-TEE, since I am not using the latest one?
I think 3.8.0 should be OK. I have used this version.
The error is quite strange. As far as I remember, I have not seen TEE_ERROR_BAD_PARAMETERS before. Maybe the params are set incorrectly when invoking the backward function? But since I can run it correctly, that shouldn't be the case; there is no such error in this application.
One more possible reason could be the setup of the environment. I suggest starting from the beginning, or trying another machine if you have one.
Okay, I will try from the beginning. Thank you for your help.
@mofanv It is working now! Actually, I had two darknetz projects in the build directory; maybe this was the reason for the problem. Once I set up the environment again with the latest version of OP-TEE, I didn't face that error again. Really appreciate your help.