Convolutional layer and maxpool layer are not working in TrustZone

Open ychen404 opened this issue 4 years ago • 8 comments

Hi mofan,

I cannot train a model when the convolutional and maxpool layers are partitioned into the TrustZone. I tested the vgg-7 model on cifar10 with the following command.

# darknetp classifier train -pp 4 cfg/cifar.data cfg/vgg-7_cifar10.cfg

I got the TEEC_InvokeCommand(FC) error.

Prepare session with the TA
Begin darknet
vgg-7_cifar10
1
layer     filters    size              input                output
    0 conv     16  3 x 3 / 1    32 x  32 x   3   ->    32 x  32 x  16  0.001 BFLOPs
    1 conv     16  3 x 3 / 1    32 x  32 x  16   ->    32 x  32 x  16  0.005 BFLOPs
    2 max          2 x 2 / 2    32 x  32 x  16   ->    16 x  16 x  16
    3 conv     32  3 x 3 / 1    16 x  16 x  16   ->    16 x  16 x  32  0.002 BFLOPs
    4 conv_TA   32  3 x 3 / 1    16 x  16 x  32   ->    16 x  16 x  32  0.005 BFLOPs
    5 max_TA       2 x 2 / 2    16 x  16 x  32   ->     8 x   8 x  32
    6 conv_TA   32  3 x 3 / 1     8 x   8 x  32   ->     8 x   8 x  32  0.001 BFLOPs
    7 conv_TA   32  3 x 3 / 1     8 x   8 x  32   ->     8 x   8 x  32  0.001 BFLOPs
    8 max_TA       2 x 2 / 2     8 x   8 x  32   ->     4 x   4 x  32
    9 connected_TA                          512  ->    64
darknetp: TEEC_InvokeCommand(FC) failed 0xffff3024 origin 0x3

I suspected that the error was related to model size, so I moved the partition point to layer 7 using the command below.

# darknetp classifier train -pp 7 cfg/cifar.data cfg/vgg-7_cifar10.cfg

But I still got the same error.

Prepare session with the TA
Begin darknet
vgg-7_cifar10
1
layer     filters    size              input                output
    0 conv     16  3 x 3 / 1    32 x  32 x   3   ->    32 x  32 x  16  0.001 BFLOPs
    1 conv     16  3 x 3 / 1    32 x  32 x  16   ->    32 x  32 x  16  0.005 BFLOPs
    2 max          2 x 2 / 2    32 x  32 x  16   ->    16 x  16 x  16
    3 conv     32  3 x 3 / 1    16 x  16 x  16   ->    16 x  16 x  32  0.002 BFLOPs
    4 conv     32  3 x 3 / 1    16 x  16 x  32   ->    16 x  16 x  32  0.005 BFLOPs
    5 max          2 x 2 / 2    16 x  16 x  32   ->     8 x   8 x  32
    6 conv     32  3 x 3 / 1     8 x   8 x  32   ->     8 x   8 x  32  0.001 BFLOPs
    7 conv_TA   32  3 x 3 / 1     8 x   8 x  32   ->     8 x   8 x  32  0.001 BFLOPs
    8 max_TA       2 x 2 / 2     8 x   8 x  32   ->     4 x   4 x  32
    9 connected_TA                          512  ->    64
darknetp: TEEC_InvokeCommand(FC) failed 0xffff3024 origin 0x3

Then I moved the partition point to layer 8 and got a TEEC_InvokeCommand(backward) error, shown below.

# darknetp classifier train -pp 8 cfg/cifar.data cfg/vgg-7_cifar10.cfg
Prepare session with the TA
Begin darknet
vgg-7_cifar10
1
layer     filters    size              input                output
    0 conv     16  3 x 3 / 1    32 x  32 x   3   ->    32 x  32 x  16  0.001 BFLOPs
    1 conv     16  3 x 3 / 1    32 x  32 x  16   ->    32 x  32 x  16  0.005 BFLOPs
    2 max          2 x 2 / 2    32 x  32 x  16   ->    16 x  16 x  16
    3 conv     32  3 x 3 / 1    16 x  16 x  16   ->    16 x  16 x  32  0.002 BFLOPs
    4 conv     32  3 x 3 / 1    16 x  16 x  32   ->    16 x  16 x  32  0.005 BFLOPs
    5 max          2 x 2 / 2    16 x  16 x  32   ->     8 x   8 x  32
    6 conv     32  3 x 3 / 1     8 x   8 x  32   ->     8 x   8 x  32  0.001 BFLOPs
    7 conv     32  3 x 3 / 1     8 x   8 x  32   ->     8 x   8 x  32  0.001 BFLOPs
    8 max_TA       2 x 2 / 2     8 x   8 x  32   ->     4 x   4 x  32
    9 connected_TA                          512  ->    64
   10 dropout_TA    p = 0.80                 64  ->    64
   11 connected_TA                           64  ->    10
   12 softmax_TA                                       10
   13 cost_TA                                          10
Learning Rate: 0.01, Momentum: 0.9, Decay: 5e-05
3000
32 28
output file: /media/results/train_vgg-7_cifar10_pp8.txt
current_batch=0 
Loaded: 0.391526 seconds
darknetp: TEEC_InvokeCommand(backward) failed 0xffff3024 origin 0x3

It worked fine when the partition point was set at the first fully connected layer, layer 9.

# darknetp classifier train -pp 9 cfg/cifar.data cfg/vgg-7_cifar10.cfg
Prepare session with the TA
Begin darknet
vgg-7_cifar10
1
layer     filters    size              input                output
    0 conv     16  3 x 3 / 1    32 x  32 x   3   ->    32 x  32 x  16  0.001 BFLOPs
    1 conv     16  3 x 3 / 1    32 x  32 x  16   ->    32 x  32 x  16  0.005 BFLOPs
    2 max          2 x 2 / 2    32 x  32 x  16   ->    16 x  16 x  16
    3 conv     32  3 x 3 / 1    16 x  16 x  16   ->    16 x  16 x  32  0.002 BFLOPs
    4 conv     32  3 x 3 / 1    16 x  16 x  32   ->    16 x  16 x  32  0.005 BFLOPs
    5 max          2 x 2 / 2    16 x  16 x  32   ->     8 x   8 x  32
    6 conv     32  3 x 3 / 1     8 x   8 x  32   ->     8 x   8 x  32  0.001 BFLOPs
    7 conv     32  3 x 3 / 1     8 x   8 x  32   ->     8 x   8 x  32  0.001 BFLOPs
    8 max          2 x 2 / 2     8 x   8 x  32   ->     4 x   4 x  32
    9 connected_TA                          512  ->    64
   10 dropout_TA    p = 0.80                 64  ->    64
   11 connected_TA                           64  ->    10
   12 softmax_TA                                       10
   13 cost_TA                                          10
Learning Rate: 0.01, Momentum: 0.9, Decay: 5e-05
3000
32 28
output file: /media/results/train_vgg-7_cifar10_pp9.txt
current_batch=0 
Loaded: 0.431457 seconds
1, 0.017: 0.000000, 0.000000 avg, 0.009992 rate, 62.028127 seconds, 50 images
user CPU start: 0.210258; end: 62.005439
kernel CPU start: 7.809596; end: 9.176006
Max: 32172  kilobytes
vmsize:75192; vmrss:31768; vmdata:70864; vmstk:132; vmexe:268; vmlib:1872
Loaded: 0.000641 seconds

So it seems that fully connected layers work when placed in the TrustZone, while the other layer types (maxpool and convolution) crash the program. Any help is appreciated!

ychen404 avatar Jun 11 '20 18:06 ychen404

Hi, for the TEEC_InvokeCommand(FC) error, it seems the FC layer cannot even be created, and I agree with you that the likely reason is that the TEE memory cannot hold this layer.

One possible reason for the TEEC_InvokeCommand(backward) error is, again, that the TEE memory is not enough: the backward pass requires considerable memory on top of what is needed to create the model and run the forward pass. This could be why the model works when you put fewer layers in the TEE.

You may want to calculate the number of parameters for each layer, mainly the Conv and FC layers; this helps estimate the memory usage of each layer. The TEE memory is relatively small when a single layer has many parameters, e.g., the 512 x 64 FC layer you used. Hope this helps!
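
A rough sketch of that per-layer estimate, using the layer shapes printed in the vgg-7 log above. The helper names (`conv_params`, `fc_params`) and the 4-bytes-per-float figure are assumptions for illustration, not taken from darknetz, and batch normalization parameters are ignored. The TA also needs buffers for weight updates, activations, and deltas, so treat the numbers as a lower bound.

```python
# Rough parameter-count estimate for the layers placed in the TEE.
# Assumption: parameters are stored as 32-bit floats.
BYTES_PER_FLOAT = 4


def conv_params(filters, size, in_channels):
    """Weights of a size x size convolution plus one bias per filter."""
    return filters * size * size * in_channels + filters


def fc_params(inputs, outputs):
    """Weights of a fully connected layer plus one bias per output."""
    return inputs * outputs + outputs


# Layer shapes copied from the "-pp 4" log of vgg-7_cifar10.
layers = {
    "4 conv_TA": conv_params(32, 3, 32),
    "6 conv_TA": conv_params(32, 3, 32),
    "7 conv_TA": conv_params(32, 3, 32),
    "9 connected_TA": fc_params(512, 64),
    "11 connected_TA": fc_params(64, 10),
}

for name, count in layers.items():
    kib = count * BYTES_PER_FLOAT / 1024
    print(f"{name:>16}: {count:6d} params, {kib:6.1f} KiB")
```

Parameter count is only part of the picture. In stock darknet, each layer also keeps an output buffer and a delta buffer sized by the batch, so a convolutional layer with a 16 x 16 x 32 output can need more activation memory than its weights suggest.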

mofanv avatar Jun 18 '20 08:06 mofanv

I'm closing this issue. Feel free to reopen it if you have more updates.

mofanv avatar Oct 08 '20 15:10 mofanv

Hi @mofanv

I am facing a similar problem when trying to train the model, but this time I got the error TEEC_InvokeCommand(backward) failed 0xffff0006 origin 0x4.

Here's the log:

# darknetp classifier train -pp_start 4 -pp_end 10 cfg/mnist.dataset cfg/mnist_lenet.cfg
Prepare session with the TA
Begin darknet
mnist_lenet
1
layer     filters    size              input                output
    0 conv      6  5 x 5 / 1    28 x  28 x   3   ->    28 x  28 x   6  0.001 BFLOPs
    1 max          2 x 2 / 2    28 x  28 x   6   ->    14 x  14 x   6
    2 conv      6  5 x 5 / 1    14 x  14 x   6   ->    14 x  14 x   6  0.000 BFLOPs
    3 max          2 x 2 / 2    14 x  14 x   6   ->     7 x   7 x   6
    4 connected_TA                          294  ->   120
    5 dropout_TA    p = 0.80                120  ->   120
    6 connected_TA                          120  ->    84
    7 dropout_TA    p = 0.80                 84  ->    84
    8 connected_TA                           84  ->    10
    9 softmax_TA                                       10
   10 cost_TA                                          10
workspace_size=235200
Learning Rate: 0.01, Momentum: 0.9, Decay: 5e-05
3000
28 28
output file: /media/results/train_mnist_lenet_pps4_ppe10.txt
current_batch=0 
Loaded: 0.284791 seconds
darknetp: TEEC_InvokeCommand(backward) failed 0xffff0006 origin 0x4

I found in optee_os/lib/libutee/include/tee_api_defines.h that error code 0xffff0006 refers to TEE_ERROR_BAD_PARAMETERS.
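
For reference, a small sketch that decodes the failure lines quoted in this thread. The tables hold only a subset of the return codes and origins defined by the GlobalPlatform TEE Client API; the `decode` helper and the regular expression are hypothetical and written just for these log lines.

```python
import re

# Subset of GlobalPlatform TEE Client API return codes.
TEEC_ERRORS = {
    0xFFFF0006: "TEEC_ERROR_BAD_PARAMETERS",
    0xFFFF000C: "TEEC_ERROR_OUT_OF_MEMORY",
    0xFFFF3024: "TEEC_ERROR_TARGET_DEAD",
}

# Return origins: which layer of the stack produced the code.
TEEC_ORIGINS = {
    0x1: "TEEC_ORIGIN_API",
    0x2: "TEEC_ORIGIN_COMMS",
    0x3: "TEEC_ORIGIN_TEE",
    0x4: "TEEC_ORIGIN_TRUSTED_APP",
}

# Matches lines such as:
#   darknetp: TEEC_InvokeCommand(backward) failed 0xffff0006 origin 0x4
LINE_RE = re.compile(
    r"TEEC_InvokeCommand\((\w+)\) failed (0x[0-9a-fA-F]+) origin (0x[0-9a-fA-F]+)"
)


def decode(line):
    """Return (command, error name, origin name), or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    command = match.group(1)
    code = int(match.group(2), 16)
    origin = int(match.group(3), 16)
    return (
        command,
        TEEC_ERRORS.get(code, f"unknown (0x{code:08x})"),
        TEEC_ORIGINS.get(origin, f"unknown (0x{origin:x})"),
    )


print(decode("darknetp: TEEC_InvokeCommand(backward) failed 0xffff0006 origin 0x4"))
print(decode("darknetp: TEEC_InvokeCommand(FC) failed 0xffff3024 origin 0x3"))
```

Read this way, the 0xffff0006 / origin 0x4 failure is a bad-parameters error returned by the trusted application itself. The earlier 0xffff3024 / origin 0x3 failure is a target-dead error reported by the TEE, which typically means the TA panicked.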

I have tried moving the partition point, but got the same error:

# darknetp classifier train -pp_start 9 -pp_end 10 cfg/mnist.dataset cfg/mnist_lenet.cfg
Prepare session with the TA
Begin darknet
mnist_lenet
1
layer     filters    size              input                output
    0 conv      6  5 x 5 / 1    28 x  28 x   3   ->    28 x  28 x   6  0.001 BFLOPs
    1 max          2 x 2 / 2    28 x  28 x   6   ->    14 x  14 x   6
    2 conv      6  5 x 5 / 1    14 x  14 x   6   ->    14 x  14 x   6  0.000 BFLOPs
    3 max          2 x 2 / 2    14 x  14 x   6   ->     7 x   7 x   6
    4 connected                             294  ->   120
    5 dropout       p = 0.80                120  ->   120
    6 connected                             120  ->    84
    7 dropout       p = 0.80                 84  ->    84
    8 connected                              84  ->    10
    9 softmax_TA                                       10
   10 cost_TA                                          10
workspace_size=235200
Learning Rate: 0.01, Momentum: 0.9, Decay: 5e-05
3000
28 28
output file: /media/results/train_mnist_lenet_pps9_ppe10.txt
current_batch=0 
Loaded: 0.303591 seconds
darknetp: TEEC_InvokeCommand(backward) failed 0xffff0006 origin 0x4

Do you have any idea what's happening here?
Thanks.

shibz-islam avatar Dec 08 '20 00:12 shibz-islam

Hi @shibz-islam, I just tried the master branch, and it ran as expected without errors. Can you please give more details, such as whether you are using a real device or simulation, and whether it is the latest version of OP-TEE?

mofanv avatar Dec 08 '20 09:12 mofanv

Hi @mofanv, thanks for the quick response. I am using simulation with QEMUv8 and the version of OP-TEE is 3.8.0. Do you think the problem could be with the version of OP-TEE as I am not using the latest version?

shibz-islam avatar Dec 08 '20 10:12 shibz-islam

I think 3.8.0 should be OK. I have used this version.

The error is quite strange. As far as I remember, I have not seen TEE_ERROR_BAD_PARAMETERS before. Maybe the params are set incorrectly when invoking the backward function? But since I can run it correctly, that shouldn't be the case; the application itself runs without error for me. I guess another possible cause is the environment setup. I suggest starting from the beginning or trying another machine if you have one.

mofanv avatar Dec 08 '20 22:12 mofanv

Okay I will try from the beginning. Thank you for your help.

shibz-islam avatar Dec 08 '20 22:12 shibz-islam

@mofanv It is working now! I actually had two darknetz projects in the build directory; maybe this was the cause of the problem. Once I set up the environment again with the latest version of OP-TEE, I didn't see that error again. I really appreciate your help.

shibz-islam avatar Dec 17 '20 08:12 shibz-islam