
Poor performance with big dilation rates (was "Performance on Branching Models" before)

Open kmader opened this issue 7 years ago • 16 comments

Performance on a fairly simple 1D model with stacked branches / bottlenecks / inception blocks was surprisingly poor (>300x slower than Keras).

Keras on CPU (tensorflow lite for the same model is 170µs)

Using the 1 CPU/Thread trick

import numpy as np
import tensorflow as tf
import keras
from keras.models import model_from_json

session_conf = tf.ConfigProto(
      intra_op_parallelism_threads=1,
      inter_op_parallelism_threads=1)
sess = tf.Session(config=session_conf)
keras.backend.set_session(sess)
model = model_from_json(json_blob)
%%time
model.predict(np.zeros((1, 600, 3)));
CPU times: user 88.1 ms, sys: 35.1 ms, total: 123 ms
Wall time: 25.1 ms

Frugally Deep App

I realize the timing here includes loading and parsing the model, but that is a small fraction of the total compute time (as can be seen by varying the length of the input sequence).

!time ../build/main
Loading model
Loading json ... done. elapsed time: 0.192233 s
Running test 1 of 1 ... done. elapsed time: 1.472293 s
Loading, constructing, testing of ../model.json took 3.309670 s overall.
Running all zeros: 

real	0m31.150s
user	0m30.712s
sys	0m0.245s

Model Details

Model as Keras H5, FrugallyDeep JSON, and C++ file for inference: https://gist.github.com/kmader/135db41c5ea35c0dc8cae95ed90087f4

(attached image: out_model)

kmader avatar Mar 27 '19 09:03 kmader

My suspicion would be that the model is being evaluated from finish to start without caching any of the intermediate results, recomputing every value every time (which would be roughly 3*8*8*8 as much work), but I haven't had time to dive into the code and figure out if that is really the case.

kmader avatar Mar 27 '19 09:03 kmader

My suspicion would be that the model is being evaluated from finish to start without caching any of the intermediate results, recomputing every value every time (which would be roughly 3*8*8*8 as much work)

Good thinking, but that is not the case. Frugally-deep does cache the output of each node in the computational graph. :)
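(For illustration, here is a minimal sketch of what such per-node output caching looks like. The names and structure are hypothetical, not frugally-deep's actual code, which is C++.)

```python
# Hypothetical sketch of per-node output caching in a computational graph.
# Each node's output is computed once and reused by all consumers, so a
# diamond-shaped graph does not recompute the shared branch.

class Node:
    def __init__(self, name, parents, fn):
        self.name = name
        self.parents = parents
        self.fn = fn
        self.apply_count = 0  # for demonstration only

    def evaluate(self, cache):
        if self.name in cache:
            return cache[self.name]
        inputs = [p.evaluate(cache) for p in self.parents]
        self.apply_count += 1
        result = self.fn(inputs)
        cache[self.name] = result
        return result

# Diamond graph: a -> b, a -> c, (b, c) -> d
a = Node("a", [], lambda _: 1)
b = Node("b", [a], lambda xs: xs[0] + 1)
c = Node("c", [a], lambda xs: xs[0] * 10)
d = Node("d", [b, c], lambda xs: xs[0] + xs[1])

result = d.evaluate({})
print(result)          # 12
print(a.apply_count)   # 1, not 2: the shared node ran only once
```

Without the cache lookup, the shared node `a` would be evaluated once per consumer, and the cost would multiply with every level of branching.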

I just ran your code on my machine, and the output is as follows:

Loading model
Loading json ... done. elapsed time: 0.100226 s
Running test 1 of 1 ... done. elapsed time: 0.034413 s
Loading, constructing, testing of ./model.json took 0.250775 s overall.
Running all zeros: 

real    0m0.863s
user    0m0.742s
sys     0m0.116s

The first forward pass (the test one) took 0.034413 s.

So I would guess the reason for the bad performance you see is much simpler, e.g.:

  • You did not tell your compiler to produce optimized machine code.
  • You are using hardware from the 90s.
  • You are mining too much bitcoin in the background.

OK, jokes apart, it's very likely the first of the three above. Please check this part of the FAQ for more details: https://github.com/Dobiasd/frugally-deep/blob/master/FAQ.md#why-is-my-prediction-roughly-100-times-slower-in-c-as-in-python
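For reference, the difference usually comes down to a single compiler flag; something like the following (the file names here are just placeholders):

```shell
# Debug build: no optimizations, fine for debugging, very slow at runtime.
g++ -std=c++14 -g main.cpp -o main

# Release build: let the compiler optimize; this is what the FAQ recommends.
g++ -std=c++14 -O3 main.cpp -o main
```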

Dobiasd avatar Mar 27 '19 12:03 Dobiasd

Wow, thanks, I have never seen such a crazy speedup between -g and -O3, issue resolved!

Loading model
Loading json ... done. elapsed time: 0.064939 s
Running test 1 of 1 ... done. elapsed time: 0.037276 s
Loading, constructing, testing of ../model.json took 0.270976 s overall.
Running all zeros: 

real	0m0.990s
user	0m0.797s
sys	0m0.186s

kmader avatar Mar 27 '19 12:03 kmader

Running it on a vector of size (1, 1, 18000, 3), it seems to scale very differently (I guess the keras/tf runtime is mostly overhead, while in frugally-deep it is mostly compute), but it is now a factor of ~60 slower:

Loading model
Loading json ... done. elapsed time: 0.059423 s
Running test 1 of 1 ... done. elapsed time: 0.037828 s
Loading, constructing, testing of ../model.json took 0.263522 s overall.
Running all zeros: 

real	0m29.779s
user	0m20.428s
sys	0m7.359s

Keras Evaluation

%%time
_ = model.predict(np.random.uniform(size=(1, 3600, 3)))
CPU times: user 493 ms, sys: 211 ms, total: 705 ms
Wall time: 143 ms

kmader avatar Mar 27 '19 13:03 kmader

Sorry, I'm not sure I understand correctly.

So when using larger input vectors, you again have a performance problem with frugally-deep, or not?

Dobiasd avatar Mar 27 '19 14:03 Dobiasd

Wow, thanks, I have never seen such a crazy speedup between -g and -O3, issue resolved!

Yeah, frugally-deep, and especially FunctionalPlus, which it uses, rely heavily on compiler optimizations. I tried to write the code to be as readable as possible for humans. For this, I avoided manual optimizations, but always checked (with profiling) whether the compilers are able to do them for us. :)

Dobiasd avatar Mar 27 '19 14:03 Dobiasd

Well, to summarize the (admittedly very roughly conducted) benchmarks above:

| Vector Size | Keras (s) | Frugally-deep (s) |
|-------------|-----------|-------------------|
| 600         | 0.088     | 0.990             |
| 18000       | 0.493     | 29.779            |

As the input gets larger by a factor of 30, Keras takes about 5x longer, while frugally-deep takes about 30x longer. So there are some other things coming into play in tensorflow that let it handle larger inputs better.
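As a quick sanity check of those factors, using the wall-clock numbers from the table:

```python
# Scaling factors from the rough benchmarks above (times in seconds).
keras_600, keras_18000 = 0.088, 0.493
fdeep_600, fdeep_18000 = 0.990, 29.779

keras_factor = keras_18000 / keras_600    # ~5.6x
fdeep_factor = fdeep_18000 / fdeep_600    # ~30.1x
print(round(keras_factor, 1), round(fdeep_factor, 1))
```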

kmader avatar Mar 27 '19 14:03 kmader

Huh, that is interesting.

I tested your model with the following code:

#include "fdeep/fdeep.hpp"

#include <cstddef>
#include <iostream>

int main()
{
    const std::size_t test_runs = 3;
    const auto model = fdeep::load_model("model.json", true);
    fdeep::tensor5s inputs = {fdeep::tensor5(fdeep::shape5(1, 1, 1, 18000, 3), 0)};
    fplus::stopwatch stopwatch;
    for (std::size_t i = 0; i < test_runs; ++i)
    {
        std::cout << "Starting test run " << i << "." << std::endl;
        model.predict(inputs);
    }
    const double duration_avg = stopwatch.elapsed() / static_cast<double>(test_runs);
    std::cout << "Forward pass took " << duration_avg << " s on average." << std::endl;
}

and had this output:

Loading json ... done. elapsed time: 0.098866 s
Running test 1 of 1 ... done. elapsed time: 0.034203 s
Loading, constructing, testing of model.json took 0.254174 s overall.
Starting test run 0
Starting test run 1
Starting test run 2
Forward pass took 19.488 s on average.

So indeed, it is quite slow.


The equivalent Python code

from keras.models import load_model
import numpy as np
import datetime

test_runs = 3
model = load_model('out_model.h5')
data_in = np.zeros((1, 18000, 3))
start_time = datetime.datetime.now()
for i in range(test_runs):
    print(f'Starting test run {i}.')
    model.predict(data_in)
end_time = datetime.datetime.now()
duration_avg = (end_time - start_time) / test_runs
print('Forward pass took {} s on average.'.format(duration_avg))

ran with:

CUDA_VISIBLE_DEVICES='' taskset --cpu-list 1 python3 main.py

is much faster:

Starting test run 0.
Starting test run 1.
Starting test run 2.
Forward pass took 0:00:01.365790 s on average.

I'll see if I can find out what is going on there, and will get back to you with the results.

Dobiasd avatar Mar 27 '19 14:03 Dobiasd

Mhh, the profiler output does not look that much different compared to, for example, VGG19 runs:

Your model: (profiler screenshot Y8nKFMJ)

VGG19: (profiler screenshot fedoxop)

So I don't yet know what is happening.

Dobiasd avatar Mar 27 '19 15:03 Dobiasd

To make sure the caching actually is working, I added

std::cout << "Apply layer '" << name_ << "'." << std::endl;

to the beginning of fdeep::layer::apply.

The output looks OK:

Apply layer 'C1D_Stacked'.
Apply layer 'diff'.
Apply layer ''.
Apply layer 'PaddingDiffEdges'.
Apply layer 'batch_normalization_2'.
Apply layer 'conv1d_6'.
Apply layer ''.
Apply layer 'conv1d_7'.
Apply layer ''.
Apply layer 'C1D_L0_D1'.
Apply layer ''.
Apply layer 'C1D_L0_D2'.
Apply layer ''.
Apply layer 'C1D_L0_D4'.
Apply layer ''.
Apply layer 'C1D_L0_D8'.
Apply layer ''.
Apply layer 'C1D_L0_D16'.
Apply layer ''.
Apply layer 'C1D_L0_D32'.
Apply layer ''.
Apply layer 'C1D_L0_D64'.
Apply layer ''.
Apply layer 'concatenate_3'.
Apply layer 'BN_L0'.
Apply layer 'Relu_L0'.
Apply layer 'conv1d_8'.
Apply layer ''.
Apply layer 'C1D_L1_D1'.
Apply layer ''.
Apply layer 'C1D_L1_D2'.
Apply layer ''.
Apply layer 'C1D_L1_D4'.
Apply layer ''.
Apply layer 'C1D_L1_D8'.
Apply layer ''.
Apply layer 'C1D_L1_D16'.
Apply layer ''.
Apply layer 'C1D_L1_D32'.
Apply layer ''.
Apply layer 'C1D_L1_D64'.
Apply layer ''.
Apply layer 'add_2'.
Apply layer 'BN_L1'.
Apply layer 'Relu_L1'.
Apply layer 'C1D_L2_D1'.
Apply layer ''.
Apply layer 'C1D_L2_D2'.
Apply layer ''.
Apply layer 'C1D_L2_D4'.
Apply layer ''.
Apply layer 'C1D_L2_D8'.
Apply layer ''.
Apply layer 'C1D_L2_D16'.
Apply layer ''.
Apply layer 'C1D_L2_D32'.
Apply layer ''.
Apply layer 'C1D_L2_D64'.
Apply layer ''.
Apply layer 'concatenate_4'.
Apply layer 'BN_L2'.
Apply layer 'Relu_L2'.
Apply layer 'SmoothingStep'.
Apply layer 'cropping1d_7'.
Apply layer 'zero_padding1d_3'.
Apply layer 'conv1d_9'.
Apply layer ''.
Apply layer 'drive_right_down'.
Apply layer ''.
Apply layer 'conv1d_10'.
Apply layer ''.
Apply layer 'hiccup_event'.
Apply layer ''.
Apply layer 'drive_sep_features'.
Apply layer ''.
Apply layer 'drive_sep_output'.
Apply layer ''.
Apply layer 'cropping1d_8'.
Apply layer 'zero_padding1d_4'.
Apply layer 'drive_dist'.
Apply layer ''.

So no layer is applied more than once.

:man_shrugging:

Dobiasd avatar Mar 27 '19 15:03 Dobiasd

Thanks, that definitely seems like a good approach. Unfortunately, trying to figure out what tensorflow does with the model will probably be a lot messier.

The caching would have been the easiest problem to fix. I guess it is possibly some optimization being applied on the TF side, or maybe how it deals with large dilation rates (_D64 means dilation_rate=64), that could be causing the problem. (For reference, the model is sort of a stacked WaveNet, which is why those dilations come up: https://deepmind.com/blog/wavenet-generative-model-raw-audio/)

kmader avatar Mar 27 '19 15:03 kmader

Ah, very good hint. I did not catch this property of the model.

Yes, such a dilation rate will cause frugally-deep to create huge tensors, consisting of mostly 0s. Here is the corresponding part of the code: https://github.com/Dobiasd/frugally-deep/blob/bd9ca973a669d602beaedc9d1a291ca20b071716/include/fdeep/tensor5.hpp#L597

Subsequently, these big tensors are fed into the convolution step, which consequently takes a lot of time.
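A minimal sketch of that zero-insertion approach (hypothetical illustration, not the actual fdeep code linked above): a 1D kernel of length k with dilation rate d is expanded to an effective length of d*(k-1)+1, mostly zeros, so the convolution cost grows with the dilation rate.

```python
# Hypothetical illustration of dilating a 1D kernel by inserting zeros.
# A kernel of length k with dilation rate d grows to d * (k - 1) + 1 taps,
# so with dilation_rate=64 almost all of the multiplications are by zero.
def dilate_kernel_1d(kernel, rate):
    dilated = [0.0] * (rate * (len(kernel) - 1) + 1)
    for i, weight in enumerate(kernel):
        dilated[i * rate] = weight
    return dilated

print(dilate_kernel_1d([1.0, 2.0, 3.0], 4))
# a 3-tap kernel with rate 4 becomes 4 * (3 - 1) + 1 = 9 taps
```

TensorFlow instead handles dilation without materializing the zero-padded kernel, which would explain the scaling gap observed above.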

Dobiasd avatar Mar 27 '19 15:03 Dobiasd

I guess now we need a hero to step forward and dive into the convolution code of frugally-deep, finding and implementing some trick to better cope with dilation. :goberserk:

Anybody of the attendees, perhaps? :eyes:

Dobiasd avatar Mar 27 '19 15:03 Dobiasd

Yeah, that makes sense. I can try to figure out what tensorflow lite does, as maybe there is something easy that would work here.

kmader avatar Mar 27 '19 21:03 kmader

@kmader How is it going? Did you find out anything? :)

Dobiasd avatar Jul 05 '19 09:07 Dobiasd

@Dobiasd sorry, I started looking into it but got caught up in other projects. I'll try to dive a bit deeper next week and see if I can find anything.

kmader avatar Jul 05 '19 21:07 kmader

@besmaGuesmi That's no problem. In the FAQ you can find explanations on how to use images as input (convert image to fdeep::tensor) and on how to convert the output of fdeep::model::predict, i.e., an fdeep::tensor, back to an image.

P.S. You posted your question as a comment in an issue that is about "performance with big dilation rates", which is unrelated to your question. In the future, please don't do this, but instead open a new issue, so concerns can stay separated.

Dobiasd avatar Aug 14 '21 05:08 Dobiasd