
Remove redundant second prange in inner for-loop

Open NimaSarajpoor opened this issue 3 years ago • 19 comments

According to the numba documentation, loop serialization occurs when there is more than one `prange` in a set of nested loops. In that case, only the outermost `prange` is executed in parallel. This PR just cleans up the code a little bit.
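For illustration, here is a minimal sketch of the pattern in question (illustrative code only, not tslearn's actual implementation): the inner loop is serialized regardless, so writing it with `range` changes nothing about the generated code.

```python
import numpy as np
from numba import njit, prange


@njit(parallel=True)
def row_squared_sums(a):
    n, m = a.shape
    out = np.empty(n)
    for i in prange(n):  # outermost prange: executed in parallel
        s = 0.0
        for j in range(m):  # inner loop is serialized anyway, so `range` suffices
            s += a[i, j] ** 2
        out[i] = s
    return out


print(row_squared_sums(np.random.rand(4, 3)))
```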


NOTE: this is neither a bug report nor a feature request, so I could not find a proper option for opening an issue.

NimaSarajpoor avatar Jul 01 '22 08:07 NimaSarajpoor

Interesting. Would you be able to benchmark execution speeds before and after your suggestion?

rtavenar avatar Jul 01 '22 08:07 rtavenar

> Interesting. Would you be able to benchmark execution speeds before and after your suggestion?

Sure, I can. Just to confirm: running `python setup.py install` should suffice while I am on this branch, right? After `git add`/`commit`, I just need to run `python setup.py install` to make sure my changes take effect. Is there anything else I should know? (For example, should I run `pip uninstall tslearn` before installing again?)

NimaSarajpoor avatar Jul 01 '22 20:07 NimaSarajpoor

Performance code:

```python
# benchmarking `njit_accumulated_matrix_from_dist_matrix`
# (defined in tslearn/metrics/dtw_variants.py)
import time

import numpy as np

from tslearn.metrics.dtw_variants import njit_accumulated_matrix_from_dist_matrix

# input:
seed = 0
np.random.seed(seed)

D = np.random.rand(10000, 10000)
mask = np.full_like(D, 1, dtype=bool)

njit_accumulated_matrix_from_dist_matrix(D, mask)  # throw-away call (triggers compilation)

lst = []
for i in range(100):
    tic = time.time()
    njit_accumulated_matrix_from_dist_matrix(D, mask)
    toc = time.time()

    lst.append(toc - tic)

np.mean(lst)
```

So, I averaged the computing time (in seconds) over 100 iterations:

- existing version: 0.79965
- with the `prange` removed from the inner loop: 0.76393

Does this suffice? If not, could you please let me know what you have in mind?

NimaSarajpoor avatar Jul 02 '22 02:07 NimaSarajpoor

Hello @NimaSarajpoor and @rtavenar,

Here is the current documentation for the automatic parallelization with @jit:
https://numba.readthedocs.io/en/stable/user/parallel.html

From what I understand, `@njit` is equivalent to `@jit(nopython=True)`:
http://numba.pydata.org/numba-doc/0.12.2/modules/numba.html#numba.decorators.njit
https://github.com/numba/numba/blob/main/numba/core/decorators.py

By default, @njit and @jit have the parameter "parallel=False": https://numba.readthedocs.io/en/stable/reference/jit-compilation.html#numba.jit

In the file we are currently working on, https://github.com/tslearn-team/tslearn/blob/main/tslearn/metrics/dtw_variants.py, the decorator `@njit()` is therefore always used with the default parameter `parallel=False`.

In the current documentation for the automatic parallelization with @jit:
https://numba.readthedocs.io/en/stable/user/parallel.html we can read: "Without "parallel=True" in the jit-decorator, the prange statement is equivalent to range."

As mentioned by @NimaSarajpoor, even if we use: @njit(parallel=True), it is written in the documentation that: "Loop serialization occurs when any number of prange driven loops are present inside another prange driven loop. In this case the outermost of all the prange loops executes in parallel and any inner prange loops (nested or otherwise) are treated as standard range based loops. Essentially, nested parallelism does not occur."

Since we do not use the option `parallel=True` here, I think we can change all `prange` occurrences into `range`. Doing this, we also avoid importing `prange`. It might bring clarity, but I do not think it will change the speed of the computations in a significant way.
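For illustration, a minimal sketch of this documented behaviour (illustrative code only, not tslearn's): with the default `parallel=False`, a `prange` loop is compiled exactly like a `range` loop.

```python
import numpy as np
from numba import njit, prange


@njit()  # default: parallel=False
def total_with_prange(a):
    s = 0.0
    for i in prange(a.shape[0]):  # behaves exactly like range here
        s += a[i]
    return s


@njit()
def total_with_range(a):
    s = 0.0
    for i in range(a.shape[0]):
        s += a[i]
    return s


x = np.random.rand(1000)
assert np.isclose(total_with_prange(x), total_with_range(x))
```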

What do you think about it?

YannCabanes avatar Jul 06 '22 05:07 YannCabanes

@YannCabanes

> we can read: "Without "parallel=True" in the jit-decorator, the prange statement is equivalent to range."

Nice catch! I totally overlooked this part.

> As we do not use the option "parallel=True" here, I think that we can change all "prange" occurrences into "range"

> What do you think about it?

I think we should just pass the keyword argument `parallel=True` to the decorator `@njit()` and keep `prange`, since we want to parallelize the process.

However, I will wait and see what @rtavenar has to say regarding this matter.

NimaSarajpoor avatar Jul 06 '22 06:07 NimaSarajpoor

One cannot parallelize these loops since array values are computed using recurrence.
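For context, here is a simplified sketch of the kind of recurrence involved (loosely modelled on a DTW accumulated cost matrix, not tslearn's exact implementation): each cell depends on neighbours that must already have been computed, so the iterations cannot be distributed across threads.

```python
import numpy as np
from numba import njit


@njit()
def accumulated_cost(dist):
    n, m = dist.shape
    cum = np.full((n, m), np.inf)
    cum[0, 0] = dist[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = np.inf
            if i > 0:
                best = min(best, cum[i - 1, j])
            if j > 0:
                best = min(best, cum[i, j - 1])
            if i > 0 and j > 0:
                best = min(best, cum[i - 1, j - 1])
            # cum[i, j] depends on cells computed in earlier iterations,
            # so the loop order matters and the loop cannot be parallelized.
            cum[i, j] = dist[i, j] + best
    return cum


print(accumulated_cost(np.random.rand(4, 4))[-1, -1])
```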

rtavenar avatar Jul 06 '22 06:07 rtavenar

Hello @rtavenar, do you think we should transform all `prange` into `range` and remove `prange` from the imports? (It might bring clarity since we do not parallelize the loops, and we avoid importing `prange`.) Or do you prefer to leave it as it is now? (Changing `prange` into `range` will not speed up the code in a significant way.)

YannCabanes avatar Jul 06 '22 16:07 YannCabanes

That seems fine to me. @NimaSarajpoor can you make sure that we do not degrade performance by doing so?

rtavenar avatar Jul 07 '22 07:07 rtavenar

> @NimaSarajpoor can you make sure that we do not degrade performance by doing so?

I will run the performance again after changing prange to range.

@YannCabanes Thanks for your input!

NimaSarajpoor avatar Jul 07 '22 07:07 NimaSarajpoor

Hello @NimaSarajpoor, did you compare the performances between prange and range?

YannCabanes avatar Jul 21 '22 20:07 YannCabanes

@YannCabanes

Not yet. I was busy resolving issue #413, as I couldn't run pytest. I solved that issue and will resume the investigation in a few days.

NimaSarajpoor avatar Jul 21 '22 22:07 NimaSarajpoor

So, I removed the redundant inner `prange` and also passed `parallel=True` to the `@njit` decorator where appropriate (i.e. where we have a `prange`). I am providing the performance numbers below.

btw, I will push the commits soon, within a day (or two).

| Function | Revised version (s) | Main branch (s) |
| --- | --- | --- |
| `njit_accumulated_matrix_from_dist_matrix` | 0.23 | 0.22 |
| `sakoe_chiba_mask` | 0.11 | 0.11 |
| `_njit_itakura_mask` | 0.148 | 0.15 |
| `itakura_mask` | 0.487 | 0.46 |
| `njit_lb_envelope` | 0.047 | 0.2 |
| local dist | 0.01 | 0.01 |

Update: I corrected the numbers in the table; apparently, I made a mistake before. It would be nice if someone could run the code after I push the commits.

And, here is the performance code:

Click to expand!

```python
import time

import numpy

# functions under test, defined in tslearn/metrics/dtw_variants.py
from tslearn.metrics.dtw_variants import (
    njit_accumulated_matrix_from_dist_matrix,
    sakoe_chiba_mask,
    _njit_itakura_mask,
    itakura_mask,
    njit_lb_envelope,
    _local_squared_dist,
)


def get_performances():
    # njit_accumulated_matrix_from_dist_matrix
    njit_accumulated_matrix_from_dist_matrix(
        numpy.random.rand(5, 5),
        numpy.full((5,5), 1, dtype=numpy.bool_)
    ) # throw away

    seed = 0
    numpy.random.seed(seed)
    dist_matrix = numpy.random.rand(5000, 5000)
    mask = numpy.full_like(dist_matrix, 1, dtype=numpy.bool_)

    lst = []
    for i in range(50):
        tic = time.time()
        njit_accumulated_matrix_from_dist_matrix(dist_matrix, mask)
        toc = time.time()
        lst.append(toc - tic)

    out1 = numpy.mean(lst[10:-10])
    ###################

    #sakoe_chiba_mask
    sakoe_chiba_mask(4, 4, 1) # throw away

    lst = []
    for i in range(50):
        tic = time.time()
        sakoe_chiba_mask(5000, 5000, 1)
        toc = time.time()
        lst.append(toc - tic)

    out2 = numpy.mean(lst[10:-10])
    ####################

    # _njit_itakura_mask
    _njit_itakura_mask(4, 4, 1) # throw away

    lst = []
    for i in range(50):
        tic = time.time()
        _njit_itakura_mask(5000, 5000, 2.0)
        toc = time.time()
        lst.append(toc - tic)

    out3 = numpy.mean(lst[10:-10])
    #####################

    #itakura_mask
    itakura_mask(4, 4, 1) # throw away

    lst = []
    for i in range(50):
        tic = time.time()
        itakura_mask(5000, 5000, 2.0)
        toc = time.time()
        lst.append(toc - tic)

    out4 = numpy.mean(lst[10:-10])
    ######################

    # njit_lb_envelope
    njit_lb_envelope(numpy.random.rand(5, 1), 1) # throw away

    seed = 0
    numpy.random.seed(seed)
    T = numpy.random.rand(10000, 5)
    w = len(T)//10

    lst = []
    for i in range(50):
        tic = time.time()
        njit_lb_envelope(T, w)
        toc = time.time()
        lst.append(toc - tic)

    out5 = numpy.mean(lst[10:-10])
    ###########################

    # _local_squared_dist
    _local_squared_dist(numpy.random.rand(3), numpy.random.rand(3)) # throw away

    seed = 0
    numpy.random.seed(seed)
    x = numpy.random.rand(10000000) # 1e7
    y = numpy.random.rand(10000000) # 1e7

    lst = []
    for _ in range(50):
        tic = time.time()
        _local_squared_dist(x, y)
        toc = time.time()
        lst.append(toc - tic)

    out6 = numpy.mean(lst[10:-10])
    ############################

    return [out1, out2, out3, out4, out5, out6]
```

NimaSarajpoor avatar Jul 29 '22 02:07 NimaSarajpoor

@rtavenar @GillesVandewiele @YannCabanes FYI:

(1) It seems a comma was missing in tslearn/metrics/__init__.py!

(2) ALL tests in tslearn/tests are passing in my local repo.

NimaSarajpoor avatar Jul 29 '22 18:07 NimaSarajpoor

I have checked, and the failing tests seem to be the same as the failing tests on tslearn's main branch. @rtavenar, do you think that this PR is ready to be merged?

YannCabanes avatar Aug 01 '22 05:08 YannCabanes

I think it is not ready yet. I just reviewed the changes and noticed that I added `parallel=True` to the function `_njit_itakura_mask` but forgot to use `prange` in the outer for-loop. So, I need to correct that (i.e. use `prange`) and push again. I will also wait to see whether there is anything else I should revise.
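For reference, a minimal sketch of the intended pattern (illustrative only, not the actual `_njit_itakura_mask` code): `parallel=True` only has an effect if the outer loop actually uses `prange`; with a plain `range`, the function still runs sequentially.

```python
import numpy as np
from numba import njit, prange


@njit(parallel=True)
def banded_mask(sz1, sz2, radius):
    # Rows are filled independently, so the outer loop can safely use prange.
    mask = np.full((sz1, sz2), np.inf)
    for i in prange(sz1):
        for j in range(sz2):
            if abs(i - j) <= radius:
                mask[i, j] = 0.0
    return mask


print(banded_mask(5, 5, 1))
```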

NimaSarajpoor avatar Aug 01 '22 05:08 NimaSarajpoor

Hello @NimaSarajpoor, except for the function `njit_lb_envelope`, which is faster in the revised version than in the main branch, the other functions have almost the same computation time in both versions. When we write `parallel=True`, is it possible to be sure that the computations are really performed in parallel, and to know how many processors are running in parallel? I guess that the number of processors used depends on the function. Does it also depend on the machine on which the code is run?

YannCabanes avatar Aug 02 '22 02:08 YannCabanes

> When we write "parallel=True", is it possible to be sure that the computations are really performed in parallel and to know how many processors are running in parallel? I guess that the number of processors used depends on the functions. Does it also depend on the machine on which the code is run?

Yeah... good question! I do not know right now... but I will work on it and will get back to you :) I may run the performance test again.
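For reference, numba exposes both the configured thread count and per-function parallel diagnostics; here is a minimal sketch (assuming a reasonably recent numba version):

```python
import numba
import numpy as np
from numba import njit, prange


@njit(parallel=True)
def row_sums(a):
    out = np.empty(a.shape[0])
    for i in prange(a.shape[0]):
        out[i] = a[i, :].sum()
    return out


row_sums(np.random.rand(100, 100))  # compile and run once

# Number of threads numba will use (configurable via numba.set_num_threads
# or the NUMBA_NUM_THREADS environment variable).
print(numba.get_num_threads())

# Report which loops were actually parallelized for this function.
row_sums.parallel_diagnostics(level=1)
```

By default, the thread count is based on the number of CPU cores detected, so it does depend on the machine the code runs on.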

NimaSarajpoor avatar Aug 02 '22 17:08 NimaSarajpoor

Hello @NimaSarajpoor, maybe you have already seen it, but there is a section "How to measure the performance of Numba?" on this page: https://numba.readthedocs.io/en/stable/user/5minguide.html#how-to-measure-the-performance-of-numba

The following warning is written:

> First, recall that Numba has to compile your function for the argument types given before it executes the machine code version of your function. This takes time. However, once the compilation has taken place Numba caches the machine code version of your function for the particular types of arguments presented. If it is called again with the same types, it can reuse the cached version instead of having to compile again. A really common mistake when measuring performance is to not account for the above behaviour and to time code once with a simple timer that includes the time taken to compile your function in the execution time.
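In other words, a minimal, self-contained sketch of the measurement pattern the docs recommend (illustrative only): time a warm-up call separately so compilation time is excluded from the benchmark.

```python
import time

import numpy as np
from numba import njit


@njit()
def total(a):
    s = 0.0
    for i in range(a.shape[0]):
        s += a[i]
    return s


x = np.random.rand(1_000_000)

t0 = time.perf_counter()
total(x)  # first call: includes compilation time
t1 = time.perf_counter()
total(x)  # subsequent calls: cached machine code only
t2 = time.perf_counter()

print("with compilation:   ", t1 - t0)
print("without compilation:", t2 - t1)
```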

YannCabanes avatar Aug 03 '22 23:08 YannCabanes

Thanks for the info. Yeah, I do a similar thing: I posted the performance code in one of the previous comments, where you can see the approach I took for measuring the computing time.

NimaSarajpoor avatar Aug 04 '22 06:08 NimaSarajpoor