tslearn
Remove redundant second prange in inner for-loop
According to the Numba documentation, loop serialization occurs when there is more than one `prange` in a set of nested loops; in that case, only the outermost `prange` is executed in parallel. This PR just cleans up the code a little bit.
NOTE: this is neither a bug nor a feature request. So, I couldn't open an issue as I couldn't find any proper option there.
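To illustrate the behaviour this PR relies on, here is a minimal sketch (a toy function, not tslearn code), assuming `parallel=True` is set on the decorator: only the outermost `prange` is parallelized, and an inner `prange` would be serialized anyway, so a plain `range` is equivalent there.

```python
import numpy as np
from numba import njit, prange


@njit(parallel=True)
def row_sums(a):
    out = np.empty(a.shape[0])
    # Only this outermost prange is executed in parallel.
    for i in prange(a.shape[0]):
        s = 0.0
        # An inner prange would be serialized by Numba, so a plain range is equivalent.
        for j in range(a.shape[1]):
            s += a[i, j]
        out[i] = s
    return out


row_sums(np.random.rand(4, 3))  # first call compiles the function
```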
Interesting. Would you be able to benchmark execution speeds before and after your suggestion?
Interesting. Would you be able to benchmark execution speeds before and after your suggestion?
Sure I can. Just to confirm: `python setup.py install` should suffice while I am on this branch, right? After doing `git add`/`git commit`, I just need to run `python setup.py install` to make sure my changes take effect. Is there anything else that I should know? (For instance, should I `pip uninstall tslearn` before installing again?)
Performance Code:
# benchmark for tslearn/metrics/dtw_variants.py
# test on the function `njit_accumulated_matrix_from_dist_matrix`
import time

import numpy as np

# assuming the function is importable from this module
from tslearn.metrics.dtw_variants import njit_accumulated_matrix_from_dist_matrix

# input:
seed = 0
np.random.seed(seed)
D = np.random.rand(10000, 10000)
mask = np.full_like(D, 1, dtype=bool)

njit_accumulated_matrix_from_dist_matrix(D, mask)  # throw-away call to trigger compilation

lst = []
for i in range(100):
    tic = time.time()
    njit_accumulated_matrix_from_dist_matrix(D, mask)
    toc = time.time()
    lst.append(toc - tic)

print(np.mean(lst))
So, I averaged the computing time (in seconds) over 100 iterations:
existing version: 0.79965
with the inner-loop `prange` removed: 0.76393
Does this suffice? If not, could you please let me know what you have in mind?
Hello @NimaSarajpoor and @rtavenar,
Here is the current documentation for the automatic parallelization with @jit:
https://numba.readthedocs.io/en/stable/user/parallel.html
From what I understood, @njit is equivalent to @jit(nopython=True): http://numba.pydata.org/numba-doc/0.12.2/modules/numba.html#numba.decorators.njit https://github.com/numba/numba/blob/main/numba/core/decorators.py
By default, @njit and @jit have the parameter "parallel=False": https://numba.readthedocs.io/en/stable/reference/jit-compilation.html#numba.jit
In the file we are currently working on, https://github.com/tslearn-team/tslearn/blob/main/tslearn/metrics/dtw_variants.py, the decorator @njit() is therefore always used with the default parameter "parallel=False".
In the current documentation for the automatic parallelization with @jit:
https://numba.readthedocs.io/en/stable/user/parallel.html
we can read:
"Without "parallel=True" in the jit-decorator, the prange statement is equivalent to range."
As mentioned by @NimaSarajpoor, even if we use: @njit(parallel=True), it is written in the documentation that: "Loop serialization occurs when any number of prange driven loops are present inside another prange driven loop. In this case the outermost of all the prange loops executes in parallel and any inner prange loops (nested or otherwise) are treated as standard range based loops. Essentially, nested parallelism does not occur."
As we do not use the option "parallel=True" here, I think that we can change all "prange" occurrences into "range". By doing this, we also avoid importing "prange". It might bring clarity, but I do not think it will change the speed of the computations in a significant way.
What do you think about it?
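As a quick illustration of the quoted behaviour (toy functions, not tslearn code): with the default `parallel=False`, `prange` compiles to an ordinary sequential loop, so the two versions below are equivalent.

```python
from numba import njit, prange


@njit()  # parallel defaults to False, so prange behaves exactly like range
def total_with_prange(n):
    s = 0
    for i in prange(n):
        s += i
    return s


@njit()
def total_with_range(n):
    s = 0
    for i in range(n):
        s += i
    return s


assert total_with_prange(1000) == total_with_range(1000)
```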
@YannCabanes
we can read: "Without "parallel=True" in the jit-decorator, the prange statement is equivalent to range."
Nice catch! I totally overlooked this part.
As we do not use the option "parallel=True" here, I think that we can change all "prange" occurrences into "range"
What do you think about it?
I think we should just go with passing the keyword argument parallel=True to the decorator @njit(), and we should not remove prange, as we want to parallelize the process.
However, I will wait and see what @rtavenar has to say regarding this matter.
One cannot parallelize these loops since array values are computed using recurrence.
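For context, the accumulated cost matrix is filled with a recurrence of roughly the following form (a simplified sketch of the standard DTW recursion, not the exact tslearn implementation): each cell depends on already-computed neighbours, so the iterations of these loops cannot run independently.

```python
import numpy as np


def accumulated_cost(dist):
    """Simplified DTW accumulated cost; dist is an (n, m) pairwise distance matrix."""
    n, m = dist.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Each cell depends on the top, left and top-left cells,
            # so neither loop can be parallelized.
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[1:, 1:]
```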
Hello @rtavenar, Do you think that we should transform all "prange" into "range" and remove "prange" from the imports? (It might bring clarity, as we do not parallelize the loops; we would also avoid importing "prange".) Or do you prefer to leave it as it is now? (Changing "prange" into "range" will not speed up the code in a significant way.)
That seems fine to me. @NimaSarajpoor can you make sure that we do not degrade performance by doing so?
@NimaSarajpoor can you make sure that we do not degrade performance by doing so?
I will run the performance again after changing prange to range.
@YannCabanes Thanks for your input!
Hello @NimaSarajpoor, did you compare the performances between prange and range?
@YannCabanes
Not yet. I was busy resolving issue #413, as I couldn't run pytest. I have solved that issue and will resume this investigation in a few days.
So, I removed the redundant inner prange and I also passed parallel=True to the @njit decorator where appropriate (i.e. where we have a prange). I am providing the performance numbers below.
By the way, I will push the commits soon, within a day (or two).
| function | revised version (s) | main branch (s) |
|---|---|---|
| njit_accumulated_matrix_from_dist_matrix | 0.23 | 0.22 |
| sakoe_chiba_mask | 0.11 | 0.11 |
| _njit_itakura_mask | 0.148 | 0.15 |
| itakura_mask | 0.487 | 0.46 |
| njit_lb_envelope | 0.047 | 0.2 |
| _local_squared_dist | 0.01 | 0.01 |
Update: I corrected the numbers in the table. Apparently, I made a mistake before. It would be nice if someone could run the code after I push the commits.
And, here is the performance code:
import time

import numpy

# assuming these names are importable from the module under test,
# as referenced in this thread
from tslearn.metrics.dtw_variants import (
    njit_accumulated_matrix_from_dist_matrix,
    sakoe_chiba_mask,
    _njit_itakura_mask,
    itakura_mask,
    njit_lb_envelope,
    _local_squared_dist,
)


def get_performances():
    # njit_accumulated_matrix_from_dist_matrix
    njit_accumulated_matrix_from_dist_matrix(
        numpy.random.rand(5, 5),
        numpy.full((5, 5), 1, dtype=numpy.bool_),
    )  # throw-away call to trigger compilation
    seed = 0
    numpy.random.seed(seed)
    dist_matrix = numpy.random.rand(5000, 5000)
    mask = numpy.full_like(dist_matrix, 1, dtype=numpy.bool_)
    lst = []
    for i in range(50):
        tic = time.time()
        njit_accumulated_matrix_from_dist_matrix(dist_matrix, mask)
        toc = time.time()
        lst.append(toc - tic)
    out1 = numpy.mean(lst[10:-10])

    # sakoe_chiba_mask
    sakoe_chiba_mask(4, 4, 1)  # throw away
    lst = []
    for i in range(50):
        tic = time.time()
        sakoe_chiba_mask(5000, 5000, 1)
        toc = time.time()
        lst.append(toc - tic)
    out2 = numpy.mean(lst[10:-10])

    # _njit_itakura_mask
    _njit_itakura_mask(4, 4, 2.0)  # throw away (float slope, same types as the timed call)
    lst = []
    for i in range(50):
        tic = time.time()
        _njit_itakura_mask(5000, 5000, 2.0)
        toc = time.time()
        lst.append(toc - tic)
    out3 = numpy.mean(lst[10:-10])

    # itakura_mask
    itakura_mask(4, 4, 2.0)  # throw away
    lst = []
    for i in range(50):
        tic = time.time()
        itakura_mask(5000, 5000, 2.0)
        toc = time.time()
        lst.append(toc - tic)
    out4 = numpy.mean(lst[10:-10])

    # njit_lb_envelope
    njit_lb_envelope(numpy.random.rand(5, 1), 1)  # throw away
    seed = 0
    numpy.random.seed(seed)
    T = numpy.random.rand(10000, 5)
    w = len(T) // 10
    lst = []
    for i in range(50):
        tic = time.time()
        njit_lb_envelope(T, w)
        toc = time.time()
        lst.append(toc - tic)
    out5 = numpy.mean(lst[10:-10])

    # _local_squared_dist
    _local_squared_dist(numpy.random.rand(3), numpy.random.rand(3))  # throw away
    seed = 0
    numpy.random.seed(seed)
    x = numpy.random.rand(10000000)  # 1e7
    y = numpy.random.rand(10000000)  # 1e7
    lst = []
    for _ in range(50):
        tic = time.time()
        _local_squared_dist(x, y)
        toc = time.time()
        lst.append(toc - tic)
    out6 = numpy.mean(lst[10:-10])

    return [out1, out2, out3, out4, out5, out6]
@rtavenar @GillesVandewiele @YannCabanes FYI:
(1) It seems a comma was missing in tslearn/metrics/__init__.py!
(2) ALL tests in tslearn/tests are passing in my local repo.
I have checked, and the failing tests seem to be the same as the failing tests on tslearn's main branch. @rtavenar, do you think that this PR is ready to be merged?
I think it is not ready yet. I just reviewed the changes and noticed that I added parallel=True to the function _njit_itakura_mask but forgot to use prange in the outer for-loop. So, I need to correct that (i.e. use prange) and push again. Then, I will wait to see whether there is anything else that I should revise.
Hello @NimaSarajpoor, Except for the function "njit_lb_envelope" which is faster in the revised version than in the main branch, the other functions have almost the same computation time in the revised version and in the main branch. When we write "parallel=True", is it possible to be sure that the computations are really performed in parallel and to know how many processors are running in parallel? I guess that the number of processors used depends on the functions. Does it also depend on the machine on which the code is run?
When we write "parallel=True", is it possible to be sure that the computations are really performed in parallel and to know how many processors are running in parallel? I guess that the number of processors used depends on the functions. Does it also depend on the machine on which the code is run?
Yeah... good question! I do not know right now...but I will work on it and will get back to you :) I may run the performance test again.
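As an aside, Numba does expose some introspection for this; a minimal sketch (with a placeholder function `demo`, not tslearn code) could look like the following, assuming a reasonably recent Numba version.

```python
import numba
import numpy as np
from numba import njit, prange


@njit(parallel=True)
def demo(a):
    s = 0.0
    for i in prange(a.shape[0]):
        s += a[i]
    return s


demo(np.random.rand(10))            # compile first
print(numba.get_num_threads())      # number of threads Numba will use
demo.parallel_diagnostics(level=1)  # report which loops were actually parallelized
```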
Hello @NimaSarajpoor, Maybe you have already seen it, but there is a section "How to measure the performance of Numba?" on the webpage: https://numba.readthedocs.io/en/stable/user/5minguide.html#how-to-measure-the-performance-of-numba
The following warning is written:
First, recall that Numba has to compile your function for the argument types given before it executes the machine code version of your function. This takes time. However, once the compilation has taken place Numba caches the machine code version of your function for the particular types of arguments presented. If it is called again with the same types, it can reuse the cached version instead of having to compile again. A really common mistake when measuring performance is to not account for the above behaviour and to time code once with a simple timer that includes the time taken to compile your function in the execution time.
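For illustration, that warning boils down to the following pattern (a toy function, not the tslearn benchmark): the first call pays the compilation cost, and only later calls should be timed.

```python
import time

import numpy as np
from numba import njit


@njit
def total(a):
    s = 0.0
    for i in range(a.shape[0]):
        s += a[i]
    return s


a = np.random.rand(1_000_000)
total(a)  # first call: includes compilation time, do not time it
tic = time.time()
total(a)  # later calls reuse the cached machine code
print(time.time() - tic)
```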
Thanks for the info. Yeah, I do a similar thing. I posted the performance code in one of the previous comments, where you can see the approach I took for measuring the computing time.