Performance enhancements of FieldTrip functions via Parallel Computing Toolbox code constructs
After quantifying performance bottlenecks and prioritizing certain high-level or low-level FieldTrip functions (see #1851), and optimizing the code (see #1852), further performance enhancements can be implemented using parallel execution constructs like parfor, or using GPU computing.
Found a few candidates for this parfor conversion so far, based on evaluating the tutorial test code: ft_componentanalysis, ft_freqanalysis, ft_networkanalysis, and ft_timelockanalysis. (It should probably be no surprise that they are all "*analysis" functions.) While they all have a main loop structure that looks amenable to parfor use, a direct drop-in conversion has not worked; they will require a little refactoring: ft_timelockanalysis needs to either invert the loop and the test for the keeptrials case or split the not-keeptrials case into separate calculation and summation steps, and ft_freqanalysis and ft_componentanalysis need some tweaking in how they collect some of their output variables. Proceeding to work on that.
I have created pull requests for the adaptations of ft_timelockanalysis and ft_networkanalysis.
As noted in both, I have not found test cases that spend a significant amount of time in the main loops of these functions. Are there any that are based on the ftp data?
@robertoostenveld One possible question: should it also be possible to run them explicitly NOT in parallel? Depending on the data, running in parallel might not do much. It should not be hugely detrimental, but if nothing else uses the parallel pool, just starting it up could take some time. It is possible to add a cfg option for parfor_maxnumworkers, which can be set to 0 to not use the parpool at all. I could still add that to both PRs.
Most test cases in fieldtrip/test are on purpose designed to use little memory and take little time. To evaluate more realistic scenarios, they should be "blown up" to process larger amounts of data, or should be executed many times.
Could you make a report that shows what the speedups are for some specific test cases? Perhaps as a Google sheet, with one column that links to the specific test case, one column for the original time, and one for the time when using parfor.
I have added my timing-information creation scripts in https://github.com/AljenU/fieldtrip-parallel-support/tree/additional_tests, which now has a directory Timingscripts.
The results are for
- 'orig', the state of the fieldtrip master that my branches are based on
- 'parfor', with the for loop in the analysis changed to a parfor loop, and using as many workers as available
- 'zeropar', with the for loop in the analysis changed to a parfor loop, but having temporarily set the maximum number of workers to zero
Note that this is the tic-toc time for the full function being run. The specific analysis being parfor-ed only takes very limited time within that, at maximum some 20 seconds.
The results so far, after running on the computer here:
>> timing_disp
timelock
|         | ft_bench_01_timelockanalysis_tut_15 | ft_bench_01_timelockanalysis_01 | ft_bench_01_timelockanalysis_01a | ft_bench_01_timelockanalysis_02 | ft_bench_01_timelockanalysis_02a | ft_ex_10_ica_remove_ecg_artifacts |
|---------|---|---|---|---|---|---|
| orig    | 27.551 | 7.4407 | 6.9812 | 54.011 | 60.287 | 916.27 |
| parfor  | 17.057 | 8.6438 | 6.952  | 56.673 | 60.974 | 777.09 |
| zeropar | 27.484 | 6.733  | 7.2749 | 52.738 | 64.923 | 770.38 |
network
|         | ft_bench_04_networkanalysis_01 | ft_bench_04_networkanalysis_02 |
|---------|---|---|
| orig    | 3.0717  | 207.03 |
| parfor  | 9.4921  | 211.5  |
| zeropar | 0.96706 | 194.47 |
For reference, this relates to #2068 and #2069.
Interesting first results @AljenU.
One thing that jumps out at me is how 'zeropar' came out faster than 'parfor' in one of your ft_networkanalysis test cases. I'm going to look into best practices for parallelization testing here, but that smells to me of a "first-run" artifact that has more to do with just-in-time compilation by the MATLAB execution engine than with the actual "steady-state" performance.
Also: can you remind me how many workers you currently have available on your test setup?
I do want to rerun all the tests again at some point, to see if results are similar. Because it also depends on what else the computer is doing at the time. And timing differences can also come from other parts of the functions, not necessarily the analysis functions that we are interested in.
Through the preamble / postamble, fieldtrip already displays timing information per analysis function in the command window, but I have not yet looked at getting that info out. That would allow showing more of the specific effects. EDIT: I think it may be possible to use that info, by overriding ft_postamble_provenance with a custom version that stores the relevant timing info in the base workspace, and then retrieving it from the base workspace to add to the timing info.
And, these test cases do not have that much load in the specific parfor-ed functions. Maybe I could put a multiple-times loop around the shorter ones, to get longer runtimes. But even so, I am not convinced that would adequately represent the performance differences for cases where significant time is spent in these functions. Whereas those are actually the most interesting cases.
As for the computer, I'm doing the tests with PCT and a 'local' cluster, which has 4 workers.
Thanks @AljenU, a couple of follow-up questions that our engineers were immediately curious about:
- Can you give some spec on the 'local' cluster, i.e. laptop or small desktop?
- When you say you set "max workers to zero", we're not sure what you mean...was this actually 1 worker? Or if there was something you set to zero, we'd like to understand better.
On the specs of the computer I am running on:
- Intel Core i7-3770
- 32 GB of memory
On the "max workers to zero";
For example the loop in ft_timelockanalysis is specified with parfor (k = 1:nrpt, cfg.parformaxworkers)
. The zeropar runs mean that cfg.parformaxworkers == 0 in that case.
From the docs: "parfor (loopVar = initVal:endVal, M); statements; end uses M to specify the maximum number of workers from the parallel pool to use in evaluating statements in the loop body. M must be a nonnegative integer. [......] When no workers are available in the pool or M is zero, MATLAB still executes the loop body in a nondeterministic order, but not in parallel. Use this syntax to switch between parallel and serial execution when testing your code."
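To make the construct concrete, here is a minimal, self-contained sketch (not FieldTrip code) of a parfor loop with a worker cap; setting the cap to 0 runs the loop body serially in the client session, which is what the 'zeropar' runs do.

```matlab
% Minimal sketch of the quoted construct: the second parfor argument caps the
% number of workers, and a cap of 0 forces serial execution in the client.
maxworkers = 0;                      % 0 = run serially; a positive value allows parallel execution
n = 8;
result = zeros(1, n);
parfor (k = 1:n, maxworkers)
  result(k) = sum(rand(1, 1e6));     % placeholder per-iteration work
end
disp(result)
```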
The timing-information creation scripts have been updated in https://github.com/AljenU/fieldtrip-parallel-support/tree/additional_tests, in the directory Timingscripts.
The results are for
- 'orig', the state of the fieldtrip master that my branches are based on
- 'parfor', with the for loop in the analysis changed to a parfor loop, and using as many workers as available
- 'zeropar', with the for loop in the analysis changed to a parfor loop, but having temporarily set the maximum number of workers to zero: on the line that has cfg.parformaxworkers = ft_getopt(cfg, 'parformaxworkers', Inf);, change the Inf to 0.
Note that this is the tic-toc time for the full function being run. The specific analysis being parfor-ed may only take part of that time. And, timing differences with 'orig' are not necessarily just from the parfor changes. Thus, interpreting the info below also requires looking at the profiling reports.
The results so far, after running on the computer here:
>> timing_disp
timelock
|         | ft_bench_01_timelockanalysis_tut_15 | ft_bench_01_timelockanalysis_01 | ft_bench_01_timelockanalysis_01a | ft_bench_01_timelockanalysis_02 | ft_bench_01_timelockanalysis_02a |
|---------|---|---|---|---|---|
| orig    | 25.522 | 7.3034 | 6.1964 | 56.212 | 57.725 |
| parfor  | 22.844 | 6.3291 | 7.1351 | 49.071 | 56.675 |
| zeropar | 23.499 | 6.1671 | 6.6356 | 54.804 | 56.557 |
network
|         | ft_bench_04_networkanalysis_01 | ft_bench_04_networkanalysis_02 |
|---------|---|---|
| orig    | 0.41793 | 190.58 |
| parfor  | 2.5161  | 245.77 |
| zeropar | 0.55953 | 254.77 |
freq
|         | ft_bench_02_freqanalysis_01 | ft_bench_03_connectivityanalysis_01 | ft_tut_13_timefrequency_analysis | ft_tut_15_sensor_analysis | ft_check_test_freqanalysis | ft_bench_02_freqanalysis_02 |
|---------|---|---|---|---|---|---|
| orig    | 32.412 | 8.9909 | 44.64  | 42.731 | 4.709  | 6.7622 |
| zeropar | 36.629 | 12.122 | 48.868 | 49.834 | 5.2022 | 8.1828 |
| parfor  | 28.053 | 7.5658 | 40.474 | 36.913 | 5.0091 | 5.159  |
Improved and final timing information
In the branch https://github.com/AljenU/fieldtrip-parallel-support/tree/additional_tests_plus_additional_timings, adaptations to the timing-information scripts have been made, to be able to focus the timing data on only the analysis functions that have been parfor-ed. Also, timeit can now be used, which should give better timing info, especially when the call time is very short.
To quote save_input's help text: "SAVE_INPUT Save reference input to testfun, as used in runfun, for later use in a timing collection function." Note that the data gathering is not yet as efficient as it could be: in some functions the same data is used and only the processing option differs in a next call, so the same data ends up in multiple files. An even better update would be to make dedicated data-generation functions based on the runfuns, instead of saving the input data to mat files. For now, this was the quickest way to gather specific timing info.
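As an illustration of how timeit can be used for this (a hedged sketch; the mat-file and variable names are hypothetical, not necessarily those written by save_input):

```matlab
% Time only the parfor-ed analysis call, using reference input that was saved
% earlier; timeit runs the function several times and returns a robust estimate.
saved = load('ft_timelockanalysis_reference_input.mat');   % hypothetical file written by save_input
f = @() ft_timelockanalysis(saved.cfg, saved.data);
t = timeit(f);
fprintf('ft_timelockanalysis: %.3f s\n', t);
```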
The results are for
- 'orig', the state of the fieldtrip master that my branches are based on
- 'parfor_infWorkers', with the for loop in the analysis changed to a parfor loop, and using as many workers as available
- 'parfor_zeroWorkers', with the for loop in the analysis changed to a parfor loop, but having temporarily set the maximum number of workers to zero: on the line that has cfg.parformaxworkers = ft_getopt(cfg, 'parformaxworkers', Inf);, change the Inf to 0.
network
| Test case | original | parfor_infWorkers | parfor_zeroWorkers |
|---|---|---|---|
| ft_bench_04_networkanalysis_01_1 | 0.02016  | 0.10702  | 0.021489 |
| ft_bench_04_networkanalysis_01_2 | 0.03171  | 0.12377  | 0.038955 |
| ft_bench_04_networkanalysis_02_1 | 2.1438   | 4.2516   | 2.4925   |
| ft_bench_04_networkanalysis_02_2 | 0.025568 | 0.060879 | 0.025597 |
| ft_bench_04_networkanalysis_02_3 | 2.3452   | 4.3554   | 2.3503   |
| ft_bench_04_networkanalysis_02_4 | 2.0348   | 4.3565   | 2.3257   |
| ft_bench_04_networkanalysis_02_5 | 0.02367  | 0.059325 | 0.032633 |
| ft_bench_04_networkanalysis_02_6 | 0.023339 | 0.057511 | 0.028371 |
timelock
| Test case | original | parfor_infWorkers | parfor_zeroWorkers |
|---|---|---|---|
| ft_bench_01_timelockanalysis_01_1     | 0.14098 | 0.16314 | 0.14971 |
| ft_bench_01_timelockanalysis_01_2     | 0.13984 | 0.14627 | 0.14113 |
| ft_bench_01_timelockanalysis_01a_1    | 0.34035 | 0.44899 | 0.37266 |
| ft_bench_01_timelockanalysis_01a_2    | 0.31996 | 0.40028 | 0.34832 |
| ft_bench_01_timelockanalysis_02_1     | 3.0244  | 2.9337  | 2.9241  |
| ft_bench_01_timelockanalysis_02_2     | 0.70603 | 0.67824 | 0.67443 |
| ft_bench_01_timelockanalysis_02a_1    | 9.4952  | 7.8172  | 9.7048  |
| ft_bench_01_timelockanalysis_02a_2    | 1.7309  | 1.9167  | 1.83    |
| ft_bench_01_timelockanalysis_tut_15_1 | 7.0511  | 5.9916  | 7.2398  |
| ft_bench_01_timelockanalysis_tut_15_2 | 13.138  | 11.399  | 13.879  |
freq
| Test case | original | parfor_infWorkers | parfor_zeroWorkers |
|---|---|---|---|
| ft_bench_02_freqanalysis_01_1         | 5.1144  | 2.9158  | 5.7823  |
| ft_bench_02_freqanalysis_01_2         | 1.247   | 0.44718 | 0.71857 |
| ft_bench_02_freqanalysis_01_3         | 11.364  | 6.9759  | 12.078  |
| ft_bench_02_freqanalysis_01_4         | 3.7793  | 2.8125  | 4.9761  |
| ft_bench_02_freqanalysis_02_1         | 5.1869  | 2.7334  | 6.2094  |
| ft_bench_03_connectivityanalysis_01_1 | 5.0026  | 2.6976  | 5.0723  |
| ft_check_test_freqanalysis_1          | 0.12545 | 0.17587 | 0.13835 |
| ft_check_test_freqanalysis_2          | 0.15652 | 0.1707  | 0.18098 |
| ft_check_test_freqanalysis_3          | 0.20071 | 0.18581 | 0.23457 |
| ft_check_test_freqanalysis_4          | 0.17596 | 0.1781  | 0.19535 |
| ft_tut_13_timefrequency_analysis_1    | 4.3246  | 2.9001  | 5.0213  |
| ft_tut_13_timefrequency_analysis_2    | 1.1425  | 0.46207 | 0.71292 |
| ft_tut_13_timefrequency_analysis_3    | 11.551  | 7.1077  | 12.301  |
| ft_tut_13_timefrequency_analysis_4    | 5.3873  | 2.9557  | 6.5254  |
| ft_tut_15_sensor_analysis_1           | 19.352  | 12.697  | 20.7    |
| ft_tut_15_sensor_analysis_2           | 5.2238  | 2.8335  | 5.5849  |
What was changed
Parallel constructs were added in ft_timelockanalysis, ft_networkanalysis and ft_freqanalysis, as recorded in this branch. An earlier effort had identified these three functions as ones that might benefit from adding parallel constructs, so they were adapted in order to evaluate the application of parallel constructs in high-level FieldTrip analysis functions. In particular, these functions have a for loop as part of their analysis, and changing it to a parfor loop should be a straightforward way to integrate parallelization.
How it was changed
In each of the three functions, the for loop was replaced with a parfor loop. Using a parfor loop introduces constraints on how data can be used. The consequences are detailed here, with reference to example code changes; the examples refer to the implementation on the 'parallel' branch of the fieldtrip GitHub repository.
Using a parfor loop requires that the interpreter can determine that each loop iteration can be executed independently of the other iterations. There are therefore restrictions on how data is used inside the loop. These are well documented in the Matlab documentation. They mainly affect variables that are both written to inside the loop and also used outside the loop.
- If data is assigned to a subset of a variable, it must be done in exactly the same way in each iteration of the loop. In current fieldtrip implementations, a switch-case or if-else construct is often used to select different data assignments, also within loops. However, depending on the exact use case, it is possible to refactor the fieldtrip implementations to have identical assignments in each branch within the loop, and still have a different data layout and usage for further processing. The changes with regard to the variable output in ft_networkanalysis show an example of how this can be done without increasing the memory footprint of the variable: extra dimensions of size 1 are added to the data, and a (cheap) reshape after the loop removes those extra dimensions. A simplified sketch of this pattern follows this list.
- A variable cannot be both a reduction variable and a subset-assigned variable within the loop. This is a variant of the restriction that assignments need to be similar in each loop iteration. This situation is present in ft_timelockanalysis for the variable covsig, and in this case there were two ways to deal with it. The first is to use only the subset way of assigning to the variable, and do the reduction after the loop. However, this would increase the memory footprint, since all intermediate results would be kept until the loop is finished, and the reduction appears to be present exactly to reduce the memory footprint. The second way is to reverse the order of entering-the-loop and switching-for-different-data-assignment; then it is possible to keep the memory footprint of the variable the same. The implementation in ft_timelockanalysis shows an example of this: each branch of the if keeptrials switch has its own parfor loop (instead of a single loop with an if statement inside it). A simplified sketch of this pattern is also given after this list.
- The limits of for loops nested in the parfor loop must be specific: "You must define the range of a for-loop nested in a parfor-loop by constant numbers or broadcast variables." The change in ft_networkanalysis for the upper limit of the inner loop shows an example of how this can be done.
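The following is a simplified, self-contained sketch of the first pattern. It is not the actual ft_networkanalysis code; the option names, metrics and sizes are illustrative only. The point is that the sliced assignment to output has the same form in every branch and every iteration, with the helper singleton dimension removed after the loop.

```matlab
nrpt   = 10;
nchan  = 5;
metric = 'degrees';                    % illustrative analysis option, fixed before the loop
% allocate the output with a trailing singleton dimension so that the sliced
% assignment inside the loop has the same form for every metric
switch metric
  case 'degrees'
    outsiz = [nrpt, nchan, 1];         % one value per node
  case 'density'
    outsiz = [nrpt, 1, 1];             % one value per network
end
output = zeros(outsiz);
parfor k = 1:nrpt
  dat = rand(nchan, nchan) > 0.5;      % placeholder connectivity matrix for repetition k
  switch metric
    case 'degrees'
      tmp = sum(dat, 2);
    case 'density'
      tmp = nnz(dat) / numel(dat);
  end
  output(k, :, :) = reshape(tmp, [1, outsiz(2), outsiz(3)]);  % identical assignment in both branches
end
output = reshape(output, outsiz(1:2)); % cheap reshape removes the helper singleton dimension
```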
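And a simplified sketch of the second pattern (again not the actual ft_timelockanalysis code; the variable names and the covariance computation are only placeholders). The keeptrials branch is moved outside the loop, so covsig is a sliced variable in one parfor loop and a reduction variable in the other, but never both in the same loop.

```matlab
nrpt  = 10;
nchan = 5;
nsmp  = 200;
keeptrials = false;                    % illustrative analysis option
if keeptrials
  covsig = zeros(nrpt, nchan, nchan);
  parfor k = 1:nrpt
    dat = rand(nchan, nsmp);           % placeholder data for trial k
    covsig(k, :, :) = reshape(dat * dat', [1, nchan, nchan]);  % sliced: one covariance per trial
  end
else
  covsig = zeros(nchan, nchan);
  parfor k = 1:nrpt
    dat = rand(nchan, nsmp);
    covsig = covsig + dat * dat';      % reduction: running sum keeps the memory footprint small
  end
end
```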
For more efficient memory use, there are additional recommendations, also available in the Matlab documentation. These concern variables that are used, but not changed, inside the loop. If only a subset of the data in a variable is needed in an iteration, the interpreter should be able to determine that indeed only that subset is needed on the parallel worker.
- Use only the needed subset of struct type variables. The full struct will be sent to each parallel worker when, within the loop, any of its fields are referenced through dot notation on the struct variable. If only a limited number of fields is needed within the loop, it is possible to limit how much of a struct type variable is sent to a worker, by adding a variable that only holds the value of the needed field and using that new variable inside the loop. The use of the variable datacov_trial in ft_timelockanalysis is an example. Note that, due to Matlab's copy-on-write, the value will NOT be duplicated in the main thread.
- Use the way of indexing recommended in the Matlab documentation for variables where only a subset of the data is used in a particular iteration. If the parfor loop index is used for the subset identification, only the subset used in that iteration will be sent to the parallel worker. The use of data_trial in ft_freqanalysis is an example. A combined sketch of both recommendations follows this list.
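A hedged, self-contained sketch of both recommendations follows; the variable names (trials, data_trial, avg) and the computation are illustrative, not the exact FieldTrip code. Copying the needed field into a plain variable before the loop prevents the whole struct from being broadcast, and indexing that variable with the loop variable lets MATLAB send each worker only its own slice.

```matlab
nrpt = 10; nchan = 5; nsmp = 200;
data       = struct();
data.label = cell(nchan, 1);             % field not needed inside the loop
data.trial = rand(nrpt, nchan, nsmp);    % the only field needed inside the loop
trials = data.trial;                     % copy-on-write: no duplication in the client
avg = zeros(nrpt, nchan);
parfor k = 1:nrpt
  data_trial = reshape(trials(k, :, :), [nchan, nsmp]);  % sliced input: only trial k goes to the worker
  avg(k, :) = mean(data_trial, 2)';                      % sliced output
end
```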
To give the user control over the amount of parallelization, including not running the loops in parallel at all, an extra input option was added to the three functions. A cfg option was introduced to control the number of workers used: cfg.parformaxworkers = ft_getopt(cfg, 'parformaxworkers', Inf);. When cfg.parformaxworkers is set to 0, the parfor loops will not use parallel computing. The default setting of Inf will use all available workers in the running parallel pool.
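For example, a user of the adapted branch could switch the parallelization per call (assuming the proposed cfg.parformaxworkers option and some already-loaded data; the other cfg settings are only illustrative):

```matlab
cfg = [];
cfg.covariance       = 'yes';
cfg.parformaxworkers = 0;      % run the loop serially, e.g. on a memory-limited machine
timelock_serial = ft_timelockanalysis(cfg, data);

cfg.parformaxworkers = Inf;    % default: use all workers in the open parallel pool
timelock_parallel = ft_timelockanalysis(cfg, data);
```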
Different amounts of adaptation were needed
In the previous section, the examples of what was changed referred mainly to ft_timelockanalysis and ft_networkanalysis. This was done because the number of changes between the parallel branch and its parent is very limited for these functions, so it is easier to see the individual restrictions and recommendations being applied.
In ft_freqanalysis, many more changes were made. The function had a for loop that assigns to multiple variables, and does the assignment differently for each of the variables, using the common fieldtrip pattern of having logical branching within the loop. The git commit history of ft_freqanalysis shows the various steps that were taken to finally conform with the restrictions for using parfor. Together they show the effort needed to limit the memory footprint of variables while implementing parallel constructs inside a fieldtrip function that uses multiple non-uniform data assignments.
A pattern that needed to be replaced specifically in ft_freqanalysis was the allocation of memory for loop-output variables within the first iteration of the loop. To be able to use a parfor loop, the memory for the loop-output variables needs to be allocated before the loop is started; only then can the first restriction mentioned earlier be adhered to. In order to do so without extensive code duplication, the memory-allocation part of the loop, and the before-the-memory-allocation part of the loop, were each put into their own subfunction. That way, the code duplication is limited to a single line: the call to the new before-the-memory-allocation subfunction, which is made once before the loop and once in each iteration of the loop.
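Schematically, the refactoring looks roughly like the following (a sketch only; the function and variable names are hypothetical and the per-trial computation is a placeholder, not the actual ft_freqanalysis code):

```matlab
function spectrum = run_analysis(dat, foi)
  % dat: nrpt x nsmp matrix of trials, foi: frequencies of interest
  nrpt = size(dat, 1);
  % one extra call before the loop, only to learn the size of the per-trial
  % result, so the loop output can be allocated before the parfor loop starts
  tmp = analyze_one_trial(dat(1, :), foi);
  spectrum = zeros(nrpt, numel(tmp));
  parfor k = 1:nrpt
    spectrum(k, :) = analyze_one_trial(dat(k, :), foi);   % sliced output
  end
end

function out = analyze_one_trial(x, foi)
  % placeholder for the "before the memory allocation" part of the original loop body
  out = abs(fft(x));
  out = out(1:numel(foi));
end
```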
Measured effects on runtime
The effects of the changes on runtime are different for the three functions, and (as expected) also depend on the specific input to a function. The effects were measured on a computer with an Intel Core i7-3770 and 32 GB of memory. The detailed timing results are available here. The results compare three cases: the original code, the parfor-ed code version when run in parallel, and the parfor-ed code version when run with parallelization disabled.
- For ft_timelockanalysis, changing the main loop from for to parfor has only small positive effects, if any, with the tested inputs. The small positive effects (a runtime decrease of up to about 15%) occur mainly for the longer runs, and when the loop is indeed run in parallel. This suggests that the change might be beneficial in actual workflows, where even more or larger data is processed.
- For ft_networkanalysis, changing the main loop from for to parfor has only negative effects, if any, with the tested inputs. The negative effects occur when running in parallel, where the runtime can be double the original. When parallelization is disabled, the parfor-ed code version has the same runtime as the original.
- For ft_freqanalysis, changing the main loop from for to parfor has generally positive effects. When run in parallel, the runtime is almost halved in quite a few cases, and reduced in nearly all. The only cases where the runtime does not decrease are the ones that have a very short runtime to start with. When parallelization is disabled, the parfor-ed code version has approximately the same runtime as the original, though some slight increases in runtime are observed.
Side-effects
Running an analysis in parallel will increase the total amount of memory needed. Even though the implementations take the per-variable memory footprint into account, dividing the calculations over multiple workers will increase the total memory footprint, because functionality and data needed in each iteration, including from Matlab itself, are duplicated on each worker. Thus, if an analysis is already memory-limited when run sequentially, running it in parallel on the same computer can cause memory overflow and actually greatly increase the runtime of the calculation.
Using a parfor loop where each iteration does very little work can increase the runtime. The overhead of sending data to and receiving data from the parallel workers will in that case take more time than is saved by dividing the total number of calculations over multiple workers.
Conclusions
Changing for loops to parfor loops can reduce the runtime of certain high-level analyses, see for example the measured effects on ft_freqanalysis. In particular when an analysis takes more than a few seconds to run, and extra computer memory is available, enabling parallelization can speed up the workflow.
Whether a specific user will see a speedup for their workflow, when parfors are introduced, will at least depend on the specifications of the computer the analysis is run on. The runtime might improve significantly, it could be similar, but it can also be much worse due to memory overflow. Thus, if parfors are introduced, it might be prudent to make the default number of workers equal to 0, and require the user to explicitly enable parallelization through the cfg option.
Whether a specific user will see a speedup for their workflow will also depend on the analyses that are part of the workflow. Certain analyses are more likely to benefit from parallelization than others. Therefore, deciding which analyses should have parallelization introduced should take into account the typical workflows, including typical data sizes, in which such an analysis appears.
Looking at the way many high-level analysis functions are currently structured, introducing parallelization requires significant code changes. High-level analysis functions can have many lines of code per function and extensive logical branching within the main calculation part of the function, including within loops. To avoid code duplication and/or to adhere to the parfor restrictions, such functions need significant changes to their structure. Most of these structural changes are also recommendable from a general software development perspective, but they will need to be weighed against other aims the current structures may have.
@AljenU thanks for the very clear report. I will add this to the documentation of the parallel branch.
@vijayiyer05 and @AljenU I have documented it in the PARALLEL.md file that is on the specific branch.