
Adding robustica option to ICA decomposition to achieve consistent results

Open BahmanTahayori opened this pull request 1 year ago • 10 comments

Because the FastICA method depends on its seed value, the generated output can in some cases be substantially different from run to run. To overcome this issue, we added robustica (https://github.com/CRG-CNAG/robustica/tree/main/robustica), a Python library based on Icasso, as an option for the ICA decomposition. robustica clusters the components from many iterations of FastICA to achieve consistent results.
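For reference, the core robustica usage looks roughly like the sketch below (based on the robustica README; the exact keyword controlling the number of FastICA runs is an assumption and may differ from how it is wired into tedana):

```python
import numpy as np
from robustica import RobustICA

X = np.random.rand(1000, 160)  # stand-in for a voxels x time data matrix

# RobustICA runs FastICA many times and clusters the resulting components.
# `robust_runs` is assumed to be the keyword controlling how many runs are clustered.
rica = RobustICA(n_components=50, robust_runs=30)
S, A = rica.fit_transform(X)  # S: source components, A: mixing matrix
```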

Changes proposed in this pull request:

  • ica_method is added as an option (robustica or fastica) so the user can select between the two methods. When robustica is used, the number of runs can be set by the user through n_robust_runs. The default is `robustica` with 30 robust runs; this default was selected based on our experience using robustica on two different datasets with over 250 subjects' data. (A usage sketch follows this list.)

  • The DBSCAN clustering algorithm has been implemented in a hard-coded manner.
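
For example, assuming the new options are exposed as keyword arguments on tedana_workflow under the names described above (the file names and echo times below are placeholders), a call might look like:

```python
from tedana.workflows import tedana_workflow

tedana_workflow(
    data=["echo-1_bold.nii.gz", "echo-2_bold.nii.gz", "echo-3_bold.nii.gz"],  # placeholder inputs
    tes=[14.5, 38.5, 62.5],   # placeholder echo times (ms)
    out_dir="tedana_robustica",
    ica_method="robustica",   # new option in this PR; "fastica" keeps the previous behaviour
    n_robust_runs=30,         # new option in this PR; number of FastICA runs to cluster
)
```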

BahmanTahayori avatar Dec 20 '23 01:12 BahmanTahayori

I'm really excited to see this @BahmanTahayori. I'm going to start looking at it now. I asked on https://mattermost.brainhack.org/brainhack/channels/tedana and no one was definitely planning to attend our tedana dev call this Thursday/Friday so I cancelled that meeting. If you were getting this ready to discuss at that meeting, I'll either recreate the meeting, even if it's just the two of us, or we can schedule a meeting at another time that's slightly more convenient in Australia.

Also, some of your style issues are due to us recently allowing Python 3.12, which brings a few more stringent style checks. Addressing the following might get the style check passing:

tedana/decomposition/ica.py:143:5: E722 do not use bare 'except'
tedana/decomposition/pca.py:397:71: E226 missing whitespace around arithmetic operator
tedana/decomposition/pca.py:397:88: E231 missing whitespace after ','
tedana/selection/selection_nodes.py:324:38: E231 missing whitespace after ','
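
For the E722 one in particular, flake8 just wants an exception class named instead of a bare except; a minimal illustration (not the actual ica.py code, the function names are placeholders):

```python
# Discouraged (E722): a bare except also swallows KeyboardInterrupt and SystemExit
try:
    run_decomposition()
except:
    fall_back()

# Preferred: catch a named exception class (ideally something more specific than Exception)
try:
    run_decomposition()
except Exception:
    fall_back()
```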

handwerkerd avatar Dec 20 '23 15:12 handwerkerd

@BahmanTahayori and I talked yesterday. In addition to going through the above review comments, I wanted to highlight one other part of our discussion to other developers (@tsalo @eurunuela @dowdlelt @n-reddy @javiergcas @goodalse2019 @NaomiGaggi @marco7877 @CesarCaballeroGaudes ). Robust ICA seems to always output fewer components than initially requested. This is reasonable. If you have 100 time points and request 90 components, it should not be able to find 90 stable components. This is both a problem and a possible solution for our current issues with PCA dimensionality estimation.

On the problem side, even if we get a perfect dimensionality estimate to input into RobustICA, RobustICA will still output fewer ICA components.

On the solution side, I don't really care about how many components are estimated in the PCA step; I care that our ICA components reliably capture as much of the total structured variance of the data as possible. This means we can potentially give RobustICA an initial number of components that is deliberately too large, such as min(1/3 of the time points, the number of components explaining 95% of total variance), and RobustICA will settle on a smaller, plausible number.
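
In code, that over-estimate could be computed along these lines (a sketch using scikit-learn; initial_component_guess is a hypothetical helper, not anything currently in tedana):

```python
import numpy as np
from sklearn.decomposition import PCA


def initial_component_guess(data, n_timepoints):
    """Deliberately generous starting point for RobustICA:
    min(1/3 of the time points, PCA components needed to explain 95% of variance)."""
    pca = PCA().fit(data)
    cum_var = np.cumsum(pca.explained_variance_ratio_)
    n_95 = int(np.searchsorted(cum_var, 0.95)) + 1  # components needed to reach 95% variance
    return min(n_timepoints // 3, n_95)
```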

I've been testing this out on a few datasets and it's promising, but not yet perfect. For example, I tested it on a dataset with 160 time points, with RobustICA set to run FastICA for 30 iterations:

Initial components | Calculated stable ICA components | Index quality
40                 | 37                               | 0.9483
45                 | 42                               | 0.9519
50                 | 38                               | 0.9464
55                 | 43                               | 0.9501
60                 | 46                               | 0.9456
65                 | 45                               | 0.9442
70                 | 49                               | 0.9448
75                 | 48                               | 0.9300
94                 | 49                               | 0.9287

Ideally, the calculated number of stable components would be the same whether we initialize to 65 or 55. That's not happening, but it is reasonably stable. One other thing that's noteworthy: I'm running this with a block-design flashing checkerboard task, which almost always results in a high-variance component with weights in primary visual cortex and a time series that follows the block-design task. I am NOT seeing that when I initialized with 94 components, and those final results were qualitatively noisier. This might be a sign that if one overestimates the initial number of components too much, the method breaks down. What might be happening is that the algorithm tries to successfully run FastICA 30 times, and when a run fails to converge it tries again. For 94 components, it failed 20 times (running FastICA a total of 50 times). For 70 components, it failed to converge only 8 times. In addition to being computationally expensive, this means trying to get ICA to converge with far too many components produces stranger results when it does converge.
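
For anyone who wants to check this on their own data, scikit-learn's FastICA signals non-convergence with a ConvergenceWarning, so the failures can be counted with something like this sketch (random data as a stand-in):

```python
import warnings

import numpy as np
from sklearn.decomposition import FastICA
from sklearn.exceptions import ConvergenceWarning

X = np.random.rand(1000, 160)  # stand-in for a voxels x time data matrix
failures = 0
for seed in range(30):
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always", ConvergenceWarning)
        FastICA(n_components=70, random_state=seed, max_iter=500).fit(X)
        failures += any(issubclass(w.category, ConvergenceWarning) for w in caught)
print(f"{failures}/30 FastICA runs did not converge")
```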

I'm going to keep playing with this and encourage others to do so as well. I'm also playing with more iterations of FastICA (the new n_robust_runs option in this PR) to see if that results in a more stable number of components across a wider range of inputs. This should definitely be a discussion at our January developers call. If anyone wants to actively look into this between now and then I'll find time to talk more and help you get started.

For reference, the manuscript describing this robust ICA approach is https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05043-9. In parallel, I asked @jefferykclam to work on the dimensionality estimation issue, and he was converging on a similar approach based on this publication: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0094943

handwerkerd avatar Dec 22 '23 16:12 handwerkerd

Just wanted to jump in and officially register my excitement for this - I've checked this out and am currently running it on some troublesome data from Neurostars. I think it will be enlightening.

I am also seeing a lot of failures to converge (~550 timepoints, 180 requested, ~65% variance explained), and the phrase "exponential back-off" crossed my mind. It's not quite applicable here, but the central idea is that if something fails, you don't just keep trying the same thing (back-off = more waiting). Here, at the risk of adding another layer of complexity (ugh), perhaps an option should be that too many failures (?) to converge get logged and the entire loop is restarted with N fewer components? Not exponential, but maybe based on the number of timepoints. If someone shows up with 1500 timepoints and starts with 500, it might not be nice to step through 499, 498, etc. But if they have 200 and start with 66, it might be smart to do 65, 64, etc.
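
A rough sketch of what that kind of back-off could look like (entirely hypothetical, not part of this PR; run_ica stands in for whatever the implementation exposes and is assumed to return the components plus a count of convergence failures):

```python
def robust_ica_with_backoff(run_ica, data, n_components, n_timepoints, max_failures=10):
    """Hypothetical wrapper: if too many FastICA runs fail to converge,
    restart the whole loop with fewer requested components."""
    # Scale the step to the data length so long datasets don't crawl down one component at a time.
    step = max(1, n_timepoints // 100)
    while n_components > 1:
        components, n_failures = run_ica(data, n_components)
        if n_failures <= max_failures:
            return components
        n_components -= step
    raise RuntimeError("No requested component count converged reliably")
```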

In any case, I'm looking forward to seeing where this goes.

dowdlelt avatar Feb 20 '24 21:02 dowdlelt

One more quick comment - perhaps it is the data I'm looking through, but I'm seeing a bizarre effect - and I feel like it would have been noticed by now if it were general.

I have components that have vastly different timecourses, and yet the maps are just variously scaled versions of each other - practically pixel-identical, except for magnitude. I know it won't be easy to tell from this image, but there are about... I don't know, 70/100 components like this - effectively the same spatial map, only the scale changes. [image]

This is neurostars data, so perhaps some violence was done to it, but I can't make sense of it. I'll try and poke around to see if I can learn something.
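
One cheap way to quantify the effect, assuming the component maps are stacked as columns of a voxels x components array, is to look at the pairwise spatial correlations:

```python
import numpy as np


def near_duplicate_maps(maps, threshold=0.99):
    """Return pairs of components whose spatial maps are (anti)correlated above `threshold`.
    `maps` is assumed to be a voxels x components array."""
    corr = np.corrcoef(maps.T)          # components x components correlation matrix
    upper = np.triu(np.abs(corr), k=1)  # keep each pair once, drop self-correlations
    return np.argwhere(upper > threshold)
```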

dowdlelt avatar Feb 21 '24 02:02 dowdlelt