
Sample chronologically by default

Open jsosulski opened this issue 4 years ago • 4 comments

Currently, moabb stratifies the data and assigns random samples from X to training (T) and validation (V). A simple assignment vector could look like this:

[T,T,T,V,T,V,V,T,T,V,T,V,T,T]

Especially in ERP datasets, where successive epochs overlap, this way of sampling leaks statistical information (e.g., signal mean and signal covariance) between train and test epochs. There are two ways to mitigate this issue:

  1. Do the train/validation assignment not on individual labels but at the run level, e.g., on a sequence of N flashes from the underlying P300 speller experiment.
  2. If run information is not available, we could either infer it from marker timings or simply use contiguous folds, e.g.:
[T,T,T,T,T,T,T,T,T,V,V,V,V,V]

Then train epochs could leak into validation epochs only at the single boundary between the train and validation blocks.

Doing this will probably reduce the average performance achieved with all classifier types, though less so for classifiers with better generalization properties.
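
As a rough illustration of option 1 above, here is a minimal sketch (not MOABB code) of run-level splitting with scikit-learn's GroupKFold; the `runs` array that tags each epoch with its stimulation sequence is a hypothetical stand-in for whatever run information a dataset actually provides:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_epochs = 14
X = rng.standard_normal((n_epochs, 8, 128))  # toy epochs: epochs x channels x samples
y = rng.integers(0, 2, size=n_epochs)        # toy target / non-target labels
runs = np.repeat([0, 1, 2, 3, 4, 5, 6], 2)   # hypothetical run id per epoch

gkf = GroupKFold(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=runs)):
    # All epochs from one run stay on the same side of the split, so
    # overlapping neighbours cannot leak between train and validation.
    print(fold, "train runs:", np.unique(runs[train_idx]),
          "val runs:", np.unique(runs[val_idx]))
```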

jsosulski avatar May 05 '21 11:05 jsosulski

Yes, this is a step toward evaluations that are closer to real BCI use, i.e., an online setting. It could be an issue for within-session evaluation, but there is no problem for cross-session/cross-subject evaluation. Within-session evaluation relies on 5-fold cross-validation; are you suggesting switching to a single train-test split or to something like a group k-fold?

sylvchev avatar May 05 '21 14:05 sylvchev

GroupKFold would be ideal when we have a meaningful segmentation between parts of the experiment, e.g., one stimulation sequence used to detect a new letter. However, in the absence of this information for each dataset - and if I understand StratifiedKFold correctly - we could also just set shuffle=False in the cross-validation call?
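
A toy sketch of what that would look like, assuming the epochs are stored in chronological order: with shuffle=False, StratifiedKFold carves each class into contiguous chunks in acquisition order, so for regularly interleaved ERP labels the validation fold comes out as a contiguous block, much like the assignment vector above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0, 1] * 7)       # 14 toy epoch labels in chronological order
X = np.zeros((len(y), 1))      # placeholder features, only the indices matter here

for shuffle in (True, False):
    skf = StratifiedKFold(n_splits=5, shuffle=shuffle,
                          random_state=42 if shuffle else None)
    train_idx, val_idx = next(skf.split(X, y))  # inspect the first fold only
    assignment = np.array(["T"] * len(y))
    assignment[val_idx] = "V"
    print("shuffle =", shuffle, "->", "".join(assignment))
```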

jsosulski avatar May 05 '21 14:05 jsosulski

I just found out about TimeSeriesSplit, which sounds interesting, as it both allows plotting a kind of learning curve and reflects an online BCI setup. See this plot from the sklearn docs:

[Figure: TimeSeriesSplit cross-validation scheme from the scikit-learn documentation]
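
To make the behaviour concrete, here is a minimal toy sketch (not MOABB code) that prints the per-fold assignment vectors TimeSeriesSplit produces on 14 chronologically ordered epochs; each fold trains on a growing prefix and validates on the block that follows it:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_epochs = 14
X = np.zeros((n_epochs, 1))    # placeholder features, only the indices matter

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    assignment = np.array(["."] * n_epochs)  # "." = epoch not used in this fold
    assignment[train_idx] = "T"
    assignment[val_idx] = "V"
    print("fold", fold, "".join(assignment))
```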

jsosulski avatar Jul 08 '21 10:07 jsosulski