
Workflows for outlier removal

Open shanelindsay opened this issue 2 years ago • 8 comments

I am working with respiration data right now, but I think this is a more general question.

I am unclear on the workflows for outlier detection and removal.

One way to remove outliers (e.g. movement artefacts with spikes in the data far outside the normal range) is manual inspection, followed by either removal of periods of bad data or adjustment of events (e.g. removing/changing detected peaks). Some tools use a GUI to facilitate this (e.g. see EEGLAB), which I imagine would be on a "nice to have" list for future NeuroKit versions.

Alternatively, you could use code for automatic outlier detection/removal.

Either way, I am a bit unclear on what is expected in the NeuroKit way of doing things. Arrays aren't indexed by time (instead by sample), so removal of periods of bad data would mess up timing. Setting bad data to NA would also seem likely to mess up a lot of algorithms and produce undesired results (it also gets tricky with intervals/epochs).
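To illustrate the timing point: one way to keep timing intact is to convert bad time windows to sample indices and mask them with NaNs rather than deleting samples. This is a plain NumPy sketch with made-up example values, not a NeuroKit feature:

```python
import numpy as np

# Sketch: mark bad time windows as NaN instead of deleting samples,
# so the sample-to-time mapping survives.
sampling_rate = 100  # Hz (made-up example values)
signal = np.random.normal(size=60 * sampling_rate)  # 60 s of fake data
bad_windows = [(5.0, 7.5), (30.0, 31.0)]  # (start, end) in seconds

masked = signal.astype(float).copy()
for start, end in bad_windows:
    # The arrays are sample-indexed, so convert times to sample indices
    masked[int(start * sampling_rate):int(end * sampling_rate)] = np.nan

# Length (and therefore timing) is preserved, unlike outright removal
assert len(masked) == len(signal)
```

The downside, as noted above, is that most downstream algorithms don't expect NaNs in the signal.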

Other tools have more complicated data structures that incorporate removed data, e.g. EEGLAB: "EEGLAB adjust the EEG.event structure fields and insert boundary events where data has been rejected, with a duration field holding the duration of the data portion that was rejected".

But for NeuroKit, I was curious about the recommended current workflows for dealing with outliers, and about any future plans.

shanelindsay avatar Sep 08 '21 13:09 shanelindsay

Thanks again Shane for sharing some ideas. I think there are essentially two different types of "bad data":

One would be "transient artefacts", resulting in sudden spikes that could be mistakenly detected as peaks. This would primarily be addressed by filtering, peak fixing, etc. It's true that GUIs can be helpful in this case to manually select/deselect artifacts, but manual correction has not really been the direction of the package. I'm not exactly sure how to give users more control to manually adjust the detected peaks...

The second class of bad data would indeed be bad periods of data, i.e., stretches where the signal is simply too noisy for some time due to disconnection, movement, or whatnot. For this we don't really have any tools in NK, and I can foresee that it's quite a hassle to deal with, as you mentioned: if we turn all the values in those periods to NaNs, then we would need to run the processing on each period of good data separately, which then leads to other issues. I think so far we've swept this issue under the rug, but it is true that periods of bad data are fairly common. The list of things to clarify API-wise would be:

  • How to detect these periods of bad data? Automatically (periods of change)? Manually (GUI, input list of tuples of beginning-end times / samples)?
  • At what point do users have the option to input these values? Signal cleaning?
  • How to deal with them? Turn to NaNs? Super strong filtering?
  • Need to check which functions work with / would need adjustments for the presence of periods of NaNs or bad data. We could start with a simpler signal like RSP.
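As a rough sketch of what "running the processing on each period of good data separately" could look like (the helper below is hypothetical, not part of NeuroKit): a NaN-marked signal can be split into contiguous good segments that keep their start indices, so timing can be recovered afterwards.

```python
import numpy as np

def split_good_segments(signal):
    """Split a signal containing NaN-marked bad periods into contiguous
    good segments. Each segment is returned with its start index so the
    original timing can be recovered. Hypothetical helper, not NeuroKit API."""
    good = ~np.isnan(signal)
    # Indices where the good/bad status flips
    edges = np.flatnonzero(np.diff(good.astype(int)))
    starts = np.r_[0, edges + 1]
    ends = np.r_[edges + 1, len(signal)]
    return [(s, signal[s:e]) for s, e in zip(starts, ends) if good[s]]

sig = np.array([1.0, 2.0, np.nan, np.nan, 5.0, 6.0, 7.0])
segments = split_good_segments(sig)
# Two good segments, starting at sample indices 0 and 4
```

Each segment could then be processed independently, though that raises the downstream questions mentioned above (e.g. how to aggregate features across segments).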

DominiqueMakowski avatar Sep 09 '21 00:09 DominiqueMakowski

This issue has been automatically marked as inactive because it has not had recent activity. It will eventually be closed if no further activity occurs.

stale[bot] avatar Mar 09 '22 00:03 stale[bot]

I've been working on adapting the HRV feature functions for bad periods of data:

  • To access the timestamps of the interbeat intervals, I modified the HRV functions to accept both interbeat intervals and their times (rather than using intervals_to_peaks), since the recordings I was using were manually cleaned using a different software (rather than within the hrv functions)
  • For the frequency-domain features, I always interpolated over the missing times, since the interbeat intervals are interpolated in this function anyway. But for some of the time-domain and non-linear features, interpolation at a fixed value did not seem to work well when compared to the original feature extracted from the same clean data without missing values and without interpolation.
  • For features that rely on successive differences between intervals, missing intervals result in "non-successive" differences between the intervals directly before and after the missing data. Removing these non-successive differences from the Poincaré plot axes seemed to work well for certain features (e.g. SD1a) but not for others (e.g. GI), presumably since asymmetry indices are more sensitive to the removal of accelerations vs. decelerations, while short-term variability indices are more sensitive to inflated successive-difference values... but I don't have any literature on this. In case it's helpful, here's a notebook where I compare the effect of different methods of treating missing data on these different HRV features: https://github.com/danibene/NeuroKit/blob/feature/adjust_for_missing_intervals/docs/examples/outliers_rri.ipynb
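The "successive differences only" idea can be sketched for RMSSD as follows (intervals and timestamps in seconds; the function name and tolerance are made up, and this is not the implementation in my branch):

```python
import numpy as np

def rmssd_successive_only(rri, rri_time, tol=0.05):
    """RMSSD computed only from successive interbeat intervals.

    A difference is kept only if the second interval's end timestamp
    matches the first interval's end timestamp plus that interval
    (within `tol` seconds), i.e. no beats are missing in between.
    Hypothetical sketch; values in seconds."""
    rri = np.asarray(rri, dtype=float)
    rri_time = np.asarray(rri_time, dtype=float)  # time of each interval's end
    successive = np.abs(rri_time[1:] - (rri_time[:-1] + rri[1:])) < tol
    diffs = np.diff(rri)[successive]
    return np.sqrt(np.mean(diffs**2))
```

With a gap in the recording, the inflated difference spanning the gap is simply excluded rather than contributing a spuriously large successive difference.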

If anyone has suggestions for literature or things to try, they would be very appreciated :-)

danibene avatar Apr 13 '22 10:04 danibene

Missed this too, what's the status here?

DominiqueMakowski avatar Sep 21 '22 00:09 DominiqueMakowski

@DominiqueMakowski

The changes I made in my branch allow for input with missing RRIs: https://github.com/danibene/NeuroKit/tree/feature/adjust_for_missing_intervals The issue is that I am unsure whether I dealt with missing data in the most ideal way for all the features, in particular those in hrv_nonlinear(). The other adjustments, to hrv_time() and hrv_frequency(), that I am more confident in were based on just what made the most sense to me rather than on the literature (e.g. only using successive interbeat intervals to calculate RMSSD & pNN50). So my questions for you would be:

  1. Are you interested in including the adjustments to only hrv_time() and hrv_frequency()?

If so, we could have an optional argument, e.g. adjust_for_missing, with default None but with the option to set it to e.g. "neurokit"... Then there would be a warning that the "neurokit" method of adjust_for_missing has not been implemented for hrv_nonlinear().

  2. Do you want these adjustments to be empirically backed?

I was thinking that one way to evaluate these methods could be by comparing the correlation between features extracted from original full segments vs. segments with simulated missing data in a publicly available data set, e.g. ECG-GUDB... but doing this kind of analysis would become a larger project, so I'm not sure when I would get to it.
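The warning behaviour from question 1 could be sketched like this (the argument name `adjust_for_missing` and the warning text are only suggestions, not released NeuroKit API, and the function body is a placeholder):

```python
import warnings

def hrv_nonlinear(rri, adjust_for_missing=None):
    """Placeholder sketch: warn when a missing-data adjustment method is
    requested but not yet implemented for the nonlinear features."""
    if adjust_for_missing is not None:
        warnings.warn(
            f"The '{adjust_for_missing}' method of adjust_for_missing has "
            "not been implemented for hrv_nonlinear(); features are "
            "computed as if no intervals were missing."
        )
    return {}  # placeholder: nonlinear features would be computed here
```

This way hrv_time() and hrv_frequency() can honour the argument while hrv_nonlinear() degrades gracefully instead of erroring.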

danibene avatar Sep 21 '22 01:09 danibene

Isn't a median/mean NA imputation a good-enough approach for nonlinear?

For hrv_time and hrv_frequency, if the changes are safe, I think we can have them in by default (we don't need an argument to "turn them off"; if people want to do something else, they can deal with their missing data beforehand)

DominiqueMakowski avatar Sep 21 '22 01:09 DominiqueMakowski

Isn't a median/mean NA imputation a good-enough approach for nonlinear?

@DominiqueMakowski I don't think this could be applied as a blanket rule to all the nonlinear features; e.g. for SD1 we would want to make the same kind of adjustments as for RMSSD, removing differences between intervals that are not successive. Also, wouldn't imputation distort the time-delay embeddings used in the complexity features? But I suppose it's better than doing nothing and just having a warning that it's not yet implemented...

For hrv_time and hrv_frequency, if the changes are safe, I think we can have them in by default (we don't need an argument to "turn them off"; if people want to do something else, they can deal with their missing data beforehand)

Fair, I was thinking that we might want to eventually implement multiple methods in order to compare them, but I guess we can cross that bridge when we get to it

danibene avatar Sep 21 '22 01:09 danibene

Yes, it would probably distort the time-delay embedding... we can still throw a warning, but at least this way we don't error out and frustrate the user

DominiqueMakowski avatar Sep 21 '22 02:09 DominiqueMakowski

For hrv_time and hrv_frequency, if the changes are safe, I think we can have them in by default (we don't need an argument to "turn them off"; if people want to do something else, they can deal with their missing data beforehand)

I realized that if the interbeat intervals are detrended, I would not be able to use their corresponding timestamps to detect missing/non-consecutive data. Thoughts on having an additional optional key in the input dictionary (e.g. "RRI", "RRI_Time", "missing_indicator") in order to specify whether the timestamps should be used to check for missing values? Alternatively/additionally, I could just assume that the timestamps are not an indicator of missing data if a certain proportion of them can't be matched with the intervals?
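The fallback heuristic could look something like this (a sketch only; the function name, tolerance, and `min_match` threshold are all made up): check what proportion of intervals match the spacing of their timestamps, and only trust the timestamps as a missingness indicator above that threshold.

```python
import numpy as np

def timestamps_indicate_missing(rri, rri_time, tol=0.05, min_match=0.8):
    """Heuristic sketch: decide whether RRI timestamps can be used to
    flag missing data. If fewer than `min_match` of the intervals match
    the spacing of their timestamps (e.g. because the RRIs were
    detrended), the timestamps are not treated as a missingness
    indicator. Values in seconds; hypothetical helper, not NeuroKit API."""
    rri = np.asarray(rri, dtype=float)
    t = np.asarray(rri_time, dtype=float)  # time of each interval's end
    matches = np.abs(np.diff(t) - rri[1:]) < tol
    if matches.mean() < min_match:
        return None  # timestamps unusable as a missing-data indicator
    return ~matches  # True where an interval is non-successive
```

Returning None (rather than all-False flags) would let the caller distinguish "timestamps are unreliable" from "no data is missing".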

danibene avatar Sep 29 '22 16:09 danibene

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] avatar Dec 03 '22 06:12 stale[bot]