mne-python icon indicating copy to clipboard operation
mne-python copied to clipboard

Default units in raw.to_data_frame()

Open chapochn opened this issue 1 year ago • 21 comments

Description of the problem

It is not a bug per se, but I feel it is a quite misleading default. Getting the data through "get_data" and "to_data_frame()" give different output values by default, because to_data_frame converts to a "standard" unit. I am not sure if it is intended? My suggestion would be to not have such behavior by default. Although maybe you want to_data_frame() to be coherent with plotting rather than "get_data". I would think it makes more sense to be coherent with "get_data". Thoughts?

Steps to reproduce

import numpy as np
import mne

n=1
X = np.ones((1, n))
chs = ['ch1']
info = mne.create_info(chs, 5, ch_types='eeg')
raw = mne.io.RawArray(X, info)

df = raw.to_data_frame(index='time')
data = raw.get_data()
print(df.values, data)

Link to data

No response

Expected results

I would have expected to get: [[1.]] [[1]]

Actual results

[[1000000.]] [[1.]]

I can get the same result for both by using: df = raw.to_data_frame(index='time', scalings=1)

Additional information

Core ├☑ mne 1.7.1 (latest release) ├☑ numpy 1.26.4 (OpenBLAS 0.3.27 with 12 threads) ├☑ scipy 1.14.0 └☑ matplotlib 3.8.4Backend MacOSX is interactive backend. Turning interactive mode on. (backend=MacOSX) Numerical (optional) ├☑ sklearn 1.5.1 ├☑ numba 0.60.0 ├☑ nibabel 5.2.1 ├☑ nilearn 0.10.4 ├☑ dipy 1.9.0 ├☑ openmeeg 2.5.11 ├☑ pandas 2.2.2 ├☑ h5io 0.2.3 ├☑ h5py 3.11.0 └☐ unavailable cupy Visualization (optional) ├☑ pyvista 0.43.10 (OpenGL 4.1 Metal - 88.1 via Apple M2 Pro) ├☑ pyvistaqt 0.11.1 objc[24995]: Class QT_ROOT_LEVEL_POOL__THESE_OBJECTS_WILL_BE_RELEASED_WHEN_QAPP_GOES_OUT_OF_SCOPE is implemented in both /Users/.../Programs/miniconda3/envs/mne/lib/libQt5Core.5.15.8.dylib (0x17f9e12e0) and /Users/.../Programs/miniconda3/envs/mne/lib/libQt6Core.6.7.2.dylib (0x3302bd558). One of the two will be used. Which one is undefined. objc[24995]: Class KeyValueObserver is implemented in both /Users/.../Programs/miniconda3/envs/mne/lib/libQt5Core.5.15.8.dylib (0x17f9e1308) and /Users/.../Programs/miniconda3/envs/mne/lib/libQt6Core.6.7.2.dylib (0x3302bd580). One of the two will be used. Which one is undefined. objc[24995]: Class RunLoopModeTracker is implemented in both /Users/.../Programs/miniconda3/envs/mne/lib/libQt5Core.5.15.8.dylib (0x17f9e1358) and /Users/.../Programs/miniconda3/envs/mne/lib/libQt6Core.6.7.2.dylib (0x3302bd5d0). One of the two will be used. Which one is undefined. ├☑ vtk 9.3.0 ├☑ qtpy 2.4.1 Process finished with exit code 139 (interrupted by signal 11:SIGSEGV)

(it crashes here)

chapochn avatar Jul 07 '24 02:07 chapochn

I think it was a very conscious decision to have to_data_frame return values as plotted and with numbers closer to 1...

but I agree it can be felt a bit too magic.

note that this has been like this for years and you're the first person to report that it's a problem. I would therefore not jump on the conclusion that this needs to be fixed.

Message ID: @.***>

agramfort avatar Jul 08 '24 07:07 agramfort

@chapochn Thanks for this suggestion!

I have to agree with @agramfort on this one, the purpose of the DataFrame export is different from numpy array extraction IMHO

@sappelhoff @cbrnr WDYT

hoechenberger avatar Jul 09 '24 07:07 hoechenberger

I also think that this is not a bug but intended behavior. It is also documented (https://mne.tools/stable/generated/mne.io.Raw.html#mne.io.Raw.to_data_frame), but maybe we could highlight this behavior more clearly at the top of the to_data_frame docstring or in a Notes section (instead of just in the description of the scalings parameter)?

cbrnr avatar Jul 09 '24 07:07 cbrnr

I agree with Alex, Richard, and Clemens. This behavior is as intended and IMHO also helpful/useful. Highlighting this behavior more clearly could be a valuable addition to the documentation.

sappelhoff avatar Jul 09 '24 08:07 sappelhoff

An added documentation to clarify the intention of this function sounds great. Otherwise, it wasn't really clear to me that it is not just a more convenient get_data function (I mainly work with panda DataFrames for my own analysis)

chapochn avatar Jul 09 '24 15:07 chapochn

@chapochn would you be interested in submitting a PR?

cbrnr avatar Jul 09 '24 16:07 cbrnr

I could write the intention, but TBH, I don't understand the intention... For me, it's just confusing to get another data than what is actually in the data... If it is needs to be plotted, plots can handle scaling the data, so I really don't see the necessity.

chapochn avatar Jul 09 '24 16:07 chapochn

The other confusing thing is that in scaling it says:

Scaling factor applied to the channels picked. If None, defaults to dict(eeg=1e6, mag=1e15, grad=1e13) — i.e., converts EEG to µV, magnetometers to fT, and gradiometers to fT/cm.

But it doesn't cover all the other types of data that are possible. I am using ecog, it's not clear what kind of scaling it does. What about ekg channels, misc, etc.

chapochn avatar Jul 09 '24 17:07 chapochn

I see these are all the units that are covered in Raw:

Proper units of measure: V: eeg, eog, seeg, dbs, emg, ecg, bio, ecog T: mag T/m: grad M: hbo, hbr Am: dipole AU: misc

So would be good to explicitly say what is the conversion for each unit. And if it is only based on the original unit, or also on the modality. I.e., are eeg, eog, seeg, dbs, emg, ecg, bio, ecog all the same conversion, or if it depends.

I believe these signals, although all in V, can have different orders of magnitude

chapochn avatar Jul 09 '24 20:07 chapochn

It's likely that the scaling values and related channel types pull from mne.defaults somewhere that we could document in our glossary if we don't already. They are probably used elsewhere anyway.

larsoner avatar Jul 09 '24 20:07 larsoner

Yes, it's here:

The set of channels considered “data channels” in MNE contains the following types (together with scale factors for plotting)

https://mne.tools/stable/documentation/glossary.html#term-data-channels and here:

The set of channels considered “non-data channels” in MNE contains the following types (together with scale factors for plotting):

https://mne.tools/stable/documentation/glossary.html#term-non-data-channels

Should probably then add "and for DataFrame export"

And it could be cross-linked to the places where these scalings are used.

chapochn avatar Jul 09 '24 20:07 chapochn

I would like to pick this issue. Can you please assign it to me?

jadu2002 avatar Dec 08 '24 01:12 jadu2002

@jadu2002 nobody else is working on it currently, so yes please open a pull request when you are ready for feedback!

drammock avatar Dec 08 '24 02:12 drammock

we can add a new parameter, native_units to the to_data_frame method. When native_units=True, (to_data_frame) will return the raw data in native units without applying any scaling. this would be consistent with get_data and we keep default of native_units=false,which keeps the behaviour intact. Am i going right?

b423016 avatar Dec 12 '24 14:12 b423016

@b423016 The consensus from several maintainers' comments above is "behavior is intended, and improved documentation is the preferred solution". Adding a new parameter like si_units=False or scalings="auto" | None | dict would certainly be a way to provide the option to users... but as @agramfort said above

this has been like this for years and you're the first person to report that it's a problem. I would therefore not jump on the conclusion that this needs to be fixed.

so @b423016 are you also wishing for this capability, or merely suggesting a fix?

drammock avatar Dec 12 '24 16:12 drammock

Adding a new parameter like si_units=False or scalings="auto" | None | dict would certainly be a way to provide the option to users...

We don't need to add a new parameter because scalings already exists, and if you do not want to rescale the data then you can set it to 1. As far as I understand the issue, this is about the default.

cbrnr avatar Dec 12 '24 18:12 cbrnr

We don't need to add a new parameter because scalings already exists

😳 oops.

drammock avatar Dec 12 '24 19:12 drammock

well as @cbrnr mentioned I guess it would be better just to improve the docs instead of try fixing or adding parameters

@b423016 The consensus from several maintainers' comments above is "behavior is intended, and improved documentation is the preferred solution". Adding a new parameter like si_units=False or scalings="auto" | None | dict would certainly be a way to provide the option to users... but as @agramfort said above

this has been like this for years and you're the first person to report that it's a problem. I would therefore not jump on the conclusion that this needs to be fixed.

so @b423016 are you also wishing for this capability, or merely suggesting a fix?

b423016 avatar Dec 12 '24 19:12 b423016

I agree that clarifying the scaling behavior explicitly in the documentation would be helpful, especially for users working with less common channel types like ECOG, EKG, or MISC.

Would it make sense to add a "Notes" section in the to_data_frame() docstring to highlight this behavior and cross-link it to the glossary where the channel types and their default scaling values are already listed? This would ensure that users are aware of the conversions applied during DataFrame export.

PoojithaYelkur avatar Mar 26 '25 12:03 PoojithaYelkur

Would it make sense to add a "Notes" section in the to_data_frame() docstring ...

Sure, that sounds helpful!

larsoner avatar Mar 29 '25 18:03 larsoner

Hi! I’m a newcomer interested in contributing. May I please work on this issue?

Himasri17 avatar Apr 05 '25 15:04 Himasri17