EDF export changes channel values
Description of the problem
Taking an EDF file, subsetting the channels, and then exporting it changes the channel values.
Steps to reproduce
import mne.io
# assume you have a local copy of some TUH data
edffile = mne.io.read_raw_edf("aaaaaefp_s007_t001.edf", verbose=False, preload=True)
edffile.pick(["EEG FP1-REF", "EEG FP2-REF", "EEG F3-REF"])
edffile.export("new.edf")
subchannels = mne.io.read_raw_edf("new.edf", verbose=False, preload=True)
print((edffile.to_data_frame() - subchannels.to_data_frame()).abs().sum())
Link to data
No response
Expected results
zeros everywhere
Actual results
nonzero values
Additional information
mne.sys_info()

Platform    Linux-6.13.9-100.fc40.x86_64-x86_64-with-glibc2.39
Python      3.12.9 (main, Feb 4 2025, 00:00:00) [GCC 14.2.1 20240912 (Red Hat 14.2.1-3)]
Executable  /home/yaxxie/.cache/pypoetry/virtualenvs/lulzi-pork-model-LOV6wgKF-py3.12/bin/python
CPU         Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz (12 cores)
Memory      31.2 GiB

Core
├☑ mne           1.9.0 (latest release)
├☑ numpy         1.26.4 (OpenBLAS 0.3.23.dev with 12 threads)
├☑ scipy         1.15.2
└☑ matplotlib    3.10.1 (backend=tkagg)

Numerical (optional)
├☑ sklearn       1.6.1
├☑ numba         0.61.2
├☑ pandas        2.2.3
└☐ unavailable   nibabel, nilearn, dipy, openmeeg, cupy, h5io, h5py

Visualization (optional)
└☐ unavailable   pyvista, pyvistaqt, vtk, qtpy, ipympl, pyqtgraph, mne-qt-browser, ipywidgets, trame_client, trame_server, trame_vtk, trame_vuetify

Ecosystem (optional)
├☑ edfio         0.4.8
└☐ unavailable   mne-bids, mne-nirs, mne-features, mne-connectivity, mne-icalabel, mne-bids-pipeline, neo, eeglabio, mffpy, pybv
Hello! 👋 Thanks for opening your first issue here! ❤️ We will try to get back to you soon. 🚴
I would expect some potential deviation due to having to format the data properly, but these errors should be small. Can you use something like np.testing.assert_allclose(edffile.get_data(), subchannels.get_data(), atol=1e-14, rtol=1e-6) and see what that does? The output will be more meaningful/useful.
Also, if you could share a problematic file, that would help.
> I would expect some potential deviation due to having to format the data properly
Can you elaborate on this? I'm picking only a subset of channels, so I don't see a reason for channel values to change as a consequence of any formatting. If the underlying digital samples for a selected channel are, say, [1, 2, 3], they should still be [1, 2, 3], no?
> Also, if you could share a problematic file, that would help.
I gave an example using a file from the TUH EEG corpus, assuming most people in this space work with it. I don't believe the terms of obtaining this data permit me to redistribute it.
> np.testing.assert_allclose(edffile.get_data(), subchannels.get_data(), atol=1e-14, rtol=1e-6)
This does flag; the relative average per-sample deviation was something like 0.07.
> I gave an example using a file from the TUH EEG corpus, assuming most people in this space work with it.
There are thousands of EEG/MEG datasets; not everyone is familiar with every dataset. Could you at least provide a link to the source? A direct link to the specific data file is preferred, but if the repository doesn't allow that, please tell us the subject / session / condition / etc.
Sure, you can find this particular file here
Hmm, that site requires a login. It looks like to get credentials one must fill in a form, email that form to the maintainers, and wait for a response: https://isip.piconepress.com/projects/nedc/html/tuh_eeg/
The issue can be reproduced with this file:
import mne.io
# assume you have a local copy of some CHB-MIT Scalp EEG data
edffile = mne.io.read_raw_edf("chb01_01.edf", verbose=False, preload=True)
edffile.pick(["FP1-F7", "F7-T7", "T7-P7"])
edffile.export("new.edf")
subchannels = mne.io.read_raw_edf("new.edf", verbose=False, preload=True)
print((edffile.to_data_frame() - subchannels.to_data_frame()).abs().mean())
Thanks for the file. I can reproduce. So far I've traced things through the importer and am not seeing any obvious culprits; I haven't looked at the exporter yet. The bug doesn't actually require subselecting channels; this will fail:
import numpy as np
import mne
original = mne.io.read_raw_edf("chb01_01.edf", verbose=False, preload=True)
original.export("new.edf", overwrite=True)
exported = mne.io.read_raw_edf("new.edf", verbose=False, preload=True)
np.testing.assert_allclose(
    original.get_data(), exported.get_data(), atol=1e-14, rtol=1e-6
)
Given that ~99% of elements mismatch, I'm guessing this is probably an int -> float -> int problem, or possibly mis-setting the digital/physical min/max in the exported file. But that's just a wild guess.
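To make that second guess concrete, here is a minimal sketch of the EDF scaling math (my own illustration, not MNE's actual code) showing how re-quantizing samples against a different digital/physical range perturbs nearly every value by up to half a step:

import numpy as np

def to_physical(digital, pmin, pmax, dmin, dmax):
    # EDF scaling: gain and offset are derived from the header ranges
    gain = (pmax - pmin) / (dmax - dmin)
    return (digital - dmin) * gain + pmin

def to_digital(physical, pmin, pmax, dmin, dmax):
    gain = (pmax - pmin) / (dmax - dmin)
    return np.round((physical - pmin) / gain + dmin).astype(np.int16)

rng = np.random.default_rng(0)
orig = rng.integers(-2048, 2048, size=1000)   # 12-bit samples, as in the original file
phys = to_physical(orig, -800, 800, -2048, 2047)
# re-quantize against the different, wider ranges used by the export
redig = to_digital(phys, -882.443, 1037.949, -32767, 32767)
round_trip = to_physical(redig, -882.443, 1037.949, -32767, 32767)
print(np.abs(phys - round_trip).max())        # bounded by half the new resolution, ~0.0147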
Thanks for the comments. I dug a little more myself and suspect it is edfio: when I re-export using pyedflib, most of the values come back with zero difference (though a few samples seem to get nudged by what looks like ±1 in the digital representation). That made me curious: is there any reason edfio is the exporter library within MNE rather than pyedflib, given that pyedflib is a dependency anyway?
With the provided file, we have this during read (omitting the array in cases where it's an array of identical values for all channels):
- cal = 0.39072039
- data_offset = 6144
- digital_max = 2047.
- dtype_np = '<i2'
- offsets = 0.1953602
- physical_max = 800.
- units = 1.e-06
- first 8 values of first channel:
-374, 0, 0, 0, 0, 2, 1, -5
We then do (data * cal + offset) * units to get -1.45934066e-04, 1.95360195e-07, 1.95360195e-07, 1.95360195e-07, 1.95360195e-07, 9.76800977e-07, 5.86080586e-07, -1.75824176e-06
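As a sanity check, that read-side computation can be reproduced from the dumped values (a small sketch; note that cal and offset follow directly from the header ranges listed above):

import numpy as np

# header-derived scaling from the dump above:
# cal = (800 - (-800)) / (2047 - (-2048)) = 0.39072039
# offset = 800 - cal * 2047 = 0.1953602
cal, offset, units = 0.39072039, 0.1953602, 1e-6
digital = np.array([-374, 0, 0, 0, 0, 2, 1, -5])
print((digital * cal + offset) * units)
# [-1.45934066e-04  1.95360195e-07 ...  5.86080586e-07 -1.75824176e-06]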
When writing, we have
- digital_max = 32767
- physical_max = 1037.9487179487178
- signal.data =
  array([-145.94307227, 0.19941197, 0.19941197, 0.19941197, 0.19941197, 0.98769738, 0.59355467, -1.77130155]) (after scaling back to units, a.k.a. µV)
- signal.digital =
  array([-9285, -4094, -4094, -4094, -4094, -4066, -4080, -4164], dtype=int16)
A couple of things to note:
- original file seems to be 12-bit ints? (digital_max = 2047)
- the data * cal + offset computation that happens on read is not being reversed on write (see the sketch below)
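For what it's worth, if the read-side transform simply weren't reversed, inverting it by hand would recover the original integers exactly. A minimal sketch of that inverse (illustrating the claim only; this is not MNE's or edfio's actual write path):

import numpy as np

cal, offset, units = 0.39072039, 0.1953602, 1e-6
digital = np.array([-374, 0, 0, 0, 0, 2, 1, -5])
physical = (digital * cal + offset) * units   # what read_raw_edf returns
# inverting the read-side transform recovers the original samples exactly
recovered = np.round((physical / units - offset) / cal).astype(np.int16)
print(np.array_equal(recovered, digital))     # True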
> I dug a little more myself and suspect it is edfio
To me it seems the bug is in our code that prepares the data values for EDFio; IIUC we're just not accounting for cal and offset.
> is there any reason edfio is the exporter library within MNE rather than pyedflib, given that pyedflib is a dependency anyway?
pyedflib is not a dependency of MNE-Python. The only result:
$ git grep --ignore-case pyedflib
tutorials/io/20_reading_eeg_data.py:66:`pyedflib <https://github.com/holgern/pyedflib>`_ under the hood) can be used
points to an out-of-date statement about how MNELAB works.
> pyedflib is not a dependency of MNE-Python.
You're right, my mistake. It is a transitive dependency of another package in my project.
Thanks for the detailed write-up!
@hofaflo do you have time to look? I'm not sure how urgent/serious this problem is, i.e., if it affects all exported files or just some very specific cases...
The original file uses the following calibration values:
- physical range: [-800, 800]
- digital range: [-2048, 2047]
whereas the export uses these:
- physical range: [-882.443, 1037.949] (the actual physical min/max over all signals)
- digital range: [-32767, 32767] (almost the maximum range possible in EDF, but centered around 0)
This causes small differences in signal values, but the maximum deviation (0.014648) is well below the resolution of the original file ($(800+800)/(2047+2048)=0.39072$) and - as expected - slightly below half the resolution of the exported one ($(1037.949+882.443)/(32767+32767)=0.0293$).
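That bound can be checked from the header values alone; a quick back-of-the-envelope sketch:

# resolution = width of the physical range per digital step
orig_res = (800 + 800) / (2047 + 2048)             # 0.390720 uV/step (original file)
new_res = (1037.949 + 882.443) / (32767 + 32767)   # 0.029304 uV/step (exported file)
# the worst-case error introduced by re-quantizing is half a step
print(new_res / 2)  # ~0.014652, consistent with the observed max deviation of 0.014648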
> To me it seems the bug is in our code that prepares the data values for EDFio; IIUC we're just not accounting for cal and offset.
This is not an issue; MNE passes the digital and physical ranges to edfio, which takes care of that here.
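If I'm reading the edfio API correctly (treat the exact signature as an assumption on my part), passing both ranges when constructing the signal is what delegates the gain/offset handling to edfio:

import numpy as np
from edfio import Edf, EdfSignal

# physical values in uV; ranges taken from the exported file discussed above
data = np.tile([-145.943, 0.199, 0.988, -1.771], 64)  # 1 s of data at 256 Hz
signal = EdfSignal(
    data,
    sampling_frequency=256,
    label="EEG Fp1",
    physical_dimension="uV",
    physical_range=(-882.443, 1037.949),
    digital_range=(-32767, 32767),
)
Edf([signal]).write("sketch.edf")  # edfio derives gain/offset from the two ranges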
Thanks @hofaflo! As a follow-up question: why would someone not use the entire 16 bits of resolution, but instead only e.g. 12 bits as in this file?
A reason could be to keep consistency with the device that recorded the signals, e.g. if a 12-bit ADC was used. Not sure whether that applies to this concrete file, though, as the database page mentions that "All signals were sampled at 256 samples per second with 16-bit resolution." Also, the file actually contains data exceeding the specified digital range, which is not strictly forbidden by the EDF standard: the physical and digital ranges are only required to allow calculating the correct gain and offset, and specifying (-800, 800) and (-2048, 2047) produces the same results as (-12800, 12800) and (-32767, 32768).
A more common reason to use a custom digital range is to store integer-valued signals without loss, e.g. SpO2, trigger, or hypnogram signals.
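For instance (my own hypothetical illustration, not from any file in this thread), choosing the ranges so that the gain is exactly 1 stores integer samples losslessly:

# hypothetical SpO2 channel, integer percent values
spo2 = [97, 98, 98, 96, 95]
# physical_range=(0, 100) with digital_range=(0, 100) gives
# gain = (100 - 0) / (100 - 0) = 1 and offset = 0,
# so every integer sample maps to itself and the round trip is exact
gain = (100 - 0) / (100 - 0)
digital = [round(v / gain) for v in spo2]
assert digital == spo2  # lossless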
> the file actually contains data exceeding the specified digital range, which is not strictly forbidden by the EDF standard: the physical and digital ranges are only required to allow calculating the correct gain and offset
I don't agree. The EDF spec says:
"The digital minimum and maximum of each signal should specify the extreme values that can occur in the data records."