EDF export changes channel values
Description of the problem
Taking an EDF file, subsetting the channels, and then exporting it changes the channel values.
Steps to reproduce
import mne.io
# assume you have a local copy of some TUH data
edffile = mne.io.read_raw_edf("aaaaaefp_s007_t001.edf", verbose=False, preload=True)
edffile.pick(["EEG FP1-REF", "EEG FP2-REF", "EEG F3-REF"])
edffile.export("new.edf")
subchannels = mne.io.read_raw_edf("new.edf", verbose=False, preload=True)
print((edffile.to_data_frame() - subchannels.to_data_frame()).abs().sum())
Link to data
No response
Expected results
zeros everywhere
Actual results
nonzero values
Additional information
mne.sys_info()

Platform    Linux-6.13.9-100.fc40.x86_64-x86_64-with-glibc2.39
Python      3.12.9 (main, Feb 4 2025, 00:00:00) [GCC 14.2.1 20240912 (Red Hat 14.2.1-3)]
Executable  /home/yaxxie/.cache/pypoetry/virtualenvs/lulzi-pork-model-LOV6wgKF-py3.12/bin/python
CPU         Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz (12 cores)
Memory      31.2 GiB

Core
├☑ mne           1.9.0 (latest release)
├☑ numpy         1.26.4 (OpenBLAS 0.3.23.dev with 12 threads)
├☑ scipy         1.15.2
└☑ matplotlib    3.10.1 (backend=tkagg)

Numerical (optional)
├☑ sklearn       1.6.1
├☑ numba         0.61.2
├☑ pandas        2.2.3
└☐ unavailable   nibabel, nilearn, dipy, openmeeg, cupy, h5io, h5py

Visualization (optional)
└☐ unavailable   pyvista, pyvistaqt, vtk, qtpy, ipympl, pyqtgraph, mne-qt-browser, ipywidgets, trame_client, trame_server, trame_vtk, trame_vuetify

Ecosystem (optional)
├☑ edfio         0.4.8
└☐ unavailable   mne-bids, mne-nirs, mne-features, mne-connectivity, mne-icalabel, mne-bids-pipeline, neo, eeglabio, mffpy, pybv
Hello! 👋 Thanks for opening your first issue here! ❤️ We will try to get back to you soon. 🚴
I would expect some potential deviation due to having to format the data properly, but these errors should be small. Can you use something like np.testing.assert_allclose(edffile.get_data(), subchannels.get_data(), atol=1e-14, rtol=1e-6) and see what that does? The output will be more meaningful/useful.
Also, if you could share a problematic file, that would help.
> I would expect some potential deviation due to having to format the data properly
Can you elaborate on this? I'm picking only a subset of channels, so I don't see a reason for channel values to change as a consequence of any formatting. If the underlying digital samples for a selected channel are, say, [1, 2, 3], they should still be [1, 2, 3], no?
> Also, if you could share a problematic file, that would help.
I gave an example using a file from the TUH EEG corpus, assuming most people in this space work with it. I don't believe the terms of obtaining this data permit me to redistribute it.
> np.testing.assert_allclose(edffile.get_data(), subchannels.get_data(), atol=1e-14, rtol=1e-6)
This does flag; the relative average per-sample deviation was something like 0.07.
> I gave an example using a file from the TUH EEG corpus, assuming most people in this space work with it.
There are thousands of EEG/MEG datasets; not everyone is familiar with every dataset. Could you at least provide a link to the source? A direct link to the specific data file is preferred, but if the repository doesn't allow that, please tell us the subject / session / condition / etc.
Sure, you can find this particular file here
Hmm, that site requires a login. It looks like to get credentials one must fill in a form, email that form to the maintainers, and wait for a response: https://isip.piconepress.com/projects/nedc/html/tuh_eeg/
The issue can be reproduced with this file:
import mne.io
# assume you have a local copy of some CHB-MIT Scalp EEG data
edffile = mne.io.read_raw_edf("chb01_01.edf", verbose=False, preload=True)
edffile.pick(["FP1-F7", "F7-T7", "T7-P7"])
edffile.export("new.edf")
subchannels = mne.io.read_raw_edf("new.edf", verbose=False, preload=True)
print((edffile.to_data_frame() - subchannels.to_data_frame()).abs().mean())
Thanks for the file. I can reproduce. So far I've traced things through the importer and am not seeing any obvious culprits; I haven't looked at the exporter yet. The bug doesn't actually require subselecting channels; this will fail:
import numpy as np
import mne
original = mne.io.read_raw_edf("chb01_01.edf", verbose=False, preload=True)
original.export("new.edf", overwrite=True)
exported = mne.io.read_raw_edf("new.edf", verbose=False, preload=True)
np.testing.assert_allclose(
    original.get_data(), exported.get_data(), atol=1e-14, rtol=1e-6
)
Given that ~99% of elements mismatch, I'm guessing this is probably an int -> float -> int problem, or possibly mis-setting the digital/physical min/max in the exported file. But that's just a wild guess.
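To make that second guess concrete, here is a minimal sketch of the EDF scaling math (my own illustration, not MNE's actual code) showing how re-quantizing samples against a different digital/physical range perturbs nearly every value by up to half a step:

import numpy as np

def to_physical(digital, pmin, pmax, dmin, dmax):
    # EDF scaling: gain and offset are derived from the header ranges
    gain = (pmax - pmin) / (dmax - dmin)
    return (digital - dmin) * gain + pmin

def to_digital(physical, pmin, pmax, dmin, dmax):
    gain = (pmax - pmin) / (dmax - dmin)
    return np.round((physical - pmin) / gain + dmin).astype(np.int16)

rng = np.random.default_rng(0)
orig = rng.integers(-2048, 2048, size=1000)   # 12-bit samples, as in the original file
phys = to_physical(orig, -800, 800, -2048, 2047)
# re-quantize against the different, wider ranges used by the export
redig = to_digital(phys, -882.443, 1037.949, -32767, 32767)
round_trip = to_physical(redig, -882.443, 1037.949, -32767, 32767)
print(np.abs(phys - round_trip).max())        # bounded by half the new resolution, ~0.0147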
Thanks for the comments. I dug a little more myself and suspect it is edfio: when I re-export using pyedflib, most of the values come back with zero difference (though a few samples seem to get nudged by what looks like ±1 in the digital representation). That made me curious: is there any reason edfio is the exporter library within MNE rather than pyedflib, given that pyedflib is a dependency anyway?
With the provided file, we have this during read (omitting the array in cases where it's an array of identical values for all channels):
- cal = 0.39072039
- data_offset = 6144
- digital_max = 2047.
- dtype_np = '<i2'
- offsets = 0.1953602
- physical_max = 800.
- units = 1.e-06
- first 8 values of first channel:
-374, 0, 0, 0, 0, 2, 1, -5
We then do (data * cal + offset) * units to get -1.45934066e-04, 1.95360195e-07, 1.95360195e-07, 1.95360195e-07, 1.95360195e-07, 9.76800977e-07, 5.86080586e-07, -1.75824176e-06
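As a sanity check, that read-side computation can be reproduced from the dumped values (a small sketch; note that cal and offset follow directly from the header ranges listed above):

import numpy as np

# header-derived scaling from the dump above:
# cal = (800 - (-800)) / (2047 - (-2048)) = 0.39072039
# offset = 800 - cal * 2047 = 0.1953602
cal, offset, units = 0.39072039, 0.1953602, 1e-6
digital = np.array([-374, 0, 0, 0, 0, 2, 1, -5])
print((digital * cal + offset) * units)
# [-1.45934066e-04  1.95360195e-07 ...  5.86080586e-07 -1.75824176e-06]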
When writing, we have
- digital_max = 32767
- physical_max = 1037.9487179487178
- signal.data =
  array([-145.94307227, 0.19941197, 0.19941197, 0.19941197, 0.19941197, 0.98769738, 0.59355467, -1.77130155]) (after scaling back to units, a.k.a. µV)
- signal.digital =
  array([-9285, -4094, -4094, -4094, -4094, -4066, -4080, -4164], dtype=int16)
A couple of things to note:
- original file seems to be 12-bit ints? (digital_max = 2047)
- the data * cal + offset computation that happens on read is not being reversed on write (see the sketch below)
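For what it's worth, if the read-side transform simply weren't reversed, inverting it by hand would recover the original integers exactly. A minimal sketch of that inverse (illustrating the claim only; this is not MNE's or edfio's actual write path):

import numpy as np

cal, offset, units = 0.39072039, 0.1953602, 1e-6
digital = np.array([-374, 0, 0, 0, 0, 2, 1, -5])
physical = (digital * cal + offset) * units   # what read_raw_edf returns
# inverting the read-side transform recovers the original samples exactly
recovered = np.round((physical / units - offset) / cal).astype(np.int16)
print(np.array_equal(recovered, digital))     # True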
> I dug a little more myself and suspect it is edfio
To me it seems the bug is in our code that prepares the data values for EDFio; IIUC we're just not accounting for cal and offset.
> is there any reason edfio is the exporter library within MNE rather than pyedflib, given that pyedflib is a dependency anyway?
pyedflib is not a dependency of MNE-Python. The only result:
$ git grep --ignore-case pyedflib
tutorials/io/20_reading_eeg_data.py:66:`pyedflib <https://github.com/holgern/pyedflib>`_ under the hood) can be used
points to an out-of-date statement about how MNELAB works.
> pyedflib is not a dependency of MNE-Python.
You're right, my mistake. It is a transitive dependency of another package in my project.
Thanks for the detailed write-up!
@hofaflo do you have time to look? I'm not sure how urgent/serious this problem is, i.e., if it affects all exported files or just some very specific cases...
The original file uses the following calibration values:
- physical range: [-800, 800]
- digital range: [-2048, 2047]
whereas the export uses these:
- physical range: [-882.443, 1037.949] (the actual physical min/max over all signals)
- digital range: [-32767, 32767] (almost the maximum range possible in EDF, but centered around 0)
This causes small differences in signal values, but the maximum deviation (0.014648) is well below the resolution of the original file ($(800+800)/(2047+2048)=0.39072$) and - as expected - slightly below half the resolution of the exported one ($(1037.949+882.443)/(32767+32767)=0.0293$).
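That bound can be checked from the header values alone; a quick back-of-the-envelope sketch:

# resolution = width of the physical range per digital step
orig_res = (800 + 800) / (2047 + 2048)             # 0.390720 uV/step (original file)
new_res = (1037.949 + 882.443) / (32767 + 32767)   # 0.029304 uV/step (exported file)
# the worst-case error introduced by re-quantizing is half a step
print(new_res / 2)  # ~0.014652, consistent with the observed max deviation of 0.014648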
> To me it seems the bug is in our code that prepares the data values for EDFio; IIUC we're just not accounting for cal and offset.
This is not an issue; MNE passes the digital and physical ranges to edfio, which takes care of that here.
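If I'm reading the edfio API correctly (treat the exact signature as an assumption on my part), passing both ranges when constructing the signal is what delegates the gain/offset handling to edfio:

import numpy as np
from edfio import Edf, EdfSignal

# physical values in uV; ranges taken from the exported file discussed above
data = np.tile([-145.943, 0.199, 0.988, -1.771], 64)  # 1 s of data at 256 Hz
signal = EdfSignal(
    data,
    sampling_frequency=256,
    label="EEG Fp1",
    physical_dimension="uV",
    physical_range=(-882.443, 1037.949),
    digital_range=(-32767, 32767),
)
Edf([signal]).write("sketch.edf")  # edfio derives gain/offset from the two ranges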
Thanks @hofaflo! As a follow-up question: why would someone not use the entire 16 bits of resolution, but instead only e.g. 12 bits as in this file?
A reason could be to keep consistency with the device that recorded the signals, e.g. if a 12-bit ADC was used. Not sure whether that applies to this concrete file, though, as the database page mentions that "All signals were sampled at 256 samples per second with 16-bit resolution." Also, the file actually contains data exceeding the specified digital range, which is not strictly forbidden by the EDF standard: the physical and digital ranges are only required to allow calculating the correct gain and offset, and specifying (-800, 800) and (-2048, 2047) produces the same results as (-12800, 12800) and (-32767, 32768).
A more common reason to use a custom digital range is to store integer-valued signals without loss, e.g. SpO2, trigger, or hypnogram signals.
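For instance (my own hypothetical illustration, not from any file in this thread), choosing the ranges so that the gain is exactly 1 stores integer samples losslessly:

# hypothetical SpO2 channel, integer percent values
spo2 = [97, 98, 98, 96, 95]
# physical_range=(0, 100) with digital_range=(0, 100) gives
# gain = (100 - 0) / (100 - 0) = 1 and offset = 0,
# so every integer sample maps to itself and the round trip is exact
gain = (100 - 0) / (100 - 0)
digital = [round(v / gain) for v in spo2]
assert digital == spo2  # lossless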
> the file actually contains data exceeding the specified digital range, which is not strictly forbidden by the EDF standard: the physical and digital ranges are only required to allow calculating the correct gain and offset
I don't agree. The EDF spec says:
"The digital minimum and maximum of each signal should specify the extreme values that can occur in the data records."