mne-bids [BUG] unintuitive behavior with split files

Description of the problem

Hi,

I think there are some problems with the way mne-bids handles split files, potentially in combination with symlinks. This report is based on this discussion on the forum.

Suppose you have split recordings and try to read them. There are following scenarios:

You're filepath is a symlink to the actual file. You specify the split field (as you should), everything seems to work fine.
You're filepath is a symlink to the actual file. You don't specify the split field (perhaps only a few of your recordings are large enough to be split), mne-bids, doesn't complain and initializes the read process. However, further down the road, mne-python throws an error, because it can't find the second part of the split file.
You're filepath directly lead to the actual file. You specify the split field (as you should), everything seems to work fine.
You're filepath directly lead to the actual file. You do not specify the split field, everything seems to work fine as well.

As you see there are two things going on here:

mne-bids/mne-python finds a split file even if no split recording was specified.
it fails to do that, if the file is a symlink.

The reason for the failure in 2. is that mne-bids resolves the symlink already, such that when mne-python later checks the next_fname field, it gets confused because the specified next_fname is base on the symlink and not that underlying file.

Apart from that, I have doubts whether it is desirable to silently read a file that wasn't requested. Perhaps that should be prevented in first place?

In any case, I think the current situation is a bit confusing. It shouldn't matter on whether a filename points to a file or a symlink whether the split field is required or not. I see two options:

Don't resolve symlinks in mne-bids (mne-python does that anyway), making setting the split field fully optional
Throw an error/warning in mne-bids whenever the split field "is forgotten"

As mentioned on the forum, it seems removing the .resolve() is sufficient for (1) https://github.com/mne-tools/mne-bids/blob/64f95f84677a8b21b39e359d9dd12732a1e8dcec/mne_bids/read.py#L690

Steps to reproduce

# I tested it out private files. 
# If you want I can try to generate a complete MWE with the sample datasetset (next week)

# split specified, path to symlink
path = mb.BIDSPath(..., split=split)
raw = mb.read_raw_bids(path) # works

# split not specified, path to symlink
path2 = mb.BIDSPath(...)
raw2 = mb.read_raw_bids(path2) # fails

# split specified, path to actual file
path3 = mb.BIDSPath(..., split=split)
raw3 = mb.read_raw_bids(path) # works

# split not specified, path to actual file
path4 = mb.BIDSPath(...)
raw4 = mb.read_raw_bids(path2) # works

Expected results

unclear

Actual results

Actual files are read regardless of whether split is specified. Symlinks cannot be read if split is not specified.

Additional information

Running mne-bids 0.11.dev0 and mne-python 1.2.dev0

Sep 02 '22 13:09 eort

Yes please share a sample dataset we can use to test and better understand the issue

Sep 02 '22 13:09 agramfort

Here you go. Most of it has the purpose to generate the file structure necessary to reproduce it. I stumbled over this problem, because I use datalad which uses git-annex to store data. Therefore the original files have some hash-based file path, with a human readable symlink in the appropriate bids directory that points to the originals. I think the randomness of the hash path causes the problem, as finding the next split is not trivial.

So if you have datalad already installed, you can skip most of the code by simply creating a dataset from mne-sample-bids dataset and then to read the data in the different scenarios.

Let me know if anything is unclear


import mne
import mne_bids as mb
import os
import os.path as op
from mne import datasets


# first we need to create all sets of files
# one pair of split files that are real files
# and another  pair of split files that are symlinks

# load the sample data
out_path = op.join('temp', 'sub-01', 'ses-01', 'meg')

os.makedirs(out_path, exist_ok=True)
# we need to more directories where we will hide the original symlinks in later
os.makedirs('temp/foo1', exist_ok=True)
os.makedirs('temp/foo2', exist_ok=True)

raw_dir = op.join(datasets.sample.data_path(), 'MEG', 'sample')
raw = mne.io.read_raw(op.join(raw_dir, 'sample_audvis_raw.fif'))

# define the paths for the files to be written
out_file = op.join(out_path, 'sub-01_ses-01_task-real_run-01_meg.fif')
out_symlink = op.abspath(op.join(out_path, 
                                 'sub-01_ses-01_task-sym_run-01_meg.fif'))

# crop the raw and save them
raw.crop(0, 10)
raw.save(out_file, split_size="10MB", split_naming='bids')
raw.save(out_symlink, split_size="10MB", split_naming='bids')

# adjust paths to match the splits
split1_link_src = out_symlink.replace('_meg', '_split-01_meg')
split2_link_src = out_symlink.replace('_meg', '_split-02_meg')
split1_link_tgt = split1_link_src.replace('sym', 'link')
split2_link_tgt = split2_link_src.replace('sym', 'link')

# move the symlink files somewhere non-bids conforming
# I think it is important that the splits are in different directories
split1_link_src_new = split1_link_src.replace('sub-01/ses-01/meg', 'foo1')
split2_link_src_new = split2_link_src.replace('sub-01/ses-01/meg', 'foo2')
split1_link_tgt = split1_link_tgt.replace('link', 'sym')
split2_link_tgt = split2_link_tgt.replace('link', 'sym')

# mv the original files out of the meg directory
os.system(f'mv {split1_link_src} {split1_link_src_new}')
os.system(f'mv {split2_link_src} {split2_link_src_new}')

# generate symlinks
os.symlink(split1_link_src_new, split1_link_tgt)
os.symlink(split2_link_src_new, split2_link_tgt)

# the actual exercise: trying to read data
sub = '01'
ses = '01'
datatype ='meg'
run = '01'
root = 'temp'
split='01'

# path to symlink
task = 'sym'
# split not specified
path = mb.BIDSPath(root=root, subject=sub, session=ses, task=task,
                    datatype=datatype, run=run)
raw = mb.read_raw_bids(path) # fails

# split specified
path2 = mb.BIDSPath(root=root, subject=sub, session=ses, task=task,
                    datatype=datatype, run=run, split=split)
raw2 = mb.read_raw_bids(path2) # works

# path to real files
task= 'real'
# split not specified
path3 = mb.BIDSPath(root=root, subject=sub, session=ses, task=task,
                    datatype=datatype, run=run)
raw3 = mb.read_raw_bids(path3) # fails

# split specified
path4 = mb.BIDSPath(root=root, subject=sub, session=ses, task=task,
                    datatype=datatype, run=run, split=split)
raw4 = mb.read_raw_bids(path4) # fails

os.system("rm -r temp")

Sep 03 '22 09:09 eort

ok now I think I understand the problem. We use resolve() to resolve symlinks on read. Now this is useful if the name of the second split of the file is still the old one in the original file. But in your case the split-01 file points to the correct split-02 file as see by:

❯ mne show_fiff temp/sub-01/ses-01/meg/sub-01_ses-01_task-real_run-01_split-01_meg.fif -t 118 118 = FIFFB_REF 115 = FIFF_REF_ROLE (4b >i4) = [2] 118 = FIFF_REF_FILE_NAME (47b str) = sub-01_ses-01_task-real_run-01_split-02_meg.fif ... str len=47 116 = FIFF_REF_FILE_ID (20b ids) = {'version': 65537, 'machid': array([ 3173893, -1922170880], dtype=i ... dict len=4 117 = FIFF_REF_FILE_NUM (4b >i4) = [1]

(note I had to extend the default string length in show_fiff function to see the full file)

so here we should not resolve. Now I assume this was added cause users wanted to avoid writing new files hence keep the original pointers to the old naming in the storage server.

what do you think we should do?

Message ID: @.***>

Sep 04 '22 19:09 agramfort

Now I assume this was added cause users wanted to avoid writing new files hence keep the original pointers to the old naming in the storage server.

In this scenario, what would happen if the split field is specified in the path for reading? It seems that specifying it solves the issues with symlinks that I have encountered. If they also work for the situation you describe, one solution could be to require the split field when reading split files.

Sep 05 '22 06:09 eort

specifying the split entity is not that easy as if you have fif files > 2GB you get split files automatically...

as the 2 uses cases (resolve or not symlinks) makes sense I am wondering if we could not have a resolver_symlink param in read_raw_bids

anyone sees an alternative to this?

Message ID: @.***>

Sep 05 '22 07:09 agramfort

similar issues have come up before, see https://github.com/mne-tools/mne-python/issues/9221 and the fix (though apparently not enough of a fix?) in https://github.com/mne-tools/mne-python/pull/9227.

cc @adswa in case she has some ideas here.

Sep 05 '22 14:09 drammock

mne-bids mne-bids copied to clipboard

[BUG] unintuitive behavior with split files

Description of the problem

Steps to reproduce

Expected results

Actual results

Additional information

mne-bids
mne-bids copied to clipboard