Function for creating toy data

Open cbrnr opened this issue 1 year ago • 17 comments

I often find myself generating toy data (e.g. for educational or testing purposes), so I thought a dedicated function might be useful.

It should be as simple as possible, for example:

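from mne import create_info
from mne.io import RawArray
from numpy.random import default_rng

def create_toy_data(n_channels=3, duration=25, sfreq=250, seed=None):
    """Return a RawArray of Gaussian noise (duration in s, sfreq in Hz)."""
    rng = default_rng(seed)
    n_samples = int(duration * sfreq)  # the array size must be an integer
    # scale to a standard deviation of 5 µV (MNE stores EEG data in volts)
    data = rng.standard_normal(size=(n_channels, n_samples)) * 5e-6
    info = create_info(n_channels, sfreq, "eeg")
    return RawArray(data, info)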

It is important that there are sensible defaults for all parameters, which makes it possible to generate toy data very quickly:

raw = create_toy_data()

If people think this would be useful, I can go ahead and submit a PR.

Of course, this function could have a lot of additional parameters, such as

  • the kind of generated data (raw, epochs, evoked),
  • the probability distribution to sample from,
  • the channel type,
  • the data scaling,
  • ...

However, I'd say YAGNI until someone really needs a particular feature.

If there is interest, I have two questions:

  1. Where should this function live? I'd put it in mne.misc and export it to mne, but there might be a better place.
  2. The recommended method for generating random numbers is numpy.random.default_rng(seed), but the seed is not compatible with how we are dealing with random state (check_random_state()). If the function gets a seed or random_state parameter, how should we handle this?

So – yay or nay?

cbrnr avatar Jul 25 '22 11:07 cbrnr

I love the idea of adding such a function!

  1. Where should this function live? I'd put it in mne.misc and export it to mne, but there might be a better place.

IMHO misc and utils namespaces shouldn't even exist; they're only an invitation to dump arbitrary stuff there without ever cleaning it up, and the naming is not explicit, so it doesn't help users either.

The mne namespace is overloaded already, I'd avoid exposing it there.

Why not simply add a new module & namespace, mne.toydata? Or even mne.datasets.toydata, idk?

  2. The recommended method for generating random numbers is numpy.random.default_rng(seed), but the seed is not compatible with how we are dealing with random state (check_random_state()). If the function gets a seed or random_state parameter, how should we handle this?

check_random_state() already handles RandomState instances, doesn't that suffice?

hoechenberger avatar Jul 25 '22 12:07 hoechenberger

Some thoughts off-the-cuff:

  • I'm hesitant to make such a thing public because it invites a lot of user bikeshedding based on divergent use cases
  • misc seems fine to me, but if that's objectionable the simulation module seems better than creating a new one
  • IMO the first arg should be a string saying what kind of object you want, and should support raw, epochs, evoked, and stc

drammock avatar Jul 25 '22 13:07 drammock

To me this seems very close to what you would get with add_noise(raw, cov) for an empty raw and cov=mne.make_ad_hoc_cov(...) after some suitable call to mne.simulation.simulate_raw.

I'd rather extend these existing functions than add anything new

larsoner avatar Jul 25 '22 13:07 larsoner

I'm fine with extending existing functions, and the mne.simulation module seems like a good place for that functionality. Initially I thought because there is mne.create_info(), an mne.create_toy_data() would be the expected/consistent place.

check_random_state() already handles RandomState instances, doesn't that suffice?

But numpy.random.default_rng() (which returns a numpy.random.Generator) unfortunately does not accept a RandomState instance as its seed.
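One possible way around this, sketched with a hypothetical helper `as_generator` (not an MNE or NumPy function): pass an existing Generator through unchanged, derive an integer seed from a legacy RandomState, and hand everything else to default_rng().

```python
import numpy as np


def as_generator(seed=None):
    """Hypothetical helper: coerce several seed types to a numpy Generator."""
    if isinstance(seed, np.random.Generator):
        # Already a Generator, so use it as-is
        return seed
    if isinstance(seed, np.random.RandomState):
        # Draw one integer from the legacy stream and use it as the seed.
        # Note that this advances the state of the RandomState instance.
        return np.random.default_rng(seed.randint(0, 2**31 - 1))
    # None, int, sequence of ints, SeedSequence, or BitGenerator
    return np.random.default_rng(seed)


rng = as_generator(np.random.RandomState(0))
data = rng.standard_normal(size=(3, 10))
```

The result is reproducible for a given RandomState, but the numbers differ from what the same RandomState would have produced directly, since the two classes use different algorithms.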

I'm hesitant to make such a thing public because it invites a lot of user bikeshedding based on divergent use cases

That's why I'd intentionally keep it simple.

Re extending mne.simulation.simulate_raw(), I like the idea, but the point of my proposed function is that you can just call it without setting any parameters, and you get some reasonable toy data. Not sure how this could be handled, do you mean we could add my proposed function there and just call already existing functions? Or do you really mean you'd rather not add any new function?

cbrnr avatar Jul 25 '22 14:07 cbrnr

Or do you really mean you'd rather not add any new function?

I'd rather not add any new function. What you propose seems 90%+ like an existing function. No need to make something new to do almost the same thing just to save someone having to set one or two parameters of the function

larsoner avatar Jul 25 '22 14:07 larsoner

I'd rather not add any new function. What you propose seems 90%+ like an existing function. No need to make something new to do almost the same thing just to save someone having to set one or two parameters of the function

But what you proposed consists of several function calls, doesn't it? I agree that mne.simulation.simulate_raw() should be able to do what I want (actually, I don't even care if the data is EEG-like or random, I only need the right data structure with a given length and number of channels). I'm not sure if it is actually easier to adapt mne.simulation.simulate_raw() to generate toy data without needing to specify STCs and whatnot than to add a small new function (maybe mne.simulation.toy_raw()?).

cbrnr avatar Jul 25 '22 14:07 cbrnr

Could we give all the simulate_xxx functions default arguments so that they do the job?

agramfort avatar Jul 25 '22 15:07 agramfort

I'm not sure if it is a good idea to coerce simulate_raw() into not simulating and instead outputting some random data (with desired shape). I still think a separate function in that module would be the best solution. My latest name idea is generate_raw(). Or create_toy_raw(). The "toy" part is probably important to show that the signal is not EEG or something plausible.

And it would fit in either misc or simulation. Even data or utils would be possible IMO.

cbrnr avatar Aug 01 '22 12:08 cbrnr

Scikit-learn has datasets.make_*() for this purpose BTW.

cbrnr avatar Aug 01 '22 12:08 cbrnr

Scikit-learn has datasets.make_*() for this purpose BTW.

I like this. This or simulation.

hoechenberger avatar Aug 01 '22 13:08 hoechenberger

To me I think we should still just make our existing functions better -- so far what you've described @cbrnr is in my mind just a 2- or 3-line wrapper around existing functions. There are lots of potential ways to make our existing functions easier to use.

For example maybe support data=<int> in RawArray to mean "give me an array of zeros for all channels of this many samples". Then your use case is

info = create_info(...)  # I think in any API you're going to need this line
raw = mne.simulation.add_noise(RawArray(10000, info), ...)

The bonus of this API is you can do things like

evoked = mne.simulation.add_noise(EvokedArray(10000, info), ...)

etc. immediately because we already have these other classes, and add_noise knows how to deal with them.

larsoner avatar Aug 01 '22 14:08 larsoner

I thought you meant extending mne.simulation.simulate_raw(). I like adapting mne.io.RawArray to give an empty array, but how would you handle this with mne.EpochsArray? Pass a 2D array with (n_epochs, n_channels)?

I think it depends on how many people would use this functionality. Many things could be done with existing functions to a certain degree, but at some point it might make sense to put it into a dedicated function.

cbrnr avatar Aug 01 '22 14:08 cbrnr

I'd like to revive this issue. A function to create some toy data (not simulated data) would be extremely useful for me (and probably others), because I need this in almost any MWE. And as @larsoner said, of course it is just a wrapper around existing functions, but it takes at least 7 lines rather than 2 or 3 (you need the imports).

I think there was at least some consensus for mne.datasets.make_toy_*()?

cbrnr avatar Nov 24 '23 17:11 cbrnr

Would you start making use of this function in our tests? It might be a concrete opportunity to use it and also reduce our number of lines.

agramfort avatar Nov 28 '23 09:11 agramfort

Would you start making use of this function in our tests? It might be a concrete opportunity to use it and also reduce our number of lines.

Yes, this would very likely lead to shorter tests. I'll have to investigate a little, but I cannot do it right now. I just wanted to make sure that there is still interest, or if we can close this issue.

cbrnr avatar Nov 28 '23 10:11 cbrnr

I'm interested in seeing something like this happen. My use case is mostly for MWEs though: i.e., when debugging user problems from the forum or demonstrating how to do things. I want to avoid having to write

sample_data_folder = mne.datasets.sample.data_path()
sample_data_raw_file = sample_data_folder / "MEG" / "sample" / "sample_audvis_raw.fif"
raw = mne.io.read_raw_fif(sample_data_raw_file, verbose=False, preload=False)

just to try out what the user says isn't working. Other times random data is good enough too. So I think my ideal would be something like:

from __future__ import annotations
from mne import Info

# proposed signature only; the function would live at mne.simulation.example
def example(
    kind: str = "raw",  # can add "epochs", "evoked", "spectrum", "stc", "tfr" if needed
    data: str = "random",  # or "sample" to use sample dataset, add other dataset wrappers if needed
    info: Info | None = None,  # if data="random" you can provide an info if you don't like built-in defaults
): ...

name of the function doesn't matter much to me; could be example or make_example or make_example_data or make_toy_data or fake_object or whatever.

FWIW I don't actually expect this to help all that much in our test suite, since nowadays we have fixtures for raw, epochs, evoked, spectrum, and (I think?) stc. There will be some tests that might benefit from this, but then again it might also be possible to update them to use the fixtures instead.

drammock avatar Nov 29 '23 17:11 drammock

My use case is mostly for MWEs though

This was also my initial motivation for adding such a function, but then the discussion was mainly about extending functionality of available functions.

cbrnr avatar Nov 29 '23 18:11 cbrnr