Improve noise level machinery
- Add machinery to compute noise levels in parallel
- Add `get_random_recording_slices()` to implement more future strategies for random chunks (e.g. non-overlapping, regular, uniform...); see the sketch below
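For illustration only, here is a minimal sketch of what a random-slice picker of this kind could look like. This is not the code added by the PR; the function name and parameters (`chunk_size`, `num_chunks`, `seed`) are assumptions chosen for the example:

```python
import numpy as np

def pick_random_slices(num_samples, chunk_size=10_000, num_chunks=20, seed=None):
    """Toy 'full random' strategy: return (start_frame, end_frame) pairs drawn
    uniformly inside one recording segment. Other strategies (regular,
    non-overlapping, ...) could plug in here."""
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, num_samples - chunk_size, size=num_chunks)
    return [(int(s), int(s) + chunk_size) for s in np.sort(starts)]

# e.g. 20 random 10k-sample chunks inside a 10 min segment sampled at 30 kHz
slices = pick_random_slices(30_000 * 600, seed=2205)
```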
A very important change is that the default is now `seed=None` (instead of `seed=0`) in these functions, which I think is the right way: the seed must be explicit, not implicit. The consequence is:
- all tests that run `get_random_data_chunk()` twice (sometimes this is hidden) are no longer guaranteed to give the same results. The solution is to seed explicitly everywhere, which is good practice (see the example below).
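For example, a test that relies on reproducible chunks now has to pass the seed itself. The snippet below uses `get_random_data_chunks` from `spikeinterface.core` with its current keyword names; the exact arguments may differ slightly after this PR:

```python
import numpy as np
from spikeinterface.core import generate_recording, get_random_data_chunks

rec = generate_recording(durations=[10.0], num_channels=4)

# with the new default seed=None, two calls are not guaranteed to match,
# so reproducible tests must seed explicitly
chunks_a = get_random_data_chunks(rec, num_chunks_per_segment=20, chunk_size=1_000, seed=2205)
chunks_b = get_random_data_chunks(rec, num_chunks_per_segment=20, chunk_size=1_000, seed=2205)
np.testing.assert_array_equal(chunks_a, chunks_b)
```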
@yger
@cwindolf : have a look at this, it is a first step towards better noise level estimates in SI.
Looks cool! @oliche 's strategy could be implemented here now.
- Will this not fail with formats that lock IO access to the same region if the chunks overlap?
Really good point! The access is read-only. We will be able to add a non-overlapping option in the random slices; a possible approach is sketched below.
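A non-overlapping option could, for instance, partition the segment into disjoint chunk-sized bins and sample bins without replacement. A rough sketch (not the PR implementation, names are illustrative):

```python
import numpy as np

def pick_non_overlapping_slices(num_samples, chunk_size=10_000, num_chunks=20, seed=None):
    """Partition the segment into disjoint chunk-sized bins, then sample bins
    without replacement so no two slices can overlap."""
    rng = np.random.default_rng(seed)
    n_bins = num_samples // chunk_size
    chosen = rng.choice(n_bins, size=min(num_chunks, n_bins), replace=False)
    return [(int(b) * chunk_size, (int(b) + 1) * chunk_size) for b in np.sort(chosen)]
```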
- I am still confused about why computing noise requires so many samples. The methods we use assume normality (we use MAD to estimate the std), but then we sample far more than the convergence criteria of normal distributions would naively suggest. What gives? Is there some empirical work on this? Now that we have a lot of open data available, estimating sampling requirements for a variety of neural data (species, areas, etc.) could be done. It seems to me this could be a quick and informative paper that we could put out for the community if there is no previous work.
Honestly, I was pretty sure that the number of samples we used was enough. After discussion with @cwindolf I get the impression that we should have more... Charlie, any comment?
Yeah... in my experience, more blocks help to stabilize the estimate (say we want numbers within x% of each other across runs with different seeds). The data certainly is not Gaussian, it has spikes, and spike activity can vary wildly across a recording. So with very few blocks, they will by chance disproportionately land in higher- or lower-activity regions (maybe in different ways across channels). You need a good number of blocks to reduce that effect -- for short or very consistent recordings, fewer blocks may be fine (see the toy illustration below).
Also, it would be cool if si.zscore() and the other normalize_scale stuff could use these tools :)
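A toy illustration of that effect in plain numpy (nothing spikeinterface-specific): when the local noise scale varies across the recording, the MAD-based estimate built from a handful of chunks fluctuates much more across seeds than one built from many chunks:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 30_000
n = fs * 600                                   # 10 minutes, single channel
# non-stationary "activity": the local noise scale drifts between ~8 and ~12
scale = 10.0 + 2.0 * np.sin(np.linspace(0, 8 * np.pi, n))
trace = rng.normal(0, 1.0, size=n) * scale

def mad_noise(trace, num_chunks, chunk_size, seed):
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, trace.size - chunk_size, size=num_chunks)
    data = np.concatenate([trace[s:s + chunk_size] for s in starts])
    return 1.4826 * np.median(np.abs(data - np.median(data)))

for num_chunks in (5, 20, 100):
    estimates = [mad_noise(trace, num_chunks, 10_000, seed) for seed in range(20)]
    # relative spread across seeds shrinks as the number of chunks grows
    print(num_chunks, round(float(np.std(estimates) / np.mean(estimates)), 4))
```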
But if the data is not Gaussian, would that mean that using MAD to estimate the std is wrong? The conversion assumes normality:
https://en.wikipedia.org/wiki/Median_absolute_deviation (see the "Relation to standard deviation" section)
Anyway, if your experience is that more samples stabilize the estimator, then I think that trumps these considerations.
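For reference, the conversion in question is sigma ≈ MAD / Phi^-1(3/4) ≈ 1.4826 * MAD, and it is exact only for Gaussian data. A quick check:

```python
import numpy as np
from scipy.stats import norm

k = 1.0 / norm.ppf(0.75)                       # ≈ 1.4826, the Gaussian consistency constant
x = np.random.default_rng(0).normal(0, 1, 1_000_000)
mad = np.median(np.abs(x - np.median(x)))
print(k, k * mad)                              # k * MAD ≈ 1.0 only because x really is Gaussian
```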
Yeah, it's wrong! But I don't have any better ideas. Ideally one would be able to subtract away all of the spikes and then MAD the residuals (which would ideally be only Gaussian noise, but even that is not 100% true...), but that requires sorting, which requires some kind of standardization...
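Just to make the idea concrete, here is a synthetic sketch of the "subtract the spikes, then MAD the residual" approach, with the spike times and template taken as known (in practice they would have to come from a sorting); everything here is a toy stand-in, not spikeinterface code:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 30_000
noise_std = 10.0
trace = rng.normal(0, noise_std, size=fs * 60)             # 1 min of Gaussian noise

# plant spikes with a known template at known frames (stand-in for a sorting result)
template = -80.0 * np.exp(-0.5 * ((np.arange(60) - 20) / 6.0) ** 2)
spike_frames = rng.choice(trace.size - template.size, size=3_000, replace=False)
for f in spike_frames:
    trace[f:f + template.size] += template

def mad(x):
    return 1.4826 * np.median(np.abs(x - np.median(x)))

residual = trace.copy()
for f in spike_frames:                                      # subtract the spikes back out
    residual[f:f + template.size] -= template

# the raw MAD is pulled up by the spikes; the residual MAD recovers ≈ noise_std
print(mad(trace), mad(residual), noise_std)
```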
Agree on the limitation. Thanks for answering my questions.
@samuelgarcia can you fix the tests? There are some concatenated recordings that trigger some errors.