pyedflib "sample_frequency" is a misleading name for the number of values in a signal

trafficstars

Hi, since v0.1.23 the getSampleFrequency method has changed the value it returns from the actual sample rate (aka the number of samples per second) to the number of samples in the signal.

This is also documented in code. I don't know the reason why it changed but there probably was a legit reason.

Now when I used the library I stumbled upon this and tracked down the value coming from getSampleFrequency to this line: https://github.com/holgern/pyedflib/blob/master/pyedflib/_extensions/_pyedflib.pyx#L358 which says:

            return <double>self.hdr.signalparam[channel].smp_in_datarecord

So the actual C edflib interface calls this not sample frequency but smp_in_datarecord, which is a much more appropriate name IMO. I really struggled with my code and thought: Well I know I have a sample rate of 500 Hz in my data, but now the file says it is 5000 (my signals are 10 seconds), am I producing the wrong files? (sidenote, I only read BDF files with pyedflib, I'm not writing them with this lib)

This confusion IMO comes solely from the name. I'm happy to calculate the frequency (or rate or however you would like to call it) myself, but I'm 100% sure that someone else later in the code will see my calculation and say "hey there is this getSampleFrequency method provided out of the box, let's use this".

What do you think about changing the name of the attribute altogether and removing any ambiguities?

Nov 18 '21 12:11 gregmuellegger

I just found Py-EDFLib, which seems to be doing the old calculation (e.g. seeing frequency of a number occurence per time interval): https://gitlab.com/Teuniz/EDFlib-Python/-/blob/master/src/EDFlib/edfreader.py#L509

Nov 18 '21 12:11 gregmuellegger

I stumbled upon this issue as I fell for the same problem after an update to 0.1.23 I think @gregmuellegger 's point is valid. I looked it up in the official spec (which I luckily never had to bother with so far) and there this attribute in the header is called nr of samples in each data record too.
IMO getSampleFrequency now does not anymore do what its name advertises.

Nov 18 '21 19:11 hojo0590

This is related to #121 and #117

I totally understand the confusion about the two terms, and I agree that it should be made clearer. In my opinion, sample frequency is colloquially understood as samples per second, and should also in pyedflib be treated as such. We need to think about how to implement this without making breaking changes again.

The underlying C-library edflib names it sample_frequency but means smp_per_record. At the moment we call edf_set_samplefrequency which simply sets the smp_in_datarecord parameter in the C code. I think we should deviate from the way edflib names it and use SampleFrequency the way it is usually understood (i.e. per second) and additionally make a method to set/get the smp_per_record attribute. However, there is also datarecord_duration, and I need to look further into how these two are actually combined when writing/reading the file. Nevertheless, the two should be set-able independently.

Current goal should be:

getSampleFrequency should return the calculated sample frequency (smp/second) per channel
smp_per_record and record_duration should be calculated in the background accordingly.

Additionally, these changes need to be accurately reflected when writing new files that have a record length of !=1second, and in highlevel.write_edf the block_size parameter should be renamed to accordingly reflect this change.

@holgern what are your thoughts on this?

Jan 04 '22 10:01 skjerns

My two cents: In my library, edflib, I did the calculation to give the conventional use of sample frequency

def samplefrequency(self, channel):
        return (<double>self.hdr.signalparam[channel].smp_in_datarecord / self.hdr.datarecord_duration) * EDFLIB_TIME_DIMENSION

Jun 24 '22 23:06 cleemesser

@cleemesser yes, that is also how we will calculate it once #159 is reviewed and merged

btw: I still encourage you to join in on pyedflib instead of working on your fork :) I think pooling resources will be much more efficient than working on two separate projects

Jun 25 '22 10:06 skjerns

Thank you @skjerns for taking time and effort to fix this. 👏

Mar 14 '23 06:03 gregmuellegger

pyedflib pyedflib copied to clipboard

"sample_frequency" is a misleading name for the number of values in a signal

pyedflib
pyedflib copied to clipboard