
training and predicting events with multi-channel data

Open avakiai opened this issue 1 year ago • 1 comment

Hi Jan!

I hope all is well!

I am currently working on a multi-channel dataset for which I would like to use DAS to predict vocalization events, and I wanted to ask a few questions about (a) how DAS deals with multiple channels under the hood, and (b) how best to organize multi-channel data/annotations to maximize prediction performance.

The data in brief: vocalizations were recorded from pairs of animals (bats) in a small enclosure using two microphones, each of which was pointed toward its focal animal and away from its non-focal animal. I now want to detect all vocalizations across both channels and then assign detected vocalizations to one of the two channels/animals. Supposing detections are "clean", this latter part should be relatively simple, since it is usually quite clear from the oscillogram amplitude which vocalization came from which animal.

What I am not sure about is how best to organize this data for detection with DAS.

(a) DAS x multi-channel data

[Training]

1a. Given that there's currently no option (as far as I know) to assign annotations to specific channels, how does DAS use annotations from multi-channel data to predict events? I.e., are samples within the annotations collapsed over all channels or considered separately?

1b. How do the number of TCN blocks and the separable TCN block parameters affect this behavior? E.g. settings such as nb_conv = 4, use_separable=[True, True, False, False]?

[Predicting]

2. Does DAS predict events for each channel separately (treating them as if they are essentially distinct recordings), or does it combine channel information in some way?

(b) Multi-channel data x DAS

A few points about my data:

  - High-amplitude calls will appear on both channels, but the source channel is usually easy to determine from the amplitude profile.
  - Low-amplitude calls from one animal typically appear only on one channel (the one corresponding to that animal's focal mic).
  - Calls that appear on both channels often appear on the "non-focal" channel at a slight temporal delay (1-2 ms?).
  - A problem arises because some calls (e.g. some echolocation calls) are so short that their appearances on the two channels have little to no temporal overlap (in addition to altered amplitude and frequency profiles).

Given this, I wonder whether it would be "better for DAS" to split data and annotations by channel, or to provide multi-channel data with the combined annotations (either by annotating everything or only those calls that "belong" to that channel). I fear that splitting the channels would generate a lot of redundant detections and possible mislabelings that may be hard to post-process with high reliability.

In short, my question is:

  1. If multiple channels provide distinct but redundant information about audio events, would it help the model to "see" all this information at once, or would it be best to provide only single-channel data+annotations relating to only that channel?

Attempts:

What I have done so far is train a model on the multi-channel data in which I only labelled the calls as they appeared on the "correct" channel. My logic and hope (crucially dependent on how DAS treats multi-channel data) is that DAS may learn to ignore events that appear altered and at a delay on the other channel, and will give preference to the higher-SNR, temporally leading event.

Data was prepared with a 0.6:0.2:0.2 train:val:test split.

Model params:

mparams = dict(nb_hist = 8192, 
               pre_nb_conv = 4,               
               nb_filters = 32, 
               kernel_size = 32,
               learning_rate = 0.001,
               nb_conv = 4, 
               use_separable=[True, True, False, False],
               nb_epoch = 400,
               reduce_lr = True, 
               reduce_lr_patience = 10, 
               seed = None) 
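For reference, the dictionary goes into training roughly like this (I am sketching the das.train.train call from memory; the data and save paths are placeholders):

```python
import das.train

# 'mparams' is the parameter dict defined above.
das.train.train(model_name='tcn',
                data_dir='bat_dataset.npy',   # dataset folder exported from the DAS GUI (placeholder name)
                save_dir='bat_models',        # where checkpoints and results go (placeholder name)
                **mparams)
```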

Model history (not great): [training history plot attached]

Classification report for the annotation classes echo, highfreq, lowfreq, and isolation (the model was also trained on syllable onsets and offsets):

| class           | precision | recall | f1-score | support   |
| --------------- | --------- | ------ | -------- | --------- |
| echo            | 0.000     | 0.000  | 0.000    | 5,681     |
| highfreq        | 0.000     | 0.000  | 0.000    | 3,827     |
| isolation       | 0.965     | 0.743  | 0.840    | 109,281   |
| lowfreq         | 0.430     | 0.241  | 0.309    | 11,162    |
| noise           | 0.983     | 0.994  | 0.989    | 5,820,168 |
| syllable_onset  | 0.413     | 0.277  | 0.332    | 72,500    |
| syllable_offset | 0.383     | 0.351  | 0.366    | 72,229    |
| macro avg       | 0.453     | 0.372  | 0.405    | 6,094,848 |
| weighted avg    | 0.966     | 0.971  | 0.968    | 6,094,848 |

Overall accuracy: 0.971

Would you have any advice on how to tune this model up, or whether to try a different tack altogether?

I know this is highly specific, but I reckoned that information about how DAS handles multi-channel data may help other users with similar data.

Thank you so much!

Ava

avakiai · Feb 29 '24 18:02

Hi Ava!

> [Training] 1a. Given that there's currently no option (as far as I know) to assign annotations to specific channels, how does DAS use annotations from multi-channel data to predict events? I.e., are samples within the annotations collapsed over all channels or considered separately? 1b. How do the number of TCN blocks and the separable TCN block parameters affect this behavior? E.g. settings such as nb_conv = 4, use_separable=[True, True, False, False]

You are correct, annotations are not assigned to specific channels. The network gets the raw, multi-channel data as input and learns to combine the audio information across channels during training. The number of TCN blocks has nothing to do with how multi-channel data is processed. Whether TCN blocks are separable or not does to some extent: a "separable" TCN block first applies a temporal filter to each channel separately, and then combines the multi-channel data in a separate step. By contrast, in a regular "non-separable" TCN block, the filters have both a temporal and a channel component and require many more parameters, since the temporal component of the filter can differ across channels. This is similar to space-time separable filters in vision. Separable TCN blocks are therefore a way to reduce the number of parameters in the network without losing much representational capacity. Setting the first 1-2 TCN blocks to separable is a good idea.
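To make the parameter savings concrete, here is a small standalone Keras sketch (just an illustration of the separable idea, not DAS's exact block implementation) for a 2-channel input:

```python
import tensorflow as tf

n_channels, kernel_size, nb_filters = 2, 32, 32
inp = tf.keras.Input(shape=(None, n_channels))

# Regular conv: every filter has a joint (kernel_size x n_channels) kernel.
regular = tf.keras.layers.Conv1D(nb_filters, kernel_size)(inp)

# Separable conv: one temporal filter per input channel (depthwise step),
# followed by a pointwise 1x1 conv that mixes the channels.
separable = tf.keras.layers.SeparableConv1D(nb_filters, kernel_size)(inp)

print(tf.keras.Model(inp, regular).count_params())    # 32*2*32 + 32     = 2080
print(tf.keras.Model(inp, separable).count_params())  # 32*2 + 2*32 + 32 = 160
```

The gap grows with the number of channels and filters; with 32-channel inputs, as in the deeper blocks, the same comparison gives roughly 33k vs 2k parameters per conv.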

> [Predicting] 2. Does DAS predict events for each channel separately (treating them as if they are essentially distinct recordings) or does it combine channel information in some way?

See above - information is combined across channels inside the network, so the predictions are over all channels.
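As a shape-level sketch (hypothetical sizes, not the exact DAS API):

```python
import numpy as np

# Hypothetical sizes, for illustration only -- not the exact DAS API.
n_samples, n_channels = 1_000_000, 2
x = np.zeros((n_samples, n_channels), dtype=np.float32)  # both mics enter the network together

# A trained network maps this to a single matrix of class probabilities,
# one row per time step, shared across channels -- not one track per microphone.
n_classes = 7                                 # e.g. noise + 4 call types + onset/offset
expected_output_shape = (n_samples, n_classes)
print(expected_output_shape)
```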

> If multiple channels provide distinct but redundant information about audio events, would it help the model to "see" all this information at once, or would it be best to provide only single-channel data+annotations relating to only that channel?

I would try both approaches: train a multi-channel network and train single-channel networks. As you said, the multi-channel network has the advantage that you do not have to deal with merging potentially conflicting annotations across channels, and it sounds like assigning events to channels could be done based on amplitude. The single-channel networks might be easier to train - in particular, since DAS does not need to learn that signals can appear on the two channels separately. Merging the detections then might be a pain, but you could use the confidence values to clean them up.
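For the amplitude-based channel assignment, something along these lines could work as a post-processing step (a rough sketch with made-up variable names; the onsets/offsets would come from the predicted events and x is the raw two-channel recording):

```python
import numpy as np

def assign_channel(x, onsets, offsets):
    """Assign each detected event to the channel with the larger RMS amplitude.

    x       : (n_samples, n_channels) raw audio
    onsets  : event start indices in samples (e.g. derived from the predictions)
    offsets : event stop indices in samples
    """
    winners = []
    for on, off in zip(onsets, offsets):
        rms = np.sqrt(np.mean(x[on:off] ** 2, axis=0))  # per-channel RMS in the event window
        winners.append(int(np.argmax(rms)))
    return np.array(winners)

# Toy example: a loud call on mic 0 with a fainter, slightly delayed copy on mic 1.
fs = 250_000
x = np.random.randn(fs, 2) * 0.01
call = np.sin(np.linspace(0, 100 * np.pi, 1000))
x[1000:2000, 0] += call          # focal mic
x[1375:2375, 1] += 0.3 * call    # non-focal mic, ~1.5 ms later, lower amplitude
print(assign_channel(x, onsets=[1000], offsets=[2375]))  # -> [0]
```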

Hope any of this makes sense. Happy to help if you have more questions!

postpop · Mar 11 '24 14:03