
Possible SpecAugment issue

Open videodanchik opened this issue 2 years ago • 4 comments

I have a question about the current SpecAugment implementation. According to the lhotse code, time warping is applied only to the "true" feature regions and excludes the "padded" regions, i.e.

# Supervisions provided - we will apply time warping only on the supervised areas.
for sequence_idx, start_frame, num_frames in supervision_segments:
    end_frame = start_frame + num_frames
    features[sequence_idx, start_frame:end_frame] = self._forward_single(
        features[sequence_idx, start_frame:end_frame], warp=True, mask=False
    )
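
(For context, each row of `supervision_segments` in the loop above is `(sequence_idx, start_frame, num_frames)` — a toy sketch with made-up values:)

```python
# Toy supervision_segments with the (sequence_idx, start_frame, num_frames)
# layout iterated above; the values here are made up for illustration.
supervision_segments = [(0, 0, 100), (1, 10, 60)]

# The region of each sequence that gets time-warped:
regions = [(seq_idx, start, start + num)
           for seq_idx, start, num in supervision_segments]
# regions == [(0, 0, 100), (1, 10, 70)]
```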

but later on, the masking step is applied to the whole feature matrices, including the "padded" regions:

# ... and then time-mask the full feature matrices. Note that in this mode,
# it might happen that masks are applied to different sequences/examples
# than the time warping.
for sequence_idx in range(features.size(0)):
    features[sequence_idx] = self._forward_single(
        features[sequence_idx], warp=False, mask=True
    )

Masking along the time axis is fine, but when masking along the frequency axis we can end up masking the "padded" regions of the shorter segments in the batch. Is this intentional?
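
To make this concrete, here is a minimal pure-Python sketch (toy shapes and a hand-placed mask, not lhotse's actual masking code) of a frequency mask landing partly on zero padding:

```python
import copy

# One sequence padded to 10 frames x 4 mel bins; only the first 6 frames
# are real speech, the tail is zero padding (toy values).
num_frames, num_bins, real_frames = 10, 4, 6
features = [[1.0] * num_bins if t < real_frames else [0.0] * num_bins
            for t in range(num_frames)]
before = copy.deepcopy(features)

# A frequency mask over bins 1-2 is applied across *all* frames,
# padded tail included (mask position chosen by hand for illustration).
for frame in features:
    frame[1] = 0.0
    frame[2] = 0.0

# On the padded frames the masked bins were already zero, so the mask
# changes nothing there:
wasted = sum(1 for t in range(num_frames)
             if before[t][1] == 0.0 and before[t][2] == 0.0)
# wasted == 4: 4 of the 10 masked frames contribute nothing
```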

videodanchik avatar Mar 28 '22 22:03 videodanchik

Yeah, IIRC I left it like that because @danpovey thought it would either not make a difference or maybe help somehow, but I don't think we ever actually ran an experiment to compare with masking speech only.

pzelasko avatar Mar 29 '22 00:03 pzelasko

mm, my suspicion is the network will be ignoring those regions so it won't make a difference. We actually use attention masks (and recently, I am also masking in the convolution module) so those regions are going to be totally ignored, well once we apply the convolution mask.
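
(The padding mask being described can be sketched roughly like this — a hypothetical helper, not the actual model code; `True` marks frames the attention/convolution modules ignore:)

```python
def make_padding_mask(lengths, max_len):
    # True where the frame index falls past the sequence's real length.
    return [[t >= n for t in range(max_len)] for n in lengths]

# Batch of two sequences padded to 10 frames; the second has 6 real frames.
mask = make_padding_mask([10, 6], max_len=10)
# mask[0] is all False; mask[1] flags frames 6-9 as padding.
```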

danpovey avatar Mar 31 '22 03:03 danpovey

Ok, thanks for the reply. If I understand correctly, since the model ignores the "padded" regions, it makes sense to also ignore them during masking. Just to be clear, here is an example of the problem I'm describing.

Original Fbank from a batch with zero padding:

[image: original Fbank with zero padding]

Same Fbank after SpecAugment:

[image: the same Fbank after SpecAugment]

As you can see, after time warping we end up with a frequency mask overlapping the zero-padded region, which in fact makes the frequency masking useless in these cases. Moreover, this inconsistency also produces a small bug in the current SpecAugment implementation. The line:

_max_tot_mask_frames = self.max_frames_mask_fraction * features.size(0)

will produce the same maximum number of masked frames across the whole batch, as features.size(0) will be the same everywhere. This leads to incorrect num_frame_masks and max_mask_frames calculations. All of the above can be fixed by replacing

# Supervisions provided - we will apply time warping only on the supervised areas.
for sequence_idx, start_frame, num_frames in supervision_segments:
    end_frame = start_frame + num_frames
    features[sequence_idx, start_frame:end_frame] = self._forward_single(
        features[sequence_idx, start_frame:end_frame], warp=True, mask=False
    )
# ... and then time-mask the full feature matrices. Note that in this mode,
# it might happen that masks are applied to different sequences/examples
# than the time warping.
for sequence_idx in range(features.size(0)):
    features[sequence_idx] = self._forward_single(
        features[sequence_idx], warp=False, mask=True
    )

with

for sequence_idx, start_frame, num_frames in supervision_segments:
    end_frame = start_frame + num_frames
    features[sequence_idx, start_frame:end_frame] = self._forward_single(
        features[sequence_idx, start_frame:end_frame], warp=True, mask=True
    )

In this case you will end up with something like

[image: the same Fbank with the proposed fix applied]

where the masking is calculated and applied correctly. @pzelasko @danpovey, what do you think about this modification? It should probably be tested first, though, at least on LibriSpeech.
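
The budget bug can also be shown numerically (0.15 below is only an illustrative fraction, not necessarily the configured value): with the fix, the per-sequence mask budget follows each segment's real length instead of the shared padded length.

```python
max_frames_mask_fraction = 0.15  # illustrative value

def max_tot_mask_frames(num_frames):
    # mirrors _max_tot_mask_frames = self.max_frames_mask_fraction * features.size(0)
    return max_frames_mask_fraction * num_frames

padded_len = 1000          # every sequence in the batch is padded to this length
real_lens = [1000, 600]    # true number of frames per segment

# Current behaviour: the budget is identical for every sequence.
current = [max_tot_mask_frames(padded_len) for _ in real_lens]
# With the proposed fix: the budget tracks each segment's true length.
fixed = [max_tot_mask_frames(n) for n in real_lens]
# current == [150.0, 150.0]; fixed == [150.0, 90.0]
```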

videodanchik avatar Mar 31 '22 21:03 videodanchik

Sounds good to me! Would be great if you'd be able to run a proper comparison like you suggested.

pzelasko avatar Mar 31 '22 21:03 pzelasko