lhotse
Augmentation with Recording length change
I am encountering an issue when applying a length-changing augmentation to a `Recording`. Specifically, the `Speed` augmentation, which is expected to modify the number of samples in the audio, fails when the audio is loaded.
Steps to Reproduce
- Import Lhotse and necessary modules.
- Define an audio source and create a recording with specific attributes:

      from pathlib import Path

      from IPython.display import Audio

      import lhotse
      from lhotse.audio import Recording
      from lhotse.audio.source import AudioSource


      def play_record(record: Recording):
          # I use this because there is no play_audio method in Recording class
          return Audio(record.load_audio(), rate=record.sampling_rate)


      source = AudioSource(
          type="file",
          channels=[0],
          source=str(
              Path(lhotse.__file__).parent.parent
              / "test"
              / "fixtures"
              / "ljspeech"
              / "storage"
              / "LJ002-0020.wav"
          ),
      )
      record = Recording(
          id="LJ002-0020",
          sources=[source],
          sampling_rate=22050,
          num_samples=33949,
          duration=1.5396371882086168,
          transforms=None,
      )
      play_record(record)
- Apply the `ReverbWithImpulseResponse` augmentation to the `Recording` object. No problems occur; the augmentation works as expected:

      from lhotse.augmentation import ReverbWithImpulseResponse
      from lhotse.augmentation.utils import FastRandomRIRGenerator

      rir = FastRandomRIRGenerator()
      record.transforms = [
          ReverbWithImpulseResponse(rir_generator=rir).to_dict(),
      ]
      play_record(record)
- Apply the `Speed` transformation and get an exception when the audio is loaded:

      from lhotse.augmentation import Speed

      record.transforms = [Speed(1.1).to_dict()]
      record.load_audio()
    ValueError: The number of declared samples in the recording diverged from the one obtained when loading audio (offset=0.0, duration=None).
    This could be internal Lhotse's error or a faulty transform implementation.
    Please report this issue in Lhotse and show the following:
    diff=3086, audio.shape=(1, 30863),
    recording=Recording(id='LJ002-0020',
        sources=[AudioSource(type='file', channels=[0], source='/.../lhotse/test/fixtures/ljspeech/storage/LJ002-0020.wav')],
        sampling_rate=22050, num_samples=33949, duration=1.5396371882086168, channel_ids=[0],
        transforms=[{'name': 'Speed', 'kwargs': {'factor': 1.1}}])
    [extra info] When calling: Recording.load_audio(
        args=(
            Recording(id='LJ002-0020',
                sources=[AudioSource(type='file', channels=[0], source='/.../lhotse/test/fixtures/ljspeech/storage/LJ002-0020.wav')],
                sampling_rate=22050, num_samples=33949, duration=1.5396371882086168, channel_ids=[0],
                transforms=[{'name': 'Speed', 'kwargs': {'factor': 1.1}}]),
        )
        kwargs={}
    )
Expected Behavior
I expect the audio transformation to be applied successfully, altering the length of the recording as specified by the transformation parameters, and that I can play the transformed audio without errors.
Actual Behavior
I encounter the ValueError mentioned above when attempting to apply the "Speed" transformation or a custom transformation that alters the audio length.
Additional information
- There is no problem when using length-preserving transforms such as `Volume`.
- The problem also arises when implementing a custom augmentation by inheriting from the `AudioTransform` class.
Am I applying the augmentation to the `Recording` object correctly? I would like to be able to implement my own lazy augmentation by inheriting from the `AudioTransform` class.
Quick note on `play_record`: you can do `record.to_cut().play_audio()`.
The problem here is that you are adding a transform to the `Recording` which changes its `duration` and `num_samples` at the time of loading, but you have not made these changes in the manifest. If you look at the implementation of `perturb_speed`, you can see that we also update the number of samples and the duration when using the `Speed` transform.
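To make the mismatch concrete, here is a minimal sketch of the arithmetic, using the numbers from the error report above. The helper name `speed_perturbed_length` is hypothetical, and the rounding rule (nearest integer) is an assumption; Lhotse's own rounding may differ in edge cases.

```python
def speed_perturbed_length(num_samples: int, sampling_rate: int, factor: float):
    """Return (new_num_samples, new_duration) after changing speed by `factor`.

    Assumption: the output length is num_samples / factor, rounded to the
    nearest integer. This is an illustration, not Lhotse's implementation.
    """
    new_num_samples = int(num_samples / factor + 0.5)
    new_duration = new_num_samples / sampling_rate
    return new_num_samples, new_duration


# Values declared in the manifest from the report: 33949 samples at 22050 Hz.
new_num_samples, new_duration = speed_perturbed_length(33949, 22050, 1.1)
print(new_num_samples)          # 30863, the length in audio.shape=(1, 30863)
print(33949 - new_num_samples)  # 3086, the diff=3086 from the error message
```

The manifest still declares 33949 samples while the loaded audio has 30863, which is exactly the divergence the `ValueError` reports. Calling `perturb_speed` on the recording, rather than assigning `transforms` by hand, should keep the two in sync.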
It seems that a good solution would be to redesign the augmentation base class so that the job of recalculating `num_samples` is taken over by the `AudioTransform` subclass.
I don't think that's a good solution. The augmentation classes only work on the audio, not the associated metadata, and they should not modify the Recording object itself. That modification should be done from a member function of the Recording class.
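As a rough illustration of that division of responsibility, the sketch below uses a hypothetical `ToyRecording` dataclass (not Lhotse's `Recording`) whose member function both records the transform and updates the metadata, so the transform itself never touches the manifest. The rounding rule is an assumption.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict, List


@dataclass(frozen=True)
class ToyRecording:
    """Hypothetical, simplified stand-in for a recording manifest."""

    id: str
    sampling_rate: int
    num_samples: int
    transforms: List[Dict[str, Any]] = field(default_factory=list)

    @property
    def duration(self) -> float:
        # Derived from num_samples, so the two can never disagree.
        return self.num_samples / self.sampling_rate

    def perturb_speed(self, factor: float) -> "ToyRecording":
        # The manifest owns its metadata: it appends the transform description
        # and updates num_samples in one step, returning a modified copy.
        new_num_samples = int(self.num_samples / factor + 0.5)  # assumed rounding
        speed = {"name": "Speed", "kwargs": {"factor": factor}}
        return replace(
            self,
            num_samples=new_num_samples,
            transforms=self.transforms + [speed],
        )


original = ToyRecording(id="LJ002-0020", sampling_rate=22050, num_samples=33949)
faster = original.perturb_speed(1.1)
print(faster.num_samples, faster.transforms)
```

Returning a copy leaves the original manifest untouched, and a custom length-changing augmentation would follow the same pattern: a method on the recording that knows how the transform changes the length.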