Running into this error while trying to train on our data.
$ python df/train.py data-hdf5/dataset.cfg data-hdf5/ base_dir/
...
2023-12-06 02:40:53 | INFO | DF | Start train epoch 2 with batch size 1
thread 'DataLoader Worker 1' panicked at 'assertion failed: k <= self.len()', /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/slice/mod.rs:3420:9
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Aborted (core dumped)
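The panic message itself suggests re-running with RUST_BACKTRACE=1 to see where in the dataloader the assertion fires. A minimal sketch of doing that from Python, assuming the same command line as in the report above; the helper names `backtrace_env` and `run_with_backtrace` are hypothetical and not part of DeepFilterNet:

```python
import os
import subprocess
import sys


def backtrace_env(base_env=None, full=False):
    """Return a copy of the environment with Rust backtraces enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # RUST_BACKTRACE=1 prints a short backtrace on panic, "full" a verbose one.
    env["RUST_BACKTRACE"] = "full" if full else "1"
    return env


def run_with_backtrace(cmd, full=False):
    """Run `cmd` (a list of arguments) with Rust backtraces enabled."""
    return subprocess.run(cmd, env=backtrace_env(full=full)).returncode


# Example (assumed command line, copied from the report above):
#   run_with_backtrace([sys.executable, "df/train.py",
#                       "data-hdf5/dataset.cfg", "data-hdf5/", "base_dir/"])
```

Setting the variable directly in the shell before the training command has the same effect; the wrapper is only handy when the run is launched from a script.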
Same here. Could you find the cause?
In my case it seems to be related to the RIRs: with p_reverb=0.0 training runs normally, but with p_reverb=1.0 it gets stuck and is killed with the error message above. The trace looks normal:
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1275 | Sampled RIR .._.._guso_in24_rirs_train_recsourcedirectivityHA_right_recsourcedirectivityHA_right_07966.wav with shape [1, 20305]
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataloader:279 | Worker: Getting sample 270566 with seed 270566
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1219 | get_sample() idx 270566 with seed 270566, snr 5, gain -6
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_noise_fullband_Zy0goYEHPHU.wav with codec PCM
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1275 | Sampled RIR .._.._guso_in24_rirs_train_recsourcedirectivityHA_right_recsourcedirectivityHA_right_31149.wav with shape [1, 24812]
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_clean_fullband_read_speech_book_02509_chp_0002_reader_03315_40_seg_1.wav with codec PCM
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_noise_fullband_ay2X87w6Dxw.wav with codec PCM
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1275 | Sampled RIR .._.._guso_in24_rirs_train_recsourcedirectivityHA_right_recsourcedirectivityHA_right_11673.wav with shape [1, 98547]
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1275 | Sampled RIR .._.._guso_in24_rirs_train_recsourcedirectivityHA_right_recsourcedirectivityHA_right_55381.wav with shape [1, 6963]
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_noise_fullband_GAc5dEFDkac.wav with codec PCM
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_noise_fullband_lt7jAlr_Er0.wav with codec PCM
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_noise_fullband_SbJmk_6PVWg.wav with codec PCM
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataloader:279 | Worker: Getting sample 272373 with seed 272373
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1219 | get_sample() idx 272373 with seed 272373, snr 5, gain 0
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataloader:279 | Worker: Getting sample 365896 with seed 365896
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1219 | get_sample() idx 365896 with seed 365896, snr 5, gain 6
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_noise_fullband_VX2czCvwQG0.wav with codec PCM
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_noise_fullband_lZW6oaScJPc.wav with codec PCM
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:augmentations:555 | Augmentation RandClipping (c: 0.3069719)
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_noise_fullband_squeak_squeakyChair_Freesound_validated_379901_0.wav with codec PCM
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataloader:279 | Worker: Getting sample 28207 with seed 28207
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1219 | get_sample() idx 28207 with seed 28207, snr 40, gain 6
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_clean_fullband_german_speech_CC_BY_SA_4.0_249hrs_339spk_German_Wikipedia_16k_German_Wikipedia_Schlosspark_Nymphenburg_audio_48kHz_seg_7.wav with codec PCM
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_noise_fullband_door_Freesound_validated_406193_0.wav with codec PCM
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1275 | Sampled RIR .._.._guso_in24_rirs_train_recsourcedirectivityHA_right_recsourcedirectivityHA_right_24170.wav with shape [1, 49323]
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1275 | Sampled RIR .._.._guso_in24_rirs_train_recsourcedirectivityHA_right_recsourcedirectivityHA_right_31962.wav with shape [1, 35320]
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_noise_fullband_F0IYjZN8ojA.wav with codec PCM
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_noise_fullband_TG7zqe3C7yw.wav with codec PCM
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_noise_fullband_RG2sjK0Zsng.wav with codec PCM
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataloader:279 | Worker: Getting sample 17304 with seed 17304
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1219 | get_sample() idx 17304 with seed 17304, snr 0, gain 0
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataloader:279 | Worker: Getting sample 112016 with seed 112016
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1219 | get_sample() idx 112016 with seed 112016, snr 20, gain 0
2024-02-12 13:33:55 | TRACE | libdfdata.torch_dataloader:df:dataset:1071 | Loaded sample .._.._DNS-Challenge_datasets_fullband_clean_fullband_read_speech_book_04432_chp_0002_reader_10614_109_seg_2.wav with codec PCM
I cannot find anything weird. I assume the problem comes from the next RIR the dataloader is trying to load.
I have also checked all my RIRs one by one in Python, loading them with soundfile and running the following tests in numpy:
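To narrow down which sample or RIR was in flight before the crash, the trace can be summarized with a small script. This is only a sketch: the regular expressions are written against the TRACE lines quoted above and may not match other versions of the logger, and `summarize_trace` is a hypothetical helper name. Since several workers log concurrently, the last entries are candidates, not proof.

```python
import re

# Patterns follow the TRACE lines quoted above; other log versions may differ.
RIR_RE = re.compile(r"Sampled RIR (?P<name>\S+) with shape \[(?P<ch>\d+), (?P<n>\d+)\]")
SAMPLE_RE = re.compile(r"Worker: Getting sample (?P<idx>\d+) with seed (?P<seed>\d+)")


def summarize_trace(lines, tail=5):
    """Collect the last few requested sample indices and sampled RIRs."""
    samples, rirs = [], []
    for line in lines:
        m = SAMPLE_RE.search(line)
        if m:
            samples.append(int(m.group("idx")))
            continue
        m = RIR_RE.search(line)
        if m:
            # (file name, channels, length in samples)
            rirs.append((m.group("name"), int(m.group("ch")), int(m.group("n"))))
    return {"last_samples": samples[-tail:], "last_rirs": rirs[-tail:]}


# Usage, assuming the trace was saved to a file called train.log:
#   with open("train.log") as f:
#       print(summarize_trace(f))
```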
- shape [1, x], with x at least 100ms
- no NaNs
- sampling rate is correct
- values in [-1, 1] so no clipping
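For anyone who wants to repeat these checks, here is a rough sketch of them. It is not the exact script used here: it works on plain Python lists instead of numpy arrays so it has no dependencies, `check_rir` is a hypothetical name, the 48 kHz default is an assumption based on the fullband data, and the Inf check is an extra that is not in the list above.

```python
import math


def check_rir(channels, sample_rate, expected_sr=48000, min_ms=100.0):
    """Return a list of problems found in one RIR; an empty list means it passed.

    `channels` is a list of channels, each a list of float samples.
    `expected_sr` defaults to 48000, which is an assumption, not a DFN setting.
    """
    problems = []
    if len(channels) != 1:
        problems.append(f"expected 1 channel, got {len(channels)}")
    if sample_rate != expected_sr:
        problems.append(f"sample rate {sample_rate} != {expected_sr}")
    n = len(channels[0]) if channels else 0
    if n < sample_rate * min_ms / 1000.0:
        problems.append(f"only {n} samples, shorter than {min_ms} ms")
    flat = [x for ch in channels for x in ch]
    if any(math.isnan(x) or math.isinf(x) for x in flat):
        problems.append("contains NaN or Inf")
    elif flat and max(abs(x) for x in flat) > 1.0:
        problems.append("values outside [-1, 1]")
    return problems


# Usage with soundfile (assumed to be installed; not needed for the check itself):
#   import soundfile as sf
#   data, sr = sf.read(path, always_2d=True)  # data has shape (frames, channels)
#   print(path, check_rir(data.T.tolist(), sr))
```

Passing these checks only says the files themselves look sane; it does not rule out a problem in how the dataloader combines them with the speech and noise samples.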
Any ideas on what could be causing this?
Also, as pointed out by the OP, the bug can first appear in an epoch > 0, so it is probably related to particular combinations of speech and RIRs.
Apparently I was mistaken and it has nothing to do with the RIRs, since re-running the epoch without changing anything sometimes fixes it:
2024-03-16 02:48:11 | INFO | DF | Start train epoch 65 with batch size 38
2024-03-16 02:48:25 | INFO | DF | [65] [ 0/11514] | loss: 1.04959 | t_sample: 10.65434 | t_batch: 10.68427 | lr: 4.536E-04 | wd: 0.00565
thread 'DataLoader Worker 5' panicked at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/slice/mod.rs:3475:9:
assertion failed: k <= self.len()
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Killed
(dfn) ubuntu@op-mm-guestxr:~/enric/DeepFilterNet/DeepFilterNet$ python df/train.py d07_conf.cfg /home/ubuntu/Data/DFN/data/ /home/ubuntu/Data/DFN/D07_mb_srcrec_HA_left/
/home/ubuntu/enric/venvs/dfn/lib/python3.8/site-packages/df/io.py:9: UserWarning: torchaudio.backend.common.AudioMetaData has been moved to torchaudio.AudioMetaData. Please update the import path.
from torchaudio.backend.common import AudioMetaData
2024-03-16 09:07:16 | INFO | DF | Running on torch 2.2.0+cu121
2024-03-16 09:07:16 | INFO | DF | Running on host op-mm-guestxr
fatal: not a git repository (or any of the parent directories): .git
2024-03-16 09:07:16 | INFO | DF | Loading model settings of D07_mb_srcrec_HA_left
2024-03-16 09:07:35 | INFO | DF | Running on device cuda:0
2024-03-16 09:07:35 | INFO | DF | Initializing model deepfilternet3
2024-03-16 09:07:50 | INFO | DF | Found checkpoint /home/ubuntu/Data/DFN/D07_mb_srcrec_HA_left/checkpoints/model_65.ckpt with epoch 65
2024-03-16 09:07:51 | INFO | DF | Initializing dataloader with data directory /home/ubuntu/Data/DFN/data/
2024-03-16 09:07:57 | INFO | DF | Loading HDF5 key cache from .cache_d07_conf.cfg
2024-03-16 09:07:57 | INFO | DF | Running with learning rate scheduling
2024-03-16 09:08:02 | INFO | DF | Start train epoch 65 with batch size 38
2024-03-16 09:08:47 | INFO | DF | [65] [ 0/11514] | loss: 1.09686 | t_sample: 13.67483 | t_batch: 13.70243 | lr: 4.536E-04 | wd: 0.00565
2024-03-16 09:18:19 | INFO | DF | [65] [ 100/11514] | loss: 1.24818 | t_sample: 4.37480 | t_batch: 4.41834 | lr: 4.535E-04 | wd: 0.00565
2024-03-16 09:26:26 | INFO | DF | [65] [ 200/11514] | loss: 1.21831 | t_sample: 4.83193 | t_batch: 4.86833 | lr: 4.534E-04 | wd: 0.00565
@macso-vincent-russell could you find the cause?
Small update: the issue still persists when training with the Meta SoundSpaces RIR dataset, using the default DFN3 recipe but with p_reverb changed to 1.0:
2024-07-03 11:52:51 | INFO | DF | Start train epoch 0 with batch size 16
2024-07-03 11:53:39 | INFO | DF | [0] [ 0/27346] | loss: 10.75046 | t_sample: 4.41516 | t_batch: 4.42887 | lr: 1.000E-04 | wd: 1.000E-12
thread 'DataLoader Worker 11' panicked at 'assertion failed: k <= self.len()', /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/slice/mod.rs:3420:9
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Aborted (core dumped)
https://github.com/Rikorose/DeepFilterNet/issues/584