CUDA out of Memory Error
Hi Team,
Thanks for giving us features like K2 and Lhotse; they will play a big role in the times to come. I am currently running the CTC-based training framework on the CSJ corpus, and as soon as training reaches batch number 3 or 4 in the first epoch, CUDA memory runs out and the training process stops. Currently, the training framework is running on a single CUDA device, which has 11 GB of memory.
Is it possible to run the training on multiple CUDA devices?
Regards, Mohit
Great! If you could help us create an example for the CSJ corpus in Lhotse, it would be easier for us to help you debug. Right now we haven't fully debugged training on multiple CUDA devices (there is a hang). Dan
Hi,
Thanks for the reply. Yes, I can surely create a recipe for the CSJ corpus in Lhotse, similar to the other recipes. I can see that the durations of the CSJ .wav files range from 600 s to 5245 s (quite long), and I am preparing the CutSet from them; I think that will be a limiting factor for me.
Any idea how I can handle such varying recording lengths for training?
Regards, Mohit
I would assume that CSJ comes with some kind of segmentation information, so you don't have to train on entire wav files?
As Dan said, there will typically be segmentation information (represented in Lhotse as a SupervisionSet). After you create the CutSet, call .trim_to_supervisions() on it to have each cut represent a single segment.
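For reference, a minimal sketch of what that could look like (the manifest paths below are hypothetical; substitute whatever your CSJ data preparation writes out):
from lhotse import CutSet, load_manifest

# Hypothetical manifest paths produced by the (yet to be written) CSJ prep.
recordings = load_manifest("exp/data/recordings_train.json")
supervisions = load_manifest("exp/data/supervisions_train.json")

# Start with one cut per recording...
cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)

# ...then split into one cut per supervision segment, so training never sees a whole wav file.
cuts = cuts.trim_to_supervisions()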
Hi @danpovey , @pzelasko ,
Thanks very much for the response. I will try trim_to_supervisions() to check the results.
Thanks and Regards, Mohit
Hi @pzelasko ,
I have added .trim_to_supervisions() so that each cut represents a single segment. But unfortunately there is an assertion error in the validation check, raised from the following line: https://github.com/lhotse-speech/lhotse/blob/master/lhotse/dataset/speech_recognition.py#L122
assert (cut.start - 1e-5) <= supervision.start <= supervision.end <= (cut.end + 1e-5),
AssertionError: Cutting in the middle of a supervision is currently not supported for the ASR task. Cut ID violating the pre-condition: '218b7cc8-594f-4114-b2e7-67863db7f0ce'.
Can you find and print the cut with that ID (cut_set['218b7cc8-594f-4114-b2e7-67863db7f0ce'])? I wonder if this is a rounding error or something bigger.
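For example, something along these lines should do it (the path to the cuts manifest is hypothetical; use whichever file your data prep wrote out):
from lhotse import load_manifest

# Hypothetical path to the cuts manifest.
cut_set = load_manifest("exp/data/cuts_train.json.gz")

# A CutSet supports lookup by cut ID, so the offending cut can be inspected directly.
print(cut_set["218b7cc8-594f-4114-b2e7-67863db7f0ce"])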
Hi @pzelasko,
Yes, I printed the information for cut ID '218b7cc8-594f-4114-b2e7-67863db7f0ce'. Here is what I get:
"id": "218b7cc8-594f-4114-b2e7-67863db7f0ce",
"start": 0.32775,
"duration": 0.4866875,
"channel": 0,
"supervisions": [
{
"id": "R01M0278_0000295_0000733_sp0.9",
"recording_id": "R01M0278_sp0.9",
"start": 0.0,
"duration": 0.4866875,
"channel": 0,
"text": " \u5b87\u5b99+\u540d\u8a5e",
"speaker": "R01M0278"
}
],
"features": {
"type": "fbank",
"num_frames": 9180,
"num_features": 40,
"frame_shift": 0.01,
"sampling_rate": 16000,
"start": 0.0,
"duration": 91.80225,
"storage_type": "lilcom_hdf5",
"storage_path": "exp/data/fbank/train_all/feats-0.h5",
"storage_key": "47fb1d58-d402-4925-a06a-eec4f460041b",
"channels": 0
},
"recording": {
"id": "R01M0278_sp0.9",
"sources": [
{
"type": "file",
"channels": [
0
],
"source": "/home/sysadmin/CSJ_RAW/WAV/noncore/R01M0278.wav"
}
],
"sampling_rate": 16000,
"num_samples": 1468836,
"duration": 91.80225,
"transforms": [
{
"name": "Speed",
"kwargs": {
"factor": 0.9
}
}
]
},
"type": "Cut"
},
I am consistently finding that the cut starts in the middle of a supervision for other cuts as well (most of them, I would say).
For example, this one:
{
  "id": "6f320bf1-2132-421b-9f7d-a8218acb7066",
  "start": 79.54275,
  "duration": 3.1881875,
  "channel": 0,
  "supervisions": [
    {
      "id": "S05M0469_0087497_0091004_sp1.1",
      "recording_id": "S05M0469_sp1.1",
      "start": 0.0,
      "duration": 3.1881875,
      "channel": 0,
      "text": " \u79c1+\u4ee3\u540d\u8a5e \u304c+\u52a9\u8a5e/\u683c\u52a9\u8a5e \u5352\u696d+\u540d\u8a5e \u3059\u308b+\u52d5\u8a5e/\u30b5\u884c\u5909\u683c/\u9023\u4f53\u5f62 \u9803+\u540d\u8a5e \u304a\u30fc+\u611f\u52d5\u8a5e",
      "speaker": "S05M0469"
    }
  ],
  "features": {
    "type": "fbank",
    "num_frames": 73800,
    "num_features": 40,
    "frame_shift": 0.01,
    "sampling_rate": 16000,
    "start": 0.0,
    "duration": 737.9996875,
    "storage_type": "lilcom_hdf5",
    "storage_path": "exp/data/fbank/train_all/feats-5.h5",
    "storage_key": "b7cecc64-68d3-4c9a-83ca-f3a6ba720e64",
    "channels": 0
  },
  "recording": {
    "id": "S05M0469_sp1.1",
    "sources": [
      {
        "type": "file",
        "channels": [
          0
        ],
        "source": "/home/sysadmin/CSJ_RAW/WAV/noncore/S05M0469.wav"
      }
    ],
    "sampling_rate": 16000,
    "num_samples": 11807994,
    "duration": 737.999625,
    "transforms": [
      {
        "name": "Speed",
        "kwargs": {
          "factor": 1.1
        }
      }
    ]
  },
  "type": "Cut"
},
The cuts look correct to me. The "start" field has different semantics in a cut and in a supervision: for the cut, it is relative to the start of the recording; for the supervision, it is relative to the start of the cut. It seems the assertion in K2SpeechRecognitionDataset is incorrect; I will commit a fix later today.
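To make the coordinate systems explicit, here is a small illustration of the condition that should hold instead (just a sketch of the semantics, not the exact code of the fix):
# cut.start is an offset into the recording; supervision.start is an offset into the cut.
# A supervision fully contained in the cut therefore satisfies, up to rounding:
tolerance = 1e-5
for supervision in cut.supervisions:
    assert -tolerance <= supervision.start
    assert supervision.end <= cut.duration + tolerance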
Ok. Thanks.
I will try the training again after your fix.
Regards, Mohit
I merged it, please check out the latest Lhotse and try again:
pip uninstall lhotse
pip install git+https://github.com/lhotse-speech/lhotse
Hi,
I have re-run the training and it still fails at the same validation check, with the following error:
Traceback (most recent call last):
File "./ctc_train.py", line 408, in
AssertionError: Supervisions starting before the cut are not supported for ASR (sup id: A01M6710_0604926_0615368_sp1.1, cut id: 4fb131fa-272c-437b-a9bb-005671c25cb6)
and here is the cut for ID 4fb131fa-272c-437b-a9bb-005671c25cb6:
"id": "4fb131fa-272c-437b-a9bb-005671c25cb6",
"start": 559.4254375,
"duration": 5.3045625,
"channel": 0,
"supervisions": [
{
"id": "A01M6710_0604926_0615368_sp1.1",
"recording_id": "A01M6710_sp1.1",
"start": -9.4926875,
"duration": 9.49275,
"channel": 0,
"text": " \u3042\u306e\u30fc+\u611f\u52d5\u8a5e \u5f0a\u793e+\u540d\u8a5e/\u4e00\u822c \u306e+\u52a9\u8a5e/\u683c\u52a9\u8a5e \u3088\u3046+\u540d\u8a5e/\u975e\u81ea\u7acb/\u52a9\u52d5\u8a5e\u8a9e\u5e79 \u306b+\u52a9\u8a5e/\u526f\u8a5e\u5316 \u3053\u3046\u3044\u3046+\u9023\u4f53\u8a5e \u3044\u308d\u3093\u306a+\u9023\u4f53\u8a5e \u8077\u7a2e+\u540d\u8a5e/\u4e00\u822c \u304c+\u52a9\u8a5e/\u4e00\u822c/\u683c\u52a9\u8a5e \u8f09\u3063+\u52d5\u8a5e/\u81ea\u7acb/\u9023\u7528\u30bf\u63a5\u7d9a/\u4e94\u6bb5\u30fb\u30e9\u884c \u3066\u308b+\u52d5\u8a5e/\u975e\u81ea\u7acb/\u57fa\u672c\u5f62/\u4e00\u6bb5 \u30b5\u30a4\u30c8+\u540d\u8a5e/\u4e00\u822c \u3067+\u52a9\u8a5e/\u4e00\u822c/\u683c\u52a9\u8a5e \u30fc+\u540d\u8a5e/\u4e00\u822c/\u30fc <sp> \u63a1\u7528+\u540d\u8a5e/\u30b5\u5909\u63a5\u7d9a \u6210\u529f+\u540d\u8a5e/\u30b5\u5909\u63a5\u7d9a \u3057+\u52d5\u8a5e/\u81ea\u7acb/\u9023\u7528\u5f62/\u30b5\u5909\u30fb\u30b9\u30eb \u3066+\u52a9\u8a5e/\u63a5\u7d9a\u52a9\u8a5e \u3044\u308b+\u52d5\u8a5e/\u975e\u81ea\u7acb/\u57fa\u672c\u5f62/\u4e00\u6bb5 \u4f01\u696d+\u540d\u8a5e/\u4e00\u822c \u69d8+\u540d\u8a5e/\u63a5\u5c3e\u8f9e/\u4eba\u540d \u306f+\u52a9\u8a5e/\u4fc2\u52a9\u8a5e \u3069\u3046+\u526f\u8a5e/\u52a9\u8a5e\u985e\u63a5\u7d9a \u3057+\u52d5\u8a5e/\u81ea\u7acb/\u9023\u7528\u5f62/\u30b5\u5909\u30fb\u30b9\u30eb \u3066+\u52a9\u8a5e/\u63a5\u7d9a\u52a9\u8a5e \u3044\u308b+\u52d5\u8a5e/\u975e\u81ea\u7acb/\u57fa\u672c\u5f62/\u4e00\u6bb5 \u304b+\u52a9\u8a5e/\u526f\u52a9\u8a5e\uff0f\u4e26\u7acb\u52a9\u8a5e\uff0f\u7d42\u52a9\u8a5e \u3063\u3066+\u52a9\u8a5e/\u9023\u4f53\u5f62/\u683c\u52a9\u8a5e \u3068\u3053\u308d+\u540d\u8a5e/\u975e\u81ea\u7acb/\u526f\u8a5e \u3067\u3059+\u52a9\u52d5\u8a5e/\u57fa\u672c\u5f62/\u7279\u6b8a\u30fb\u30c7\u30b9 \u306d+\u52a9\u8a5e/\u7d42\u52a9\u8a5e <sp> \u3067+\u63a5\u7d9a\u8a5e \u30fc+\u540d\u8a5e/\u4e00\u822c/\u30fc \u306a\u3093\u304b+\u52a9\u8a5e/\u526f\u52a9\u8a5e \u30fc+\u540d\u8a5e/\u4e00\u822c/\u30fc <sp> \u306e+\u52a9\u8a5e/\u683c\u52a9\u8a5e \u3054+\u63a5\u982d\u8f9e/\u540d\u8a5e\u63a5\u7d9a \u8aac\u660e+\u540d\u8a5e/\u30b5\u5909\u63a5\u7d9a \u304c+\u52a9\u8a5e/\u4e00\u822c/\u683c\u52a9\u8a5e \u3067\u304d\u308c+\u52d5\u8a5e/\u81ea\u7acb/\u4eee\u5b9a\u5f62/\u4e00\u6bb5 \u3070+\u52a9\u8a5e/\u63a5\u7d9a\u52a9\u8a5e \u306a\u3042+\u52a9\u8a5e/\u7d42\u52a9\u8a5e \u3068+\u52a9\u8a5e/\u5f15\u7528/\u683c\u52a9\u8a5e \u601d\u3063+\u52d5\u8a5e/\u81ea\u7acb/\u9023\u7528\u30bf\u63a5\u7d9a/\u4e94\u6bb5\u30fb\u30ef\u884c\u4fc3\u97f3\u4fbf \u3066+\u52a9\u8a5e/\u63a5\u7d9a\u52a9\u8a5e \u3044+\u52d5\u8a5e/\u975e\u81ea\u7acb/\u9023\u7528\u5f62/\u4e00\u6bb5 \u307e\u3057+\u52a9\u52d5\u8a5e/\u9023\u7528\u5f62/\u7279\u6b8a\u30fb\u30de\u30b9 \u3066+\u52a9\u8a5e/\u63a5\u7d9a\u52a9\u8a5e",
"speaker": "A01M6710"
},
{
"id": "A01M6710_0615368_0621203_sp1.1",
"recording_id": "A01M6710_sp1.1",
"start": 0.0,
"duration": 5.3045625,
"channel": 0,
"text": " \u305f\u3057\u304b\u306b+\u526f\u8a5e/\u4e00\u822c \u3042\u306e+\u9023\u4f53\u8a5e \u9732\u51fa+\u540d\u8a5e/\u30b5\u5909\u63a5\u7d9a \u3092+\u52a9\u8a5e/\u4e00\u822c/\u683c\u52a9\u8a5e \u30fc+\u540d\u8a5e/\u30fc/\u56fa\u6709\u540d\u8a5e <sp> \u3042\u3063+\u611f\u52d5\u8a5e \u3054\u3081\u3093\u306a\u3055\u3044+\u611f\u52d5\u8a5e <sp> \u3042\u306e\u30fc+\u611f\u52d5\u8a5e \u3044\u308d\u3093\u306a+\u9023\u4f53\u8a5e \u3068\u3053\u308d+\u540d\u8a5e/\u975e\u81ea\u7acb/\u526f\u8a5e \u306b+\u52a9\u8a5e/\u4e00\u822c/\u683c\u52a9\u8a5e \u51fa\u3059+\u52d5\u8a5e/\u81ea\u7acb/\u57fa\u672c\u5f62/\u4e94\u6bb5\u30fb\u30b5\u884c \u3063\u3066+\u52a9\u8a5e/\u9023\u4f53\u5f62/\u683c\u52a9\u8a5e \u306e+\u540d\u8a5e/\u975e\u81ea\u7acb/\u4e00\u822c \u306f+\u52a9\u8a5e/\u4fc2\u52a9\u8a5e \u78ba\u7387+\u540d\u8a5e/\u4e00\u822c \u306f+\u52a9\u8a5e/\u4fc2\u52a9\u8a5e \u4e0a\u304c\u308b+\u52d5\u8a5e/\u81ea\u7acb/\u57fa\u672c\u5f62/\u4e94\u6bb5\u30fb\u30e9\u884c \u3093+\u540d\u8a5e/\u975e\u81ea\u7acb/\u4e00\u822c \u3067\u3059+\u52a9\u52d5\u8a5e/\u57fa\u672c\u5f62/\u7279\u6b8a\u30fb\u30c7\u30b9 \u3088+\u52a9\u8a5e/\u7d42\u52a9\u8a5e",
"speaker": "A01M6710"
}
],
"features": {
"type": "fbank",
"num_frames": 136848,
"num_features": 40,
"frame_shift": 0.01,
"sampling_rate": 16000,
"start": 0.0,
"duration": 1368.4845625,
"storage_type": "lilcom_hdf5",
"storage_path": "exp/data/fbank/train_all/feats-39.h5",
"storage_key": "6bc2e318-a256-4583-af13-3e83ac214858",
"channels": 0
},
"recording": {
"id": "A01M6710_sp1.1",
"sources": [
{
"type": "file",
"channels": [
0
],
"source": "/home/sysadmin/CSJ_RAW/WAV/core/A01M6710.wav"
}
],
"sampling_rate": 16000,
"num_samples": 21895753,
"duration": 1368.4845625,
"transforms": [
{
"name": "Speed",
"kwargs": {
"factor": 1.1
}
}
]
},
"type": "Cut"
},
OK, but this time it is different. You have a 5.3 s cut with two supervisions: one of them spans the whole cut, while the other one is much longer; it begins 9.5 seconds before and ends about 4 seconds after. So they are overlapping, and it seems that they come from the same speaker. Are you sure that this is expected and that the issue is not in the creation of the SupervisionSet?
By the way, I am not sure how well the current snowfall recipes will handle overlapped speech. In principle the training should not crash, but I don't think the model will learn anything meaningful.
If you are completely sure that your data is correct, then you can mitigate the current problem by doing something like:
for cut in cuts:
    cut.supervisions = cut.trimmed_supervisions
You can read what it does here: https://github.com/lhotse-speech/lhotse/blob/master/lhotse/cut.py#L52. You must do this before any mixing or padding though.
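If it helps, a minimal sketch of applying that per-cut fix across a whole CutSet (assuming cuts is a lhotse CutSet, and that this runs before any mixing or padding):
from lhotse import CutSet

# Replace each cut's supervisions with versions trimmed to the cut boundaries,
# then rebuild the CutSet. This must happen before any mixing or padding.
fixed_cuts = []
for cut in cuts:
    cut.supervisions = cut.trimmed_supervisions
    fixed_cuts.append(cut)
cuts = CutSet.from_cuts(fixed_cuts)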
Hi,
Yes, it is from the same speaker. The segments look fine, and the issue is not in the creation of the SupervisionSet.
Let me try the above method and check.
Regards, Mohit
Hi @pzelasko,
I have finally started the training and it is running fine so far.
The changes in the validation function look good.