lhotse Problem with CutSet.from_manifests

Hi, I'm having a problem with the from_manifests function in the CutSet class.

I've decomposed a CutSet manifest using the CutSet.decompose() function to obtain the 3 files "features", "recordings" and "supervisions", with the aim of modifying the "supervisions" file and then regenerating the CutSet file.

The problem occurs when I try to recompose these three files with the CutSet.from_manifests function. I get the following error:

Traceback (most recent call last):
  File "local/recompose_manifest.py", line 97, in <module>
    main()
  File "local/recompose_manifest.py", line 86, in main
    cut_set = CutSet.from_manifests(recordings=recordings, supervisions=supervisions, features=features)
  File "/home/despres/miniconda3/envs/k2_2312/lib/python3.8/site-packages/lhotse/cut/set.py", line 352, in from_manifests
    return create_cut_set_eager(
  File "/home/despres/miniconda3/envs/k2_2312/lib/python3.8/site-packages/lhotse/cut/set.py", line 3003, in create_cut_set_eager
    recording=recordings[feats.recording_id] if rec_ok else None,
  File "/home/despres/miniconda3/envs/k2_2312/lib/python3.8/site-packages/lhotse/audio/recording_set.py", line 389, in __getitem__
    return next(
StopIteration

This function works without a problem if I pass any subset of only 2 files as parameters ("supervisions+features", "features+recordings", "supervisions+recordings").

Is it a bug, or is this function simply not designed for it?

If it isn't designed for this, is there another way of regenerating this CutSet file without having to regenerate the features?

Thank you very much for your time.

Dec 18 '23 14:12 juliendespres

I don't think decompose was ever tested in this way, although I would have expected it to work. I'm afraid I don't have enough time right now to look into it myself. Generally you should be able to create a CutSet from 2 components (e.g. features + supervisions) and then manually attach the third one (e.g. recordings) in a for loop. If you happen to find what the issue is, please share it with us.
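
The "attach the third component in a for loop" workaround could look roughly like the sketch below. It is not taken from the lhotse code base: attach_recordings is a hypothetical helper, the file names in the usage comment are placeholders, and it assumes every cut is a MonoCut-style dataclass exposing recording_id and a recording field. It uses dataclasses.replace to copy each cut; lhotse's own lhotse.utils.fastcopy should serve the same purpose.

```python
from dataclasses import replace
from typing import Any, Iterable, List, Mapping


def attach_recordings(
    cuts: Iterable[Any], recordings_by_id: Mapping[str, Any]
) -> List[Any]:
    """Return copies of `cuts` with `recording` filled in by recording ID.

    Assumption: each cut is a dataclass with a `recording` field and a
    `recording_id` attribute (true for lhotse's MonoCut).
    """
    attached = []
    for cut in cuts:
        rec_id = cut.recording_id
        if rec_id not in recordings_by_id:
            # Fail loudly instead of silently producing a cut without audio.
            raise KeyError(
                f"No recording for recording_id={rec_id!r} (cut {cut.id!r})"
            )
        # `replace` builds a new cut; the input cut is left untouched.
        attached.append(replace(cut, recording=recordings_by_id[rec_id]))
    return attached


# Hypothetical usage with lhotse (file names are placeholders):
#
#   from lhotse import CutSet, load_manifest
#
#   cuts = CutSet.from_manifests(
#       features=load_manifest("features.jsonl.gz"),
#       supervisions=load_manifest("supervisions.jsonl.gz"),
#   )
#   recordings = load_manifest("recordings.jsonl.gz")
#   by_id = {rec.id: rec for rec in recordings}
#   rebuilt = CutSet.from_cuts(attach_recordings(cuts, by_id))
#   rebuilt.to_file("cuts_rebuilt.jsonl.gz")
```

Raising KeyError on an unknown ID is a deliberate choice here: it names the offending cut, which the bare StopIteration from the three-manifest call does not.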

Dec 20 '23 19:12 pzelasko

Thank you for your response. I'm not sufficiently proficient in Python to do this kind of trick, but I ended up easily replacing the content of the text tag in the jsonl manifest with a simple Perl script.
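
For anyone who prefers Python, the same kind of direct edit might be sketched as follows. This is not the Perl script mentioned above. It assumes a cut manifest in JSONL form where each line is a cut with a top-level "supervisions" list whose entries carry "id" and "text" keys; the function name and the fix_text callback are made up for illustration.

```python
import gzip
import json
from typing import Callable


def _open_text(path: str, mode: str):
    """Open a plain or gzip-compressed text file based on its extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode + "t", encoding="utf-8")
    return open(path, mode, encoding="utf-8")


def rewrite_supervision_texts(
    src: str, dst: str, fix_text: Callable[[str, str], str]
) -> int:
    """Copy a JSONL cut manifest from `src` to `dst`, rewriting supervision texts.

    `fix_text(supervision_id, old_text)` returns the new text.
    Every other field (features, recording, ...) is written back unchanged.
    Returns the number of supervisions whose text actually changed.
    """
    changed = 0
    with _open_text(src, "r") as fin, _open_text(dst, "w") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            cut = json.loads(line)
            for sup in cut.get("supervisions", []):
                old = sup.get("text")
                if old is None:
                    continue  # supervision without a transcript
                new = fix_text(sup.get("id"), old)
                if new != old:
                    sup["text"] = new
                    changed += 1
            fout.write(json.dumps(cut, ensure_ascii=False) + "\n")
    return changed
```

For example, rewrite_supervision_texts("cuts.jsonl.gz", "cuts_fixed.jsonl.gz", lambda sid, text: text.replace(" ,", ",")) would tidy up comma spacing while leaving the stored feature references alone (both file names are placeholders).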

However, this feature seems to me to be essential to avoid having to regenerate features every time you change a comma in the supervision texts, and it would be interesting to be able to do this simply in future Lhotse developments.

Dec 23 '23 11:12 juliendespres

Thanks, you're right. I'll keep the issue open for now.

Dec 24 '23 03:12 pzelasko

I have the same issue. I'm doing this for the purpose of undoing trim_to_supervisions.

Jan 10 '24 17:01 RuABraun

Seems to be because features doesn't have a recording_id (or anything else that knows what cut it was a part of).

Jan 10 '24 18:01 RuABraun

Features does have a recording_id field. If you can provide some way to reproduce with a small dataset like yesno or mini Librispeech I can look into it.
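
One quick check that may help narrow this down: the StopIteration in the traceback is raised while evaluating recordings[feats.recording_id], which suggests (the thread does not confirm it) that at least one feature entry refers to a recording ID that is missing from the recordings manifest. A minimal sketch for listing such IDs follows; unresolved_recording_ids is a hypothetical helper, and it only assumes that feature objects expose recording_id and recording objects expose id.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def unresolved_recording_ids(
    features: Iterable[Any], recordings: Iterable[Any]
) -> Dict[str, int]:
    """Map each recording_id referenced by `features` but absent from
    `recordings` to the number of feature entries that reference it."""
    known_ids = {rec.id for rec in recordings}
    missing = Counter(
        feats.recording_id
        for feats in features
        if feats.recording_id not in known_ids
    )
    return dict(missing)


# Hypothetical usage with lhotse (file names are placeholders):
#
#   from lhotse import load_manifest
#
#   features = load_manifest("features.jsonl.gz")
#   recordings = load_manifest("recordings.jsonl.gz")
#   print(unresolved_recording_ids(features, recordings))
```

An empty result would rule out mismatched IDs and point elsewhere; a non-empty one shows exactly which IDs the lookup cannot resolve.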

Jan 12 '24 15:01 pzelasko
