Problem with CutSet.from_manifests

Open juliendespres opened this issue 1 year ago • 6 comments

Hi, I'm having a problem with the from_manifests function in the CutSet class.

I've decomposed a CutSet manifest using the CutSet.decompose() function so as to obtain the 3 files "features", "recordings" and "supervisions", with the aim of modifying the "supervision" file and then regenerating the CutSet file.

The problem occurs when I try to recompose these three files with the CutSet.from_manifests function, I get the following error:

Traceback (most recent call last):
  File "local/recompose_manifest.py", line 97, in <module>
    main()
  File "local/recompose_manifest.py", line 86, in main
    cut_set = CutSet.from_manifests(recordings=recordings, supervisions=supervisions, features=features)
  File "/home/despres/miniconda3/envs/k2_2312/lib/python3.8/site-packages/lhotse/cut/set.py", line 352, in from_manifests
    return create_cut_set_eager(
  File "/home/despres/miniconda3/envs/k2_2312/lib/python3.8/site-packages/lhotse/cut/set.py", line 3003, in create_cut_set_eager
    recording=recordings[feats.recording_id] if rec_ok else None,
  File "/home/despres/miniconda3/envs/k2_2312/lib/python3.8/site-packages/lhotse/audio/recording_set.py", line 389, in __getitem__
    return next(
StopIteration

This function works without a problem if I pass any subset of only 2 files as parameters ("supervision+features", "features+recordings", "supervisions+recording").

Is it a bug, or is this function simply not designed for it?

If it is not designed for this, is there another way of regenerating this CutSet file without having to regenerate the features?

Thank you very much for your time.

juliendespres avatar Dec 18 '23 14:12 juliendespres

I don't think decompose was ever tested in this way, although I would have expected it to work. I'm afraid I don't have enough time right now to look into it myself. Generally you should be able to create a CutSet from 2 components (e.g. features + supervisions) and then manually attach the third one (e.g. recordings) in a for loop. If you happen to find what the issue is, please share it with us.

pzelasko avatar Dec 20 '23 19:12 pzelasko
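
The "attach the third component in a for loop" idea could look roughly like the sketch below. It deliberately does not import Lhotse: the Recording, Features and Cut classes are hypothetical stand-ins that only mirror the attribute names mentioned in this thread (a cut with features and recording attributes, features with a recording_id). With real Lhotse objects the loop body should be similar, but that is an assumption and has not been checked against the library here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Recording:  # hypothetical stand-in, not lhotse's Recording
    id: str


@dataclass
class Features:  # hypothetical stand-in; only recording_id matters here
    recording_id: str


@dataclass
class Cut:  # hypothetical stand-in for a cut built from features + supervisions
    id: str
    features: Features
    recording: Optional[Recording] = None


def attach_recordings(cuts: Iterable[Cut], recordings: Iterable[Recording]) -> List[str]:
    """Attach a recording to each cut in place; return IDs of cuts with no match."""
    by_id: Dict[str, Recording] = {r.id: r for r in recordings}
    unmatched: List[str] = []
    for cut in cuts:
        rec = by_id.get(cut.features.recording_id)
        if rec is None:
            unmatched.append(cut.id)  # report the gap instead of raising
        else:
            cut.recording = rec
    return unmatched


cuts = [Cut("c1", Features("rec-a")), Cut("c2", Features("rec-missing"))]
unmatched = attach_recordings(cuts, [Recording("rec-a")])
```

Collecting unmatched cut IDs rather than raising makes it easier to see whether the recordings and features manifests disagree about IDs, which is one possible reading of the StopIteration above.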

Thank you for your response. I'm not sufficiently proficient in Python to do this kind of trick, but I ended up easily replacing the content of the text tag in the jsonl manifest with a simple perl script.

However, this feature seems to me to be essential to avoid having to regenerate features every time you change a comma in the supervision texts, and it would be interesting to be able to do this simply in future Lhotse developments.

juliendespres avatar Dec 23 '23 11:12 juliendespres
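
For readers who would rather stay in Python, the text replacement described above can be sketched with the standard library alone. This assumes each line of the cut manifest is a JSON object with a "supervisions" list whose entries carry "id" and "text" keys; the function name and the example IDs are made up for illustration.

```python
import json
from typing import Dict, Iterable, Iterator


def replace_supervision_texts(
    lines: Iterable[str], new_texts: Dict[str, str]
) -> Iterator[str]:
    """Yield cut manifest lines with supervision texts replaced, keyed by supervision ID."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        cut = json.loads(line)
        for sup in cut.get("supervisions", []):
            if sup.get("id") in new_texts:
                sup["text"] = new_texts[sup["id"]]
        # Everything else in the cut (features, recording) is written back untouched.
        yield json.dumps(cut, ensure_ascii=False)


example = [
    '{"id": "cut-1", "supervisions": [{"id": "sup-1", "text": "hello , world"}], '
    '"features": {"storage_key": "kept"}}'
]
updated = list(replace_supervision_texts(example, {"sup-1": "hello, world"}))
```

Because only the "text" values change, the feature entries in the manifest stay as they were, so nothing has to be recomputed. For a gzip-compressed manifest, the lines could be read and written with gzip.open in text mode.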

Thanks, you're right. I'll keep the issue open for now.

pzelasko avatar Dec 24 '23 03:12 pzelasko

I have the same issue. I'm doing this for the purpose of undoing trim_to_supervisions.

RuABraun avatar Jan 10 '24 17:01 RuABraun

Seems to be because features doesn't have a recording_id (or anything else that knows what cut it was a part of).

RuABraun avatar Jan 10 '24 18:01 RuABraun

Features does have a recording_id field. If you can provide some way to reproduce with a small dataset like yesno or mini Librispeech I can look into it.

pzelasko avatar Jan 12 '24 15:01 pzelasko
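
One way to narrow this down before building a reproduction is to check whether every recording_id referenced in the features manifest exists in the recordings manifest. The sketch below is a standard-library diagnostic, not Lhotse code; it assumes JSONL manifests where recordings carry an "id" key and features carry a "recording_id" key, and the function name is hypothetical.

```python
import json
from typing import Iterable, Set


def missing_recording_ids(
    feature_lines: Iterable[str], recording_lines: Iterable[str]
) -> Set[str]:
    """Return recording IDs referenced by features but absent from recordings."""
    known = {json.loads(line)["id"] for line in recording_lines if line.strip()}
    referenced = {
        json.loads(line)["recording_id"] for line in feature_lines if line.strip()
    }
    return referenced - known


features = ['{"recording_id": "rec-a"}', '{"recording_id": "rec-b"}']
recordings = ['{"id": "rec-a"}']
missing = missing_recording_ids(features, recordings)
```

An empty result would suggest the IDs line up and the cause lies elsewhere; a non-empty one points at the entries whose lookup would fail.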