kaldi
make_rttm.py issue
Here's an MWE of a problem in callhome_diarization/v1/diarization/make_rttm.py. The problem arises when it tries to merge overlapping segments: it cannot deal with certain cases where a segment is a subsegment of another (e.g., if a speaker briefly speaks early on within a longer segment from another speaker). Specifically, given two consecutive utterances u1 and u2, if (u1.end + u2.begin)/2 > u2.end, there will be a negative duration in the resulting RTTM.
$ cd $KALDI/egs/callhome_diarization/v1
$ echo -e "utt1 reco1 0 10\nutt2 reco1 3 5" > tmp.segments
$ echo -e "utt1 0\nutt2 1" > tmp.labels
$ cat tmp.segments
utt1 reco1 0 10
utt2 reco1 3 5
$ cat tmp.labels
utt1 0
utt2 1
$ ./diarization/make_rttm.py tmp.segments tmp.labels tmp.rttm
$ cat tmp.rttm
SPEAKER reco1 0 0.000 6.500 <NA> <NA> 0 <NA> <NA>
SPEAKER reco1 0 6.500 -1.500 <NA> <NA> 1 <NA> <NA>
The script egs/wsj/s5/steps/segmentation/convert_utt2spk_and_segments_to_rttm.py
might be applicable to your case. You can use SCTK's rttmSmooth.pl -s 0
to merge nearby same-speaker segments if needed.
@mmaciej2 you might want to look into this at some point.
@mmaciej2 when you have a chance, could you think about what we should do with this issue?
@dsmiller It seems to me that the issue here is that you are misunderstanding the usage of this script. I will clarify what this particular make_rttm.py script is for. In a sense, it is to remove "fuzzy" speaker change boundaries.
This script is meant to produce an rttm file from a sliding-window diarization system. More specifically, the way it is "handling overlap" is to place hard speaker boundaries where we detect a speaker change (i.e., two adjacent segments have different speaker labels). Because of the sliding window, adjacent segments will overlap, but that overlap does not come from any detected overlapping speech; it is just an artifact of the sliding-window method.
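To make the boundary placement concrete, here is a minimal sketch of the idea, not the actual make_rttm.py code. It assumes the boundary goes at the midpoint (u1.end + u2.begin)/2 named in the original report, and it leaves out the merging of adjacent same-speaker segments. The function name and the (start, end, label) tuple layout are made up for illustration.

```python
def place_hard_boundaries(segments):
    """Turn overlapping sliding-window segments into non-overlapping ones.

    segments: list of (start, end, label) for one recording, sorted by start.
    Returns a list of (start, duration, label), as in an RTTM line.
    """
    segs = [[start, end, label] for start, end, label in segments]
    for cur, nxt in zip(segs, segs[1:]):
        if cur[1] > nxt[0]:  # adjacent segments overlap
            # Assumed rule: put the hard boundary halfway through the overlap.
            midpoint = (cur[1] + nxt[0]) / 2.0
            cur[1] = midpoint
            nxt[0] = midpoint
    return [(start, end - start, label) for start, end, label in segs]


# Typical sliding-window input: the overlap is split cleanly.
print(place_hard_boundaries([(0.0, 1.5, "A"), (0.75, 2.25, "B")]))
# The contained-segment case from the report: the second duration goes negative.
print(place_hard_boundaries([(0.0, 10.0, "0"), (3.0, 5.0, "1")]))
```

With the segments from the MWE, the second call gives a start of 6.5 and a duration of -1.5 for speaker 1, which matches the reported RTTM.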
The script fails in the case you described because that setup is somewhat nonsensical for this script's purpose. The script is designed to produce output that contains no overlapping speech, and it is unclear what the correct way to handle a segment being entirely contained within another segment would be.
I ran into this problem while creating a diarization test set. I had multiple single-channel files, each containing one side of a conversation (like much of LDC's data). So to create a diarization test set I mixed the audio channels back together and used VAD on the individual channels to create reference labels (some of which overlap or are proper subsegments). So the labels needed to be dropped or altered.
I solved the problem by dropping segments that were proper subsegments. But I assume other people will find themselves in similar situations, so it may be useful to have a robust script for this purpose.
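The workaround described above could look roughly like the sketch below. This is an illustration under assumptions, not the code that was actually used: the function name and the (utt_id, start, end) tuple layout are hypothetical, and exact duplicates of a segment are dropped along with proper subsegments.

```python
def drop_contained_segments(segments):
    """Drop every segment that lies entirely inside another segment.

    segments: list of (utt_id, start, end) for one recording, in any order.
    Returns the surviving segments sorted by start time.
    """
    kept = []
    # Sort by start ascending and end descending, so that an enclosing
    # segment is always visited before the segments it contains.
    for utt_id, start, end in sorted(segments, key=lambda s: (s[1], -s[2])):
        if kept and end <= kept[-1][2]:
            continue  # ends no later than the furthest end so far: contained
        kept.append((utt_id, start, end))
    return kept


print(drop_contained_segments([("utt1", 0.0, 10.0), ("utt2", 3.0, 5.0)]))
```

Segments that only partially overlap are kept, so the output can still contain overlap.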
I'm not entirely sure what it is you are trying to do. I have created diarization test sets by mixing individual channels in the past, and created the "ground truth" rttm file with some very basic text processing. I did not do any special processing—it was essentially just concatenating the different channel label references together and converting it into the rttm file format. Is there some kind of segment processing you want to do?
Hi @mmaciej2,
I ran into this issue after I used extract_xvectors.sh. The resulting segments have overlapping parts. I think it is because the subsegments have overlapping parts. They were created by the get_uniform_subsegments.py command. See https://github.com/kaldi-asr/kaldi/blob/master/egs/callhome_diarization/v1/diarization/nnet3/xvector/extract_xvectors.sh#L108
The segments were copied from subsegments_data at the end of the extract_xvectors.sh script: https://github.com/kaldi-asr/kaldi/blob/master/egs/callhome_diarization/v1/diarization/nnet3/xvector/extract_xvectors.sh#L144
@oplatek,
The output of extract_xvectors.sh should contain overlapping segments. The extract_xvectors.sh script (and more specifically the get_uniform_subsegments.py script) takes in a non-overlapping segmentation (i.e., from speech activity detection) and produces overlapping subsegments.
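As a rough illustration of why sliding-window subsegments overlap only with their neighbours, here is a minimal sketch. It is not the logic of get_uniform_subsegments.py: the function name, the window and shift values, and the handling of the final window are all assumptions chosen for the example.

```python
def uniform_subsegments(start, end, window=1.5, shift=0.75):
    """Cut one speech segment into overlapping fixed-length windows.

    window and shift are in seconds; both values are assumed for illustration.
    The last window is truncated at the end of the segment.
    """
    subsegments = []
    t = start
    while t < end:
        subsegments.append((t, min(t + window, end)))
        if t + window >= end:
            break  # this window already reaches the end of the segment
        t += shift
    return subsegments


print(uniform_subsegments(0.0, 3.0))
```

Each window starts and ends later than the previous one, so no window is ever contained in another as long as the input speech segments themselves do not overlap.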
The make_rttm.py script will break if it is given a segment that is entirely contained within another segment. This should not be possible with the output of extract_xvectors.sh unless the input to that script is incorrect. It is very important that the input segmentation to extract_xvectors.sh reflects true speech activity detection segmentation rather than ASR segmentation, where you can have overlapping segments due to multiple people speaking. But in general we should not have that information for diarization, since that is part of the diarization task.
@mmaciej2,
Would it be possible to add a test for this bad input in make_rttm.py? You could print out an error explaining why it's failing. I think a few people have run into this issue now.
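A check along these lines might look like the sketch below. It is only a suggestion of what such a test could be, not code from make_rttm.py; the function name, the tuple layout, and the error message are hypothetical, and it assumes the segments of a recording are already sorted by start time.

```python
def check_no_contained_segments(segments):
    """Raise ValueError if a segment lies entirely inside an earlier one.

    segments: (utt_id, start, end) tuples for one recording, sorted by start.
    """
    outer = None  # the segment seen so far with the furthest end time
    for utt_id, start, end in segments:
        if outer is not None and end <= outer[2]:
            raise ValueError(
                "segment %s (%.3f-%.3f) lies entirely inside segment %s "
                "(%.3f-%.3f); expected sliding-window subsegments of a "
                "non-overlapping speech activity detection segmentation"
                % (utt_id, start, end, outer[0], outer[1], outer[2]))
        outer = (utt_id, start, end)


try:
    check_no_contained_segments([("utt1", 0.0, 10.0), ("utt2", 3.0, 5.0)])
except ValueError as err:
    print("bad input:", err)
```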
@mmaciej2 I double-checked that the segments in the input do not overlap.
I do not understand what other requirements there are. Can you please explain in more detail?
I checked it with the snippet below, where:
- name is the name of the recording,
- (xs, xe) are the start and end times of segment x,
- (ys, ye) are the start and end times of segment y,
- dsegs is a dictionary holding the list of all segments per recording.
In [35]: for name, lst in dsegs.items():
    ...:     for xs, xe in lst:
    ...:         for ys, ye in lst:
    ...:             if (xs < ys and ys < xe) or (xs < ye and ye < xe):
    ...:                 print('overlap for ', name, xs, xe, ys, ye)
@oplatek,
As far as I know, if there is no overlap in the input, it shouldn't be producing incorrect output.
Can you share some of the problematic output and the corresponding input that created it?
@mmaciej2 thank you for the help. I updated the script to validate the input.
It pointed me to the fact that I have a lot of consecutive segments. For example: (26.12, 26.66) and (26.66, 27.56), or (35.04, 35.52) and (35.52, 36.53).
In [47]: for name, lst in dsegs.items():
    ...:     for i, (xs, xe) in enumerate(lst):
    ...:         for j, (ys, ye) in enumerate(lst):
    ...:             if ((xs <= ys and ys <= xe) or (xs <= ye and ye <= xe)) and i != j:
    ...:                 print('overlap for ', name, xs, xe, ys, ye, i, j)
...
('overlap for ', 'test21.wav', 26.12, 26.66, 26.66, 27.56, 4, 5)
('overlap for ', 'test21.wav', 26.66, 27.56, 26.12, 26.66, 5, 4)
('overlap for ', 'test21.wav', 35.04, 35.52, 35.52, 36.53, 7, 8)
('overlap for ', 'test21.wav', 35.52, 36.53, 35.04, 35.52, 8, 7)
Should consecutive segments be considered valid input or not?
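For reference, the two checks differ only in whether segments that merely touch count as overlapping. The sketch below uses the standard strict interval test, under which back-to-back segments such as (26.12, 26.66) and (26.66, 27.56) are not reported. Whether make_rttm.py accepts such input is not settled in this thread; the function name is made up for illustration.

```python
from itertools import combinations


def overlapping_pairs(segments):
    """Return index pairs (i, j), i < j, of segments that share some time.

    segments: list of (start, end). Two intervals overlap exactly when each
    one starts before the other ends; segments that only touch are skipped.
    """
    return [
        (i, j)
        for (i, (xs, xe)), (j, (ys, ye)) in combinations(enumerate(segments), 2)
        if xs < ye and ys < xe
    ]


# The first two segments only touch; the second and third really overlap.
print(overlapping_pairs([(26.12, 26.66), (26.66, 27.56), (27.0, 28.0)]))
```

The same test also catches a segment contained within another, and each pair is reported once rather than twice.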
FYI: My problem was that I was trying to represent multiple speakers in utt2spk, which led me to have utterances prefixed by spkID. As a consequence, the rttm files were not sorted by timestamp, which is what make_rttm expects. Every single time the timestamps were out of order, a negative duration emerged.
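One way to avoid this, sketched under assumptions: order the segments by recording and start time instead of by utterance ID before they reach make_rttm. The function name and the in-memory tuple layout (mirroring the utt, reco, start, end columns of a segments file) are hypothetical.

```python
def sort_segments_by_time(segments):
    """Order segments by recording, then start time, then end time.

    segments: list of (utt_id, reco_id, start, end) tuples.
    """
    return sorted(segments, key=lambda s: (s[1], s[2], s[3]))


# Speaker-prefixed IDs sort "spkA-..." before "spkB-..." regardless of time.
segments = [
    ("spkA-utt2", "reco1", 5.0, 8.0),
    ("spkB-utt1", "reco1", 0.0, 4.0),
]
print(sort_segments_by_time(segments))
```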
I need help: I want to convert .lab files to rttm, but I am not able to run make_rttm.py.