wscribe-editor
Request: better handling of transcripts that have already been diarized.
Sorry to post this in "Issues", but it doesn't appear that a "Discussions" option is available on this repository.
It would be great if the project could handle speech that has already been diarized. Right now, it presents everything as one contiguous block. (SRT mode 'sort of' handles it better because it appears to break on each element of the JSON list, which often (but not always) also corresponds with diarization.)
I've only really played around with it on your demo page, but it appears to load the script from the 'words' key in the segment list elements (and otherwise ignores the segment-level 'text' key).
[
{
"text": " I'd like to start by asking you to tell me about your experiences with.",
"start": "00:00:02.500",
"end": "00:00:07.620",
"score": 0.7476367237578648,
"words": [
{
"text": "I'd",
"score": 0.763916015625,
"start": "00:00:02.921",
"end": "00:00:03.022"
},
{
"text": "like",
"score": 0.86474609375,
"start": "00:00:03.062",
"end": "00:00:03.222"
},
{
"text": "to",
"score": 0.9033203125,
"start": "00:00:03.262",
"end": "00:00:03.343"
},
{
"text": "start",
"score": 0.8740234375,
"start": "00:00:03.383",
"end": "00:00:03.644"
},
...
This is from a two-person interview. And ideally it would look something like this:
I would propose one of the following to accommodate 'pre-diarized' audio:
- The word that starts a change of speaker could be prefixed with an identifier. For example, I've put "\nSpeaker 1:" in front of the first word spoken by Speaker 1.
{
"text": "\nSpeaker 1: I'd",
"score": 0.763916015625,
"start": "00:00:02.921",
"end": "00:00:03.022"
}
It does appear that the newline character is preserved on export, but it is currently not rendered on the edit page upon load. Perhaps that's a quick-and-dirty fix?
- Or perhaps a more robust integration with RTTM data such that the imported JSON has a speaker ID flag. Perhaps like:
{
"text": "I'd",
"speaker":1,
"score": 0.763916015625,
"start": "00:00:02.921",
"end": "00:00:03.022"
}
or
{
"text": "I'd",
"speaker":"John",
"score": 0.763916015625,
"start": "00:00:02.921",
"end": "00:00:03.022"
}
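For what it's worth, the two options don't have to be exclusive: the prefix form from the first option could be converted into the 'speaker' key form on import. Here is a rough Python sketch of that idea. The function name `prefix_to_speaker_key` and the exact marker pattern (newline, label, colon) are my own assumptions for illustration, not anything the editor does today.

```python
import re

# Assumed marker format: a newline, a speaker label, a colon, then the word itself.
SPEAKER_PREFIX = re.compile(r"^\n(?P<speaker>[^:\n]+):\s*(?P<word>.*)$", re.DOTALL)


def prefix_to_speaker_key(words):
    """Return copies of the word entries with the prefix moved into a 'speaker' key.

    Words without a prefix inherit the most recent speaker (None until one appears).
    """
    current = None
    result = []
    for entry in words:
        new_entry = dict(entry)  # copy so the input list is left untouched
        match = SPEAKER_PREFIX.match(entry["text"])
        if match:
            current = match.group("speaker").strip()
            new_entry["text"] = match.group("word")
        new_entry["speaker"] = current
        result.append(new_entry)
    return result


words = [
    {"text": "\nSpeaker 1: I'd", "start": "00:00:02.921", "end": "00:00:03.022"},
    {"text": "like", "start": "00:00:03.062", "end": "00:00:03.222"},
]
tagged = prefix_to_speaker_key(words)
```

With this, both words end up tagged with "Speaker 1" and the first word's text is back to plain "I'd", so the rest of the pipeline would only need to understand the 'speaker' key.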
I lean toward adding a 'speaker' key to the structure. It seems more robust, and it would likely make it easier to maintain the segmentation when exporting the edited results than relying on embedded newline characters, which could easily be lost. The presence of a 'speaker' key, in either the word-level entry or the top-level segment, would then trigger paragraph breaks and maintain them upon export.
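To make the 'speaker' key idea a bit more concrete, here is a minimal Python sketch of how a word-level key could drive paragraph breaks and survive a round trip on export. The helper names (`group_by_speaker`, `render_paragraphs`, `flatten`) are hypothetical, the sample words are made up, and timestamps and scores are left out for brevity.

```python
from itertools import groupby


def group_by_speaker(words):
    """Split a flat word list into (speaker, [words]) blocks at each speaker change."""
    return [
        (speaker, list(run))
        for speaker, run in groupby(words, key=lambda w: w.get("speaker"))
    ]


def render_paragraphs(blocks):
    """Render one paragraph per block, separated by a blank line."""
    lines = []
    for speaker, run in blocks:
        text = " ".join(w["text"] for w in run)
        lines.append(f"Speaker {speaker}: {text}" if speaker is not None else text)
    return "\n\n".join(lines)


def flatten(blocks):
    """Export: rebuild the flat word list; each word takes its block's speaker."""
    flat = []
    for speaker, run in blocks:
        for word in run:
            entry = dict(word)
            if speaker is not None:
                entry["speaker"] = speaker
            flat.append(entry)
    return flat


# Illustrative data only.
words = [
    {"text": "I'd", "speaker": 1},
    {"text": "like", "speaker": 1},
    {"text": "Sure", "speaker": 2},
]
blocks = group_by_speaker(words)
paragraphs = render_paragraphs(blocks)
```

Because the speaker lives on the data rather than in the text, `flatten(blocks)` gives back the original list unchanged, which is the property that embedded newlines can't guarantee.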
The downside is that if some text is misattributed, the editor should be able to handle reassignment to the correct speaker, which adds some complications. My off-the-cuff thought is that if diarization exists in the JSON, the default 'transcript' view changes from one large DIV into separate DIVs based on speaker ID, much like what is currently done when selecting SubTitle mode. Thus, if a word at the end of a sentence is misattributed, it could be 'cut-and-pasted' into the other block. Or, if necessary, something similar to 'add new segment' could be added for 'add new speaker'.
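On the data side, the 'cut-and-paste into the other block' case could boil down to changing the 'speaker' value on a range of words and regrouping. A small Python sketch under that assumption follows; `reassign` and `speaker_blocks` are hypothetical names, and the sample words are invented for the example.

```python
from itertools import groupby


def reassign(words, start, end, new_speaker):
    """Return a copy of words with words[start:end] attributed to new_speaker."""
    return [
        dict(w, speaker=new_speaker) if start <= i < end else dict(w)
        for i, w in enumerate(words)
    ]


def speaker_blocks(words):
    """Regroup into (speaker, text) blocks, one per consecutive run of a speaker."""
    return [
        (speaker, " ".join(w["text"] for w in run))
        for speaker, run in groupby(words, key=lambda w: w.get("speaker"))
    ]


# Illustrative data only: "Sure," was wrongly attributed to speaker 1.
words = [
    {"text": "with.", "speaker": 1},
    {"text": "Sure,", "speaker": 1},
    {"text": "happy", "speaker": 2},
    {"text": "to.", "speaker": 2},
]
fixed = reassign(words, 1, 2, 2)
before = speaker_blocks(words)
after = speaker_blocks(fixed)
```

Here the misattributed word simply moves to the start of speaker 2's block after regrouping, and reassigning to a speaker ID that doesn't exist yet would naturally create a new block, which might cover the 'add new speaker' case as well.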
Cool project. Thanks for sharing. I'm going to dig into the code to see if there's something I can do to help in this regard.