Multitrack Files
Discussed in https://github.com/kaixxx/noScribe/discussions/239
Originally posted by sebastian-stix, October 29, 2025: I don't need speaker identification, since I export multitrack audio for Auphonic, which already gives me perfectly separated speakers. Once this feature is available, noScribe will definitely be my first choice.
Thank you for the suggestion. The use case with Auphonic might be a bit special, in particular since Auphonic already comes with its own transcription solution, as far as I can tell. However, we have a similar situation with Zoom recordings. So, I might consider this idea for a future version of noScribe. Unfortunately, it might require some non-trivial changes in the architecture of the whole app. So, I don't want to promise anything.
Great! I'm using Reaper (ultraschall.fm) and export multitrack by default. If noScribe can create transcripts from that, this will replace Auphonic for me completely.
Hey, I have used OBS Studio for some of my interview recordings and also put myself and the interviewee on different (stereo) audio tracks. I then use ffmpeg to merge those into one stereo track:
ffmpeg.exe -i "myinputwith2stereoaudiotracks.mp4" -filter_complex "[0:a:0]pan=mono|c0=FL[a1];[0:a:1]pan=mono|c0=FL[a2];[a1][a2]amerge=inputs=2[aout]" -map 0:v -map "[aout]" -c:v copy -c:a aac -ac 2 -b:a 192k "myoutputwithonepannedstereotrack.mp4"
Since noScribe already utilizes ffmpeg for an audio-extraction step, enabling a downmix of a two-track file to a single stereo track seems like just one simple switch. Maybe the detection of a two-track input can be automated, with a hint on how to manually downmix more than two input tracks.
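To make the one-liner above a bit easier to follow, here is a sketch (in Python, not noScribe code) that assembles the same `-filter_complex` string piece by piece. The input and output file names are placeholders:

```python
# Sketch: build the ffmpeg filter used in the command above.
# It takes the left channel of each of the two stereo tracks,
# turns each into a mono stream, and merges them into one stereo track.
def build_two_track_stereo_filter() -> str:
    parts = [
        "[0:a:0]pan=mono|c0=FL[a1]",      # track 0 -> mono from its left channel
        "[0:a:1]pan=mono|c0=FL[a2]",      # track 1 -> mono from its left channel
        "[a1][a2]amerge=inputs=2[aout]",  # merge both mono streams into one stereo stream
    ]
    return ";".join(parts)

# Assembled command line (file names are placeholders):
cmd = [
    "ffmpeg", "-i", "input_two_tracks.mp4",
    "-filter_complex", build_two_track_stereo_filter(),
    "-map", "0:v", "-map", "[aout]",
    "-c:v", "copy", "-c:a", "aac", "-ac", "2", "-b:a", "192k",
    "output_panned_stereo.mp4",
]
```

The result is one stereo track with speaker one on the left channel and speaker two on the right.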
You are right, that would be feasible. However, the long-term goal of noScribe would be to transcribe the different files/tracks independently and get perfect speaker diarization that way.
However, adding a hint for manual mixing would be great.
> You are right, that would be feasible. However, the long-term goal of noScribe would be to transcribe the different files/tracks independently and get perfect speaker diarization that way.
That would be much better, indeed. But until then, a built-in downmix or a hint on how to do that manually would save people from having to know or figure out how to achieve it. (I have that knowledge, but nearly none of my students do.) UI-wise, that could be done as a warning or an error at the audio-extraction step, when ffmpeg detects more than one track: "This file has more than one audio track. Only the first one will be transcribed with the current version of noScribe. If you want to downmix the tracks into one, you can do that with our shipped ffmpeg and the following command: THE COMMAND"
I'm a little confused by "... mixing down a two track file to a single stereo track one." Do you mean mixing it down to a mono track? This would indeed be very good. I think this also doesn't need a switch or a separate message, but should be the default behavior.
That is, of course, until we manage to transcribe both tracks separately and get perfect speaker separation. I agree that this would be the ideal solution, but it may require some fundamental changes to the internal architecture of noScribe.
> Do you mean mixing it down to a mono track?
I mixed my two audio tracks to one stereo track, where the first track is the left channel and the second is the right channel. This is what my ffmpeg command does. That way I still had a clear separation that I hoped would help to separate the speakers. If that is not the case, then a mono mix sounds like a better solution.
Ah, I see. So if you have a stereo track with one speaker in the left channel and one in the right, noScribe is transcribing it correctly? This is what I was hoping, but never really tested.
I understand now that you have two separate files that you want to combine into one transcript. This clashes a little bit with the default behavior implemented in the upcoming version 0.7: If multiple files are selected, multiple transcription jobs will be created and placed in the queue (which is also new in 0.7). I don't want to change this now, since we are very close to the release. But I might consider adding a dialog in the future, asking whether multiple files should be combined into one transcript or not.
> I understand now that you have two separate files that you want to combine into one transcript.
No, not two files, just one mp4 file with multiple audio tracks: one for me (local, straight from the mic) and one for the interviewee (coming from the PC audio out of the video conference). In OBS Studio you can record your audio sources to separate tracks in the output file.
From those multiple tracks, noScribe currently only uses the first. So a downmix to one (stereo) track has to be done beforehand.
My resulting audio with one speaker left and the other right worked quite well then.
Wouldn't it be easier to make two (or more) separate transcripts, one for every track, and then combine the outputs outside of noScribe?
> Wouldn't it be easier to make two (or more) separate transcripts, one for every track, and then combine the outputs outside of noScribe?
What do you mean by easier? It would take roughly double the processing time, requires preparing the input file, and adds a manual merging step for the resulting transcripts (messy in the case of overlapping speakers). Versus a quite well-working speaker detection on the downmix.
All my other files (coming directly from Teams with its useless transcription) had a stereo downmix. The transcription with noScribe was much better, even though it did not have access to the isolated speaker streams. The separation is cleaner in Teams, but its transcription is so bad that it twists the meaning of what is said in many places. Astonishing.
To me it doesn't make sense if you already have your speakers in two different files (or tracks, which is basically the same) and then merge them just to have them separated by AI again, or am I getting something wrong here? You can disable the speaker recognition, which should speed things up a lot if you already have them separated.
And of course the merging shouldn't happen manually, but programmatically, using the time codes.
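The programmatic merge by time codes could look roughly like this minimal sketch (not noScribe code; segment format and speaker names are illustrative): each track is transcribed separately into (start time, text) segments, then the per-speaker lists are interleaved by start time.

```python
# Sketch: merge per-track transcripts into one transcript ordered by time code.
def merge_transcripts(tracks):
    """tracks: dict mapping speaker name -> list of (start_seconds, text) segments."""
    merged = [
        (start, speaker, text)
        for speaker, segments in tracks.items()
        for start, text in segments
    ]
    merged.sort(key=lambda entry: entry[0])  # interleave by start time
    return merged

# Illustrative input: two independently transcribed tracks.
combined = merge_transcripts({
    "Interviewer": [(0.0, "Hello!"), (12.5, "Next question...")],
    "Guest": [(3.2, "Hi, thanks for having me.")],
})
# Each entry of `combined` is (start, speaker, text), sorted by start time.
```

Overlapping speech simply produces adjacent entries with nearly identical start times, which is arguably cleaner than what diarization on a mixed-down track can recover.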
I accept the loss of accuracy here, and it works well enough. I'm always looking for a simple-to-explain workflow so that I can tell my students how to use it on their own. An extra step to merge two transcripts together adds complexity to an already overwhelming task. Maybe I'm just missing a simple, automatically working tool for that.
(Too) many of my students don't even know how to work with a real computer, as they are used only to iPads, which were their workhorses at school. So even splitting up or downmixing the multitrack audio with ffmpeg looks like wizardry to those students. (I'm aware that we ourselves in the late 1990s were able to figure that stuff out even without anyone teaching us and without well-written tutorials on the internet. And that that need made us the mature users we are now.)
My intent for the request was the workflow in any DAW software:
Example workflow: I'm editing and mixing a multitrack production in the DAW software.
- Track 1: Speaker 1
- Track 2: Speaker 2
- Track 3: Ambience (Atmo)
- Track 4: Music
- Track 5: External (phone)
If I try to transcribe the master mix, there are ambience, music, and mastering effects -> bad result, useless transcript.
But I can render the production to separate tracks AND the master mix. Output files: Track 1, Track 2, Track 3, ... and the master mix as well.
For noScribe, I would use only tracks 1, 2, 3, and 5 to create the transcript and would get a perfect result. Merging by noScribe is useless when using a DAW.
The key is to merge the separate transcripts (one per track) into one output file with the correct speaker names.
This workflow is used in any DAW software. We already have separate tracks - let's use them for almost perfect results.
> We already have separate tracks - let's use them for almost perfect results.
I think we all think so. Don't get me wrong.
Until that is possible directly in noScribe (using multiple tracks or separate input files for each speaker, leading to one transcript), a little warning or a how-to for performing a downmix would IMHO be better than silently transcribing only the first audio track. Or a hint on how to extract the tracks and later merge two transcriptions into one in a simple way.
From a UX perspective, it would IMHO be best to detect a multitrack input file and ask how to handle it:
- Assume every track contains one speaker and provide a track selector with a speaker-name field that will be used in the transcript (-> skip speaker detection entirely, perform n runs, and merge into a single transcript automatically)
- Ignore all but the first track (-> current behavior; maybe better to provide a track selector to allow multiple runs on the same input file, as long as option 1 is not yet implemented)
- Provide a track selector and perform a downmix of the selected tracks (-> build and add a complex-filter option for ffmpeg and use it in the audio-extraction step, which would only be useful in edge cases once option 1 is available)
I agree with almost everything said in this discussion. As a result, I see a short term and a long term goal:
- Short term: Downmix any multitrack file (not only stereo ones) so that noScribe transcribes all tracks, not only the first one. @spackmat: Could you provide an example multitrack file so I can reproduce the issue?
- Long term: Use multiple tracks for perfect speaker separation. The problem is that noScribe currently transcribes the input file as a continuous stream, logging the results on screen and writing them to the output file. Doing this with multitrack audio would require transcribing multiple tracks in parallel and joining the output on the fly. This is not easily possible with Whisper, as far as I can see. The only option would be to transcribe one track after the other and join the output later. This, however, requires some fundamental changes in the internal architecture of noScribe.
> Could you provide an example multitrack file so I can reproduce the issue?
Sure, here it is (from a German Teams test call recorded with my phone-interview settings).
https://github.com/user-attachments/assets/e50320ef-6675-4b14-b731-658b48a61877
ffmpeg shows you what tracks are in a file, so one can use that output to detect a multitrack input file. My ffmpeg command from above contains a possible filter setting that downmixes the two tracks into one stereo track, with the first track on the left channel and the second on the right. That is a special case; with more than two tracks, all tracks should probably be downmixed to one mono track.
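Detection can also be done with ffprobe (which ships alongside ffmpeg) instead of parsing ffmpeg's log output, since ffprobe can emit the stream list as JSON. A minimal sketch, assuming ffprobe is on the PATH; the `sample` string below only illustrates the JSON shape, it is not real ffprobe output:

```python
import json
import subprocess

def count_audio_tracks(ffprobe_json: str) -> int:
    """Count the audio streams (tracks) in ffprobe's JSON output."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return sum(1 for s in streams if s.get("codec_type") == "audio")

def probe(path: str) -> str:
    """Run ffprobe on a file and return its JSON stream listing."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Illustrative JSON shape: one video stream and two audio tracks,
# as in the OBS recording described earlier in this thread.
sample = '{"streams": [{"codec_type": "video"}, {"codec_type": "audio"}, {"codec_type": "audio"}]}'
```

If `count_audio_tracks(probe(path)) > 1`, the app could show the warning dialog proposed earlier in this thread.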
@kaixxx There is already a PR regarding the ffmpeg code (#236 ). What do you think about merging the PR after the next release and then I will try to incorporate a user warning / suggestion?
Thank you for the example, @spackmat. I was a bit puzzled that it doesn't work, since in the current ffmpeg command, I'm already combining all channels into one mono file with -ac 1.
But I have now learned that there is a difference between multiple channels and multiple tracks in one audio file. This was my confusion from the very beginning. Downmixing all these tracks into a single one is a little trickier. So yes, @mutlusun, it is a good idea to combine this with the switch to PyAV. Thank you. Remember, however, that we must replicate this in the noScribe editor as well, or people will not be able to listen to the audio later.
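To illustrate the difference: `-ac 1` folds the *channels* of the one mapped track, while downmixing multiple *tracks* needs a `-filter_complex` that first feeds all audio streams into one mix. A minimal sketch of how such a filter string could be built (not noScribe code; the ffmpeg invocation in the comment is illustrative):

```python
# Sketch: build an ffmpeg filter_complex that mixes N audio tracks
# (i.e. separate audio streams of input 0) into a single stream.
def build_mono_downmix_filter(n_tracks: int) -> str:
    inputs = "".join(f"[0:a:{i}]" for i in range(n_tracks))
    return f"{inputs}amix=inputs={n_tracks}[aout]"

# For a file with three audio tracks this yields:
#   "[0:a:0][0:a:1][0:a:2]amix=inputs=3[aout]"
# used roughly as:
#   ffmpeg -i in.mp4 -filter_complex "<filter>" -map "[aout]" -ac 1 out.wav
# where -ac 1 then folds the mixed stream down to mono.
```

The track count would come from the ffprobe detection discussed above, so the same code path works for the two-track OBS case and for DAW exports with more tracks.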