cdp-backend Speaker classification

Feature Description

Backend issue for the relevant roadmap issue

Adding speaker classification to CDP transcripts. This could be through a script/class that retroactively attaches the speaker name to a transcript that already has speaker diarization enabled. Prodigy can be used for annotating the training data.

Use Case

With speaker classification we can provide transcripts annotated with each speaker's name. The annotation step could be run in several ways, such as through a script or a GitHub Action.

Solution

Very high level idea would be to:

Use GCP's built-in speaker diarization to separate the speakers. We could also create our own audio classification model. We could also use something like Prodigy to annotate the data, but I believe they have their own diarization/transcription models as well.
Figure out how to add the classified speaker names to the diarized transcript. I'm not sure if GCP allows you to provide any training data; from what I could tell, their models only separate the speakers and don't take in training data to label them.

A bigger picture breakdown of all the major components can be found on the roadmap issue under "Major components".

Nov 05 '21 01:11 isaacna

@JacksonMaxfield feel free to add any thoughts/ideas!

Nov 05 '21 01:11 isaacna

I think the most difficult part will be attaching identities to transcripts that are already speaker differentiated

Nov 05 '21 02:11 isaacna

Use GCP's built-in speaker diarization to separate the speakers. We could also create our own audio classification model. We could also use something like Prodigy to annotate the data, but I believe they have their own diarization/transcription models as well.

We can't use speaker diarization. It has a limit of 8 speakers because it uses some clustering under the hood or they are just arbitrarily imposing some limit.

The best method in my mind would be to fine tune a speech classification model with our own labels (gonzalez, sawant, mosqueda, etc).

https://huggingface.co/superb/wav2vec2-base-superb-sid

Here is a decent notebook on fine tuning and using the hugging face / transformers API for fine tuning an existing transformer: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/audio_classification.ipynb

The idea in my head is:

collect ~50 clips of each council member speaking
to create a dataset, create random slices of each clip that are say 5 seconds in length each
This should blow up the dataset size: 50 clips * number of council members * number of random 5 second spans per clip

(I am arbitrarily choosing 5 seconds here because I don't know how long of an audio clip these pretrained audio classifiers allow in / how much memory is needed to train -- if we can fit a 30 second audio clip in then just do that.... and replace any reference I make to 5 seconds with 30 seconds -- find the audio clip size that fits in memory and performs well)
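The random slicing step could be sketched roughly as below. This only computes (start, end) windows in seconds and leaves the actual audio cutting to whatever library ends up being used; the function name `random_spans` and its parameters are made up for illustration, and the 5 second default is the same arbitrary placeholder as above.

```python
import random
from typing import List, Tuple


def random_spans(
    clip_duration: float,
    span_length: float = 5.0,
    n_spans: int = 10,
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Sample n_spans random (start, end) windows, in seconds, from one clip."""
    # A clip shorter than the window is used whole rather than dropped.
    if clip_duration <= span_length:
        return [(0.0, clip_duration)]

    # Seeded generator so the same dataset can be rebuilt later.
    rng = random.Random(seed)
    latest_start = clip_duration - span_length
    spans = []
    for _ in range(n_spans):
        start = rng.uniform(0.0, latest_start)
        spans.append((start, start + span_length))
    return spans
```

Windows are allowed to overlap here, which is what makes the number of training examples grow beyond the number of collected clips.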

evaluate the model. until we have a workable model that accurately predicts which council member is speaking at any given time (say 98%+ of the time??), we don't really care about how it's applied.

if we don't hit the mark we may need more and more varied data.

BUT.

if we were to talk about how to apply this model into production I would say:

take a transcript, loop over each sentence in the transcript, if the sentence timespan is under 5 seconds, just run it through the trained model.

if the sentence timespan is over 5 seconds, chunk the sentence into multiple 5 second spans (or close to that), then predict all chunks and return the most common value predicted from the group as the sentence speaker, which if our sentence splits are accurate (which they generally are from what I have seen) should just be acting as a safety measure.
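A minimal sketch of that chunk-and-vote idea, assuming the trained model is wrapped in some callable that takes a (start, end) span in seconds and returns a speaker label. The names `chunk_span`, `predict_sentence_speaker`, and `predict_chunk` are hypothetical and not part of cdp-backend.

```python
from collections import Counter
from typing import Callable, List, Tuple


def chunk_span(
    start: float, end: float, chunk_length: float = 5.0
) -> List[Tuple[float, float]]:
    """Split [start, end] into consecutive chunks of at most chunk_length seconds."""
    chunks = []
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + chunk_length, end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks


def predict_sentence_speaker(
    start: float,
    end: float,
    predict_chunk: Callable[[float, float], str],
    chunk_length: float = 5.0,
) -> str:
    """Predict one speaker for a sentence, voting across chunks for long sentences."""
    # Short sentences go through the model in one piece.
    if end - start <= chunk_length:
        return predict_chunk(start, end)

    # Long sentences are chunked and the most common prediction wins.
    # Ties resolve to whichever label was seen first.
    votes = Counter(predict_chunk(s, e) for s, e in chunk_span(start, end, chunk_length))
    return votes.most_common(1)[0][0]
```

The last chunk of a sentence can be much shorter than the others; whether to drop it or merge it into the previous chunk is left open here.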

thoughts?

Nov 05 '21 05:11 evamaxfield

We can't use speaker diarization. It has a limit of 8 speakers because it uses some clustering under the hood or they are just arbitrarily imposing some limit.

From the code example here they set the speakers to 10 so it may be higher? But if there is any limit at all then yeah I agree it's probably just better to create our own model

Here is a decent notebook on fine tuning and using the hugging face / transformers API for fine tuning an existing transformer: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/audio_classification.ipynb

This seems like a really good resource! Just from skimming it seems pretty detailed and straightforward.

if the sentence timespan is over 5 seconds, chunk the sentence into multiple 5 second spans (or close to that), then predict all chunks and return the most common value predicted from the group as the sentence speaker, which if our sentence splits are accurate (which they generally are from what I have seen) should just be acting as a safety measure.

I'm assuming there's some library that makes it easy to split the audio file into chunks based on timestamp? Also, I do think this puts faith in sentence split accuracy. One thing we could consider is also enabling GCP speaker diarization and using that to cross check the sentence splits? We may not need it, and we wouldn't be using the diarization for identification necessarily, but it could help make the audio clips we feed to the model higher quality.
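For what it's worth, cutting by timestamp is possible even with only the Python standard library, as long as the audio is uncompressed WAV. The sketch below assumes WAV input (mp3 or other formats would need a third-party library or a conversion step first), and `cut_wav` is a hypothetical helper name.

```python
import wave


def cut_wav(src_path: str, dst_path: str, start: float, end: float) -> None:
    """Copy the audio between start and end (in seconds) into a new WAV file."""
    with wave.open(src_path, "rb") as src:
        frame_rate = src.getframerate()
        start_frame = int(start * frame_rate)
        n_frames = max(int(end * frame_rate) - start_frame, 0)
        # Clamp the seek position so a start past the end yields an empty clip.
        src.setpos(min(start_frame, src.getnframes()))
        frames = src.readframes(n_frames)
        params = src.getparams()

    with wave.open(dst_path, "wb") as dst:
        # Same channels, sample width, and frame rate as the source;
        # the frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
```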

Nov 05 '21 07:11 isaacna

Also once we find the speaker are we writing it to the existing transcript as well (I see that our current transcript has speaker_name as null from the diarization we opt out of)?

Or did you only want to create a separate output that stores a speaker to clip relation like you mention in the roadmap issue?

Nov 05 '21 07:11 isaacna

From the code example here they set the speakers to 10 so it may be higher? But if there is any limit at all then yeah I agree it's probably just better to create our own model

My main concern is that using Google will cost way more than just training and applying our own model that we can apply during a GH action.

I'm assuming there's some library that makes it easy to split the audio file into chunks based on timestamp? Also, I do think this puts faith in sentence split accuracy. One thing we could consider is also enabling GCP speaker diarization and using that to cross check the sentence splits? We may not need it, and we wouldn't be using the diarization for identification necessarily, but it could help make the audio clips we feed to the model higher quality.

I agree that it puts faith into our sentence split accuracy. I think the interesting test will be to see how many sentences longer than 5 seconds, once chunked and predicted chunk by chunk, end up classified as multiple people within the same sentence.
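That test could be as small as the sketch below, assuming the per-chunk predictions have already been collected into a dict keyed by some sentence identifier. The function name and the input shape are assumptions for illustration.

```python
from typing import Dict, List


def split_disagreement_rate(chunk_predictions: Dict[str, List[str]]) -> float:
    """Fraction of chunked sentences whose chunks were assigned more than one speaker."""
    # Only sentences that were actually chunked (more than one prediction) count.
    chunked = [preds for preds in chunk_predictions.values() if len(preds) > 1]
    if not chunked:
        return 0.0
    mixed = sum(1 for preds in chunked if len(set(preds)) > 1)
    return mixed / len(chunked)
```

A high rate would suggest either that the sentence splits cross speaker boundaries or that the model is unstable on short chunks; this number alone does not distinguish the two.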

Also once we find the speaker are we writing it to the existing transcript as well (I see that our current transcript has speaker_name as null from the diarization we opt out of)?

I would store it to the transcript at speaker_name but again, this is all "application" after we evaluate whatever model we have.

Nov 05 '21 14:11 evamaxfield

My main concern is that using Google will cost way more than just training and applying our own model that we can apply during a GH action.

That makes sense, we should try to keep the cost as low as possible

I think the interesting test will be to see how many sentences longer than 5 seconds, once chunked and predicted chunk by chunk, end up classified as multiple people within the same sentence.

Yeah, we could do some testing and adjust the time interval too if 5 seconds doesn't seem to be accurate

Nov 06 '21 02:11 isaacna

Adding more here.

Don't worry about it now but just think about it:

I think the problem will become: "how little training data can we use to fine tune a pre-existing model to get 98% accuracy or better" and "how can we create this model fine tuning system in the cookiecutter"

In my head, the cookiecutter should have a folder in the Python dir called something like "models", with a subdir in there for each custom model we have. So:

instance-name/
    web/
    python/
        models/
            speaker-classification/
                data/
                    {person-id-from-db}/
                        0.mp3
                        1.mp3
                    {person-id-from-db}/
                        0.mp3
                        1.mp3

Then we have a CRON job that, on push, trains from this data? But this depends on how much data is needed. If a ton of data is needed for training then I don't think we should push all of these mp3s into the repo.

We could construct a config file in the speaker-classification folder. I.e.

{
    "{person-id-from-db}": {
        "{some-meeting-id}": [
            {"start-time": 0.0, "end-time": 3.2},
            {"start-time": 143.6, "end-time": 147.3}
        ],
        "{different-meeting-id}": [
            {"start-time": 1.4, "end-time": 3.2},
            {"start-time": 578.9, "end-time": 580.0}
        ]
    },
    "{different-person-id-from-db}": {
        "{some-meeting-id}": [
            {"start-time": 44.3, "end-time": 47.1},
            {"start-time": 222.5, "end-time": 227.2}
        ],
        "{different-meeting-id}": [
            {"start-time": 12.3, "end-time": 14.5},
            {"start-time": 781.1, "end-time": 784.2}
        ]
    }
}

And the training process pulls the audio files, finds those snippets, cuts them, then trains from them.
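As a rough sketch of the first half of that process, the config could be flattened into one record per snippet before any audio is pulled. The `Snippet` type and `iter_snippets` function are hypothetical names; only the key names ("start-time", "end-time") and the person / meeting nesting come from the example config.

```python
import json
from typing import Iterator, NamedTuple


class Snippet(NamedTuple):
    person_id: str
    meeting_id: str
    start_time: float
    end_time: float


def iter_snippets(config_text: str) -> Iterator[Snippet]:
    """Yield one Snippet per annotated time span in the config."""
    config = json.loads(config_text)
    for person_id, meetings in config.items():
        for meeting_id, spans in meetings.items():
            for span in spans:
                start, end = span["start-time"], span["end-time"]
                # Catch swapped or empty spans before any audio is cut.
                if end <= start:
                    raise ValueError(
                        f"bad span for {person_id} in {meeting_id}: {start} to {end}"
                    )
                yield Snippet(person_id, meeting_id, start, end)
```

Grouping the resulting snippets by meeting would let the training process download each meeting's audio once and cut every snippet from it.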

But what I am getting at is that while we can figure out and fine tune a model all we want, figuring out how to deploy this system to each instance will be interesting.

Nov 08 '21 18:11 evamaxfield

Additionally, maybe we store the "audio classification" accuracy metric in the database in the metrics collection / metadata collection that we discussed in #117.

So that we can have a page of stats on each instance, e.g. /models, and see the accuracy of all of our models for that instance?

Nov 08 '21 18:11 evamaxfield

Still.... just try to get any fine tuning of the pre-trained model working and evaluated. Application doesn't matter if the system just doesn't work in the first place.

Nov 08 '21 18:11 evamaxfield

Going to update this issue because there has been a lot of progress on this and I want to start getting ideas about how to move to production.

I am going to keep working on the model, definitely need to annotate a few more meetings to get a more diverse dataset for training and eval. But other things we need to do / consider.

As @tohuynh brought up: "how do we train these models for people?" I think I have briefly mentioned that I think one method for doing this will be to leave directions in the "admin-docs" section of an instances repo: https://github.com/CouncilDataProject/seattle-staging/tree/main/admin-docs

Those directions would explain how to store the gecko annotation files in some directory in the repo. There needs to be a github action that the maintainer can manually kick off to run a "pre-dataset eval", then if the dataset passes pre-eval, training, then after training, storage of the model.

What I mean by pre-eval is basically answering @tohuynh's comments regarding:

is there simply enough data
when we split into train, test, and validation sets are there enough meetings to actually create a good dataset split for training and evaluation
when we split into train, test, and validation sets, does each set have a good balance of each speaker
all the speaker names / labels provided must be the same as found in the database. Not all names in the database need to be in the annotated dataset but the annotated dataset labels must all be in the database.
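A hedged sketch of what such a pre-eval could look like, covering the data volume, per-split balance, and label checks (the "enough meetings" check is left out because it depends on how the splits get built). The function name, the input shape, and the `min_examples_per_speaker` default are all placeholders, not values anyone has settled on.

```python
from collections import Counter
from typing import Dict, List, Set


def pre_dataset_eval(
    splits: Dict[str, List[str]],
    known_names: Set[str],
    min_examples_per_speaker: int = 10,
) -> List[str]:
    """Return a list of problems found; an empty list means the dataset passes.

    splits maps a split name ("train", "test", ...) to the speaker label of
    each example in that split. known_names is every name in the database.
    """
    problems = []
    all_labels = {label for labels in splits.values() for label in labels}

    # Every annotated label must exist in the database (not the other way around).
    unknown = sorted(all_labels - known_names)
    if unknown:
        problems.append(f"labels not found in database: {unknown}")

    # Each split should contain every annotated speaker, with enough examples of each.
    for split_name, labels in splits.items():
        counts = Counter(labels)
        for speaker in sorted(all_labels):
            if counts[speaker] < min_examples_per_speaker:
                problems.append(
                    f"{split_name}: only {counts[speaker]} examples of {speaker}"
                )
    return problems
```

Returning messages rather than a boolean would let the GitHub Action print exactly what the maintainer needs to fix.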

if all of those questions "pass" the github action can move onto training a model. I am not sure if we can train a model using the 6 hour CPU time given to us by GitHub actions, especially with loading the model, loading the audio files into memory, storing the model, and all that stuff, so, to speed up training and actually make it possible, we may want to figure out how to kick off this training on google cloud.

Once the model is trained, store it to the instances file store, but also report the evaluation as a part of the github action logs so the maintainer doesn't need to check some other resource for that info.

If the model meets some accuracy threshold we can automatically flag it as "production" maybe and then store the model info in the metadata collection on firestore to use during the event gather pipeline?

Also, in my function to apply the trained model across a transcript, I already have a "threshold" value that says the predicted label for each sentence must be at least 98.5% confidence. If it's below that threshold we simply store speaker_name as None. I may drop this to just 98% but we can argue about that :shrug:
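The thresholding rule on its own is simple enough to sketch, assuming the model returns a score per label for each sentence. This is an illustration of the rule as described, not the actual function; `label_or_none` is a made-up name.

```python
from typing import Dict, Optional


def label_or_none(scores: Dict[str, float], threshold: float = 0.985) -> Optional[str]:
    """Return the top-scoring label, or None when its score is below the threshold."""
    if not scores:
        return None
    best_label = max(scores, key=scores.get)
    return best_label if scores[best_label] >= threshold else None
```

Lowering the threshold trades fewer unlabeled sentences for more wrong labels, so the choice between 98% and 98.5% could be made by looking at that trade-off on the validation set.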

Transcript format should also have an annotation added that stores the speaker classification model metadata. Speakerbox version, validation accuracy and maybe each sentence should have a "speaker_confidence" stored as well :shrug:
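One possible shape for those additions, written as standalone dataclasses. These are hypothetical and are not the existing cdp-backend transcript model; the class names and the field names other than `speaker_name` and `speaker_confidence` are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SpeakerClassificationAnnotation:
    """Transcript-level metadata about the model that labeled the speakers."""

    speakerbox_version: str
    validation_accuracy: float


@dataclass
class SentenceSpeaker:
    """Per-sentence speaker fields; both stay None when the model is not confident."""

    speaker_name: Optional[str] = None
    speaker_confidence: Optional[float] = None


# Placeholder values only, to show how the annotation would serialize.
annotation = SpeakerClassificationAnnotation(
    speakerbox_version="0.0.0",
    validation_accuracy=0.0,
)
print(asdict(annotation))
```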

Some frontend thoughts: For transcript format, I would personally like to store the speaker_name as the person's actual name, not the speaker id. Is that okay @tohuynh @BrianL3? On the frontend I know we have the person picture but it would also be good to show their name imo.

Similarly, I am wondering if it's possible to have like a "report incorrect speaker label" button or something? Not sure how that would work.... but something to consider. Because of this, it ties into: https://github.com/CouncilDataProject/cdp-frontend/issues/141

(related to above, I am somewhat missing my "show dev stats" button that was available way back on v1 of CDP... was useful for showing like transcript confidence and now would be useful for show each sentence speaker confidence and so on haha)

Feb 07 '22 03:02 evamaxfield

is there simply enough data

If the required number of training examples is too large, we can try audio augmentation by slightly changing the audio while preserving the speaker label in order to create new training examples from collected examples. Sorta like changing the lighting on an image to create more images. Not sure yet about what transformations could be done to audio.

For transcript format, I would personally like to store the speaker_name as the person's actual name, not the speaker id. Is that okay @tohuynh @BrianL3? On the frontend I know we have the person picture but it would also be good to show their name IMO.

It would be nice to have all three - id, name, picture. We need the id, if we want to link the person (who is also in the DB) to the person page.

Feb 07 '22 18:02 tohuynh

Another thing to add to the documentation is that the training examples should be typical and/or representative examples (like all members present, typical recording environment), which is obvious to us, but just in case.

Feb 07 '22 18:02 tohuynh

If the required number of training examples is too large, we can try audio augmentation by slightly changing the audio while preserving the speaker label in order to create new training examples from collected examples. Sorta like changing the lighting on an image to create more images. Not sure yet about what transformations could be done to audio.

Yep. I think this can easily be done during the training process. Most easily I can just increase the chunking. Having different lengths of chunks that get padded or truncated, etc.
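A toy sketch of that idea on raw sample lists: the same audio is cut at several chunk lengths, then every chunk is padded or truncated to one fixed length so the model sees a consistent input size. Both function names are hypothetical, and a real training pipeline would more likely do this on arrays through the feature extractor rather than on Python lists.

```python
from typing import List


def pad_or_truncate(
    samples: List[float], target_length: int, pad_value: float = 0.0
) -> List[float]:
    """Bring a chunk to exactly target_length samples."""
    if len(samples) >= target_length:
        return samples[:target_length]
    return samples + [pad_value] * (target_length - len(samples))


def variable_length_chunks(
    samples: List[float], chunk_lengths: List[int], target_length: int
) -> List[List[float]]:
    """Cut the same audio at several chunk lengths to create extra examples."""
    examples = []
    for length in chunk_lengths:
        for start in range(0, len(samples), length):
            chunk = samples[start:start + length]
            examples.append(pad_or_truncate(chunk, target_length))
    return examples
```

Every example keeps the speaker label of the audio it was cut from, so the label is preserved while the number of training examples goes up.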

It would be nice to have all three - id, name, picture. We need the id, if we want to link the person (who is also in the DB) to the person page.

I can add a field for the speaker_id to the transcript model but i feel like adding a field for the speaker_image_uri is a bit overkill imo.

Feb 07 '22 18:02 evamaxfield

That's fine if we don't want the person's picture next to their name. It would save some time to have the URL right away instead of having to query for the person, populate the file_ref, download the URL from gs URI to display the image.

Feb 07 '22 18:02 tohuynh

That's fine if we don't want the person's picture next to their name. It would save some time to have the URL right away instead of having to query for the person, populate the file_ref, download the URL from gs URI to display the image.

Yea I don't think we need the person's photo. But can we test out the load times when we are ready to test this? A part of me says that because we are already pulling person info as a part of the voting record for the meeting it shouldn't be adding too much time, but any more time on top of what it already is could be a problem.

Feb 07 '22 18:02 evamaxfield

The load times, if the renderable URL is stored, would be the load times of fetching all council members' photos (for that transcript). And if the council member speaks again later in the transcript, the browser would use the cache.

The load times, if the renderable URL is not stored, would be too long.

I agree that a person's photo is not absolutely necessary (just makes it look a little nicer). Edit: I'm OK with not displaying the photo

Feb 07 '22 18:02 tohuynh

Coming back to this to leave notes of what I plan on doing in the coming weeks:

[ ] Function to move from diarized audio dirs to gecko JSON (allows users to make a training set with the diarization process then only store the JSON instead of all the chunks)
[ ] Function in cdp-backend to check the data, kick off a training pipeline, and store the model -- i am thinking these will be tied to a github issue. similar to our deployment bots, the repository can have a "Train model" issue that takes some parameters, does the work, then posts the results to the issue.
- note: checking data means that all people in the training set are in the CDP database?? or is it fine if they aren't?? (Research wise it may be nice to not care about if they are in the database or not... i am tempted to have a "speaker_name" and "speaker_id" field in the transcript annotation model to better track this, speaker_id would be optional.)
[ ] GitHub Action for training and storage
[ ] GitHub Action for application to previous transcripts / backfilling
[ ] Integrate into main cdp-backend event gather pipeline

Jun 22 '22 20:06 evamaxfield