Transcript Speaker Detection isn't perfect
still an issue: https://twitter.com/KrisTemmerman/status/1716507884656427469
Scott is an announcer on some of them. Likely related to the regex
If we have a way to populate the transcript data locally, I could track down these issues.
Some more details: https://github.com/syntaxfm/website/issues/1562#issue-2141201268
@themisterholliday I think I can get you a DB dump if you are still interested?
Yep I'll take a look if you can grab that 👍
Emailed ya. Some details:
Here is where we actually append the speaker names: https://github.com/syntaxfm/website/blob/14ecf7d6e60693dc333e997de2c2b4a984db9ecd/src/server/transcripts/utils.ts#L112
And here we filter the flaggings out (less of an issue) https://github.com/syntaxfm/website/blob/14ecf7d6e60693dc333e997de2c2b4a984db9ecd/src/lib/transcript/Transcript.svelte#L23
Got it 👍 I'll take a look at this and see what I can find.
So, I'll break this into three issues:
- The flags for speaker detection are sticking around in the transcript view
- Wes or Scott is missing in the entire transcript
- Scott is mislabeled as Announcer
The flags for speaker detection are sticking around in the transcript view
This can be seen here: https://syntax.fm/show/683/spooky-coding-horror-stories-2023-part-1/transcript
This happens because the transcript attributes "My name is Wes. My dog eats food on" to Wes and "the moon." to Scott, so the flag phrase is split across two speakers and the regex no longer matches it.
To fix this:
- The regex could be relaxed even further
- A "startsWith" check could be added, the same as the one on the line for Scott
- or some change could be made to the ingest of transcripts as they are saved to the DB.
I see the first two as still a little "hacky," but getting this right for all occasions seems complicated.
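To make the split-flag case concrete, here is a rough sketch of the "startsWith" idea applied across consecutive utterances. Everything in it is an assumption for illustration: the `Utterance` shape, the `stripFlag` name, and the exact flag phrase (inferred from the show 683 example above) are not taken from the repo.

```typescript
// Hypothetical utterance shape; the real schema in the repo may differ.
interface Utterance {
  speakerId: number;
  text: string;
}

// Assumed flag phrase, inferred from the show 683 example in this thread.
const FLAG = "My name is Wes. My dog eats food on the moon.";

// Normalize for comparison: lowercase, drop punctuation, collapse whitespace.
const norm = (s: string): string =>
  s.toLowerCase().replace(/[^a-z0-9\s]/g, "").replace(/\s+/g, " ").trim();

// Remove a flag phrase even when it is split across consecutive utterances.
function stripFlag(utterances: Utterance[], flag: string = FLAG): Utterance[] {
  const target = norm(flag);
  const out: Utterance[] = [];
  let i = 0;
  while (i < utterances.length) {
    let joined = "";
    let matchedEnd = -1;
    // Keep joining utterances while the joined text is still a prefix of the flag.
    for (let j = i; j < utterances.length; j++) {
      joined = norm(`${joined} ${utterances[j].text}`);
      if (!target.startsWith(joined)) break;
      if (joined === target) {
        matchedEnd = j;
        break;
      }
    }
    if (matchedEnd >= 0) {
      i = matchedEnd + 1; // skip every utterance that made up the flag
    } else {
      out.push(utterances[i]);
      i += 1;
    }
  }
  return out;
}
```

This only drops the flag when it fills whole utterances. If the provider glues the flag onto the end of a real sentence, it would still slip through, so it is still a little "hacky" in the same way as the other options.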
Wes or Scott is missing in the entire transcript
This issue is because speakers are mislabeled (probably while saving the transcript) with "99" as their speaker id. Then we filter speakers with the "99" id: https://github.com/syntaxfm/website/blob/14ecf7d6e60693dc333e997de2c2b4a984db9ecd/src/lib/transcript/Transcript.svelte#L20
If we don't filter, the speakers still have names, so they show up just fine in the recent shows.
But I'm assuming the filter was added because of an issue on some other shows, so if we have those, I can double-check the filter. In addition to removing the filter, we could check for entries with no speaker name, keep them in the transcript, and label the speaker as "unknown."
Examples: https://syntax.fm/show/726/is-htmx-a-joke/transcript
- Scott has a speakerId of 99 and is filtered out completely
https://syntax.fm/show/727/how-to-code-opinionated-typescript-stack-tooling-choices-explained/transcript
- Wes has a speakerId of 99 and is filtered out completely
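A minimal sketch of the "keep the entry and label it" idea, assuming a hypothetical `Utterance` shape with an optional `speakerName`. The `withSpeakerLabels` helper is made up for illustration and is not the code in Transcript.svelte.

```typescript
// Hypothetical shape; field names are assumptions, not the repo's schema.
interface Utterance {
  speakerId: number;
  speakerName?: string;
  text: string;
}

interface LabeledUtterance extends Utterance {
  speakerName: string;
}

// Instead of filtering on speakerId === 99, keep every utterance and
// fall back to "Unknown" only when no usable name is attached.
function withSpeakerLabels(utterances: Utterance[]): LabeledUtterance[] {
  return utterances.map((u) => ({
    ...u,
    speakerName: u.speakerName?.trim() || "Unknown",
  }));
}
```

With this, Scott in show 726 and Wes in show 727 would stay in the transcript because they still carry a name, and only truly unnamed entries would show up as "Unknown".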
Scott is mislabeled as Announcer
Can you provide the show number where we were seeing this? I can't find one, but I'm only checking a limited subset.
Sweet, thanks. The speaker ID of 99 is important, though I forget why. I'll check tomorrow.
I think all of these issues are due to the regex either being too relaxed, or not relaxed enough.
I'd have to check, but I don't think I'm saving the speaker's name in the DB, just the speaker's number. The problem with our transcript provider is they don't tell you who is 1 or 2, so we have to do that ourselves.
If I'm following correctly, the speaker name is correctly found here (and above): https://github.com/syntaxfm/website/blob/14ecf7d6e60693dc333e997de2c2b4a984db9ecd/src/server/transcripts/utils.ts#L49
Which accounts for any speaker id in conjunction with detectSpeakerNames.
Since the speaker id is saved in the DB, in show 727 your id is 99 instead of 1 or 2. (I think the announcer may be 99 in some older shows?)
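For anyone following along, here is a rough sketch of how a numeric speaker id can be mapped to a name from a self-introduction, whatever the id happens to be (1, 2, or 99). This is not the repo's `detectSpeakerNames`. The `mapSpeakerIdsToNames` helper, the "my name is" pattern, and the `Utterance` shape are all assumptions for illustration.

```typescript
// Hypothetical shape; the DB is assumed to store only the numeric speaker id.
interface Utterance {
  speakerId: number;
  text: string;
}

// Assumed introduction pattern; the real detection phrase may differ.
const INTRO = /\bmy name is ([a-z]+)/i;

// Build an id -> name map from the first introduction each speaker id gives.
function mapSpeakerIdsToNames(
  utterances: Utterance[],
  knownNames: string[],
): Map<number, string> {
  const names = new Map<number, string>();
  for (const u of utterances) {
    if (names.has(u.speakerId)) continue; // first match wins per id
    const match = INTRO.exec(u.text);
    if (!match) continue;
    const spoken = match[1].toLowerCase();
    // Only accept names we expect on the show (hosts, and possibly guests).
    const known = knownNames.find((n) => n.toLowerCase() === spoken);
    if (known) names.set(u.speakerId, known);
  }
  return names;
}
```

Because the lookup is keyed on whatever id was saved, an id of 99 resolves the same way as 1 or 2, as long as that speaker introduces themselves somewhere in the transcript.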
Ah yea I remember y'all saying the transcript provider doesn't give the speaker which is why this code is required.
The incorrect marking is fixed. Now I'd like to figure out a way to map the speaker numbers to show guests.