
Automatically detect non-dialog audio and display it as short text info

Open cutegitcat opened this issue 6 months ago • 7 comments

Many videos, fictional and non-fictional, use non-speech sounds and noises – such as footsteps, doors, animals, music, traffic, or explosions – as integral elements of the storytelling. For deaf people like me, this kind of information is essential to understanding the context of a scene, but it is often missing from available subtitles/captions. Automatic ambient sound detection would significantly improve media accessibility for many deaf and hard-of-hearing users. Currently, however, most tools do not detect or transcribe such sounds accurately or reliably, if at all. There are toolkits available, though, that might help with this.

Two projects I found:
• YAMNet – a sound classification model that can run locally: https://www.tensorflow.org/hub/tutorials/yamnet
• Vosk – offline speech recognition, works without internet: https://github.com/alphacep/vosk-api
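For reference, the Vosk documentation suggests that a basic offline recognition pass is only a few lines of Python. Here is a rough sketch of what that looks like (untested on my side; the model directory, file name, and chunk size are placeholders):

```python
# Rough, untested sketch of offline speech recognition with Vosk.
# Assumes: pip install vosk, a model downloaded from the Vosk model list,
# and a 16 kHz mono 16-bit WAV file. Paths below are placeholders.
import json
import wave

from vosk import KaldiRecognizer, Model

wf = wave.open("episode.wav", "rb")
model = Model("path/to/vosk-model-small-en-us")  # placeholder model directory
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)  # request word-level timestamps

while True:
    data = wf.readframes(4000)  # arbitrary chunk size
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        # Each result is JSON with the recognized text and per-word timings.
        print(json.loads(rec.Result()))

print(json.loads(rec.FinalResult()))
```

Vosk only covers speech, though; for the non-speech sounds I described above, something like YAMNet would still be needed on top.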

It would be great if the functionality of such a toolkit could be integrated into your tool, so that the subtitles/captions it creates reliably include not only dialogue but also noises and other non-speech sound elements. I am not a developer and cannot judge the feasibility of this idea. Please understand my feature request as an enhancement to your already great product, which already helps us in many ways to better access the world of sounds and speech in media.

Thank you to the development team for all their hard work so far, and best of luck with future improvements!

cutegitcat avatar Jun 27 '25 18:06 cutegitcat

I'll chime in to provide an evaluation. I'm no expert in audio classification algorithms, so take my word with a grain of salt.

Unlike Whisper (a breakthrough mostly owed to the data-agnostic learning of the Transformer architecture), which has evolved into a readily available open-source component, transcribing non-speech sound is a task with no viable off-the-shelf product. The most recent progress is an academic dataset project called Nonspeech7k, which covers only human non-speech sounds and serves only as a benchmark. The other project you mentioned, for general audio/sound classification, does exactly that – classification, i.e. tagging ranged clips with predefined classes. It cannot produce precise timestamps without prior timing work, and it provides hardly any semantic information.
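To make that concrete, here is a rough, untested sketch of what a YAMNet pass actually returns: frame-level scores over the 521 predefined AudioSet classes, on a coarse hop of roughly half a second. The file name, hop value, and threshold below are illustrative, not verified:

```python
# Rough, untested sketch of a YAMNet classification pass (adapted from the
# TF Hub tutorial). File name, threshold, and hop length are illustrative.
import csv

import numpy as np
import soundfile as sf
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/yamnet/1")

# The model ships a CSV mapping score columns to AudioSet class names.
class_map_path = model.class_map_path().numpy().decode("utf-8")
with open(class_map_path) as f:
    class_names = [row["display_name"] for row in csv.DictReader(f)]

# YAMNet expects mono float32 audio at 16 kHz; resample/downmix beforehand.
waveform, sample_rate = sf.read("scene.wav", dtype="float32")
assert sample_rate == 16000

scores, embeddings, spectrogram = model(waveform)
scores = scores.numpy()   # shape: (num_frames, 521)
frame_hop_s = 0.48        # approximate hop between score frames

for i, frame in enumerate(scores):
    top = int(np.argmax(frame))
    if frame[top] > 0.3:  # arbitrary confidence threshold
        print(f"{i * frame_hop_s:6.2f}s  {class_names[top]}  ({frame[top]:.2f})")
```

Even in the best case this yields coarse tags like "Music" or "Dog", not well-timed, readable event captions – turning these frames into usable subtitle lines is exactly the part that has no off-the-shelf solution.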

tl;dr: An infeasible task currently even for upstream researchers.


Furthermore, an important distinction should be made (writing this only as my personal opinion): a "subtitle editor" is not necessarily a "subtitle maker". Many of Aegisub's features are geared toward power users manually producing graphically advanced subtitle work, not toward simplified one-click subbing for a general audience. Thus, to quote arch's words:

adding something like Whisper would be a lot of work for something that would also work fairly well as an external tool.

Accessibility is indeed important for giving disabled users equal access to media content, and that's why I kindly suggest the author pursue this as a standalone transcription product. That would be a much more useful component for other software to incorporate as well – just like what Whisper itself was initially developed as.

EL-File4138 avatar Jun 27 '25 21:06 EL-File4138

Hello, dear EL-File4138,

Thank you very much for the tip – it’s really interesting to learn more about Nonspeech7k! I’m not a researcher myself either, but perhaps someone in the community will be able to compare the different tools and suggest the most suitable one for integration.

Apart from OpenAI’s standard Whisper – and unfortunately also Faster-Whisper, which tends to be unreliable in many cases – most tools focus only on spoken language. Tools such as Nonspeech7k, YAMNet, or possibly even better alternatives could be a valuable addition. They help close information gaps by transcribing acoustic signals such as footsteps, door noises, animals, music, traffic, or explosions – anything that provides important context for people who are deaf or hard of hearing.

The same applies here: most subtitles/captions today are generated automatically and only checked for accuracy at the end. This saves a lot of time and allows large amounts of media content to be made accessible more efficiently.

In today’s digital and AI-driven world, hardly anyone can create subtitle/caption files manually from scratch. Automatic tools with integrated recognition of speech and sounds make the work much easier. A final quality check – only for content and spelling – is often enough. This not only saves time, but also improves accessibility for everyone.

cutegitcat avatar Jun 30 '25 09:06 cutegitcat

I don't think you're reading what I'm saying correctly. What I'm saying is that these tools are far from usable, and thus not up to the standard required for inclusion. Your arguments are reasonable in principle, but in practice they don't work. We don't care whether you're a developer or not; if you do want to see this functionality included, it is your responsibility to first show us a competent component (whether written by you or commissioned by you) that can be integrated with ease.

Furthermore, we still have a thriving fan-subbing community manually producing high-quality subtitles for many productions and works, which is largely why this project remains in focus and is still being revived even after years of hiatus. "hardly anyone can create subtitle/caption files manually from scratch", even when dressed up in grand framing like the "AI-driven age", sounds to me like quite a disrespect to that community's hard work.

EL-File4138 avatar Jun 30 '25 11:06 EL-File4138

Feel free to create an automation script like this one that uses Whisper: https://github.com/Ghegghe/aegis-lua-scripts
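If you prefer a one-off external run instead of an automation script, here is a rough, untested Python sketch of the same idea (assuming the openai-whisper package and ffmpeg are installed; model size and file names are placeholders). The resulting .srt can then be opened in Aegisub like any other subtitle file:

```python
# Rough sketch: run Whisper offline and write an SRT that Aegisub can open.
# Assumes: pip install openai-whisper, and ffmpeg available on PATH.
import whisper


def srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(t * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


model = whisper.load_model("small")        # model size is a placeholder
result = model.transcribe("episode.wav")   # returns {"text": ..., "segments": [...]}

with open("episode.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(seg["text"].strip() + "\n\n")
```

Note that this produces dialogue only; the non-speech events requested in this issue would still have to come from a separate detection pass or be added by hand.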

bebetoalves avatar Jun 30 '25 12:06 bebetoalves

I don't think it's an Aegisub problem; it's an audio-to-text problem. Could you ask the Whisper team instead? Thanks.

amanosatosi avatar Jul 17 '25 15:07 amanosatosi

Hello, dear @bebetoalves. Thank you very much for developing this additional feature! Unfortunately, I was unable to use it successfully on my Windows 11 computer with the portable version. It's a shame – the solution is too complicated for me at the moment, and it doesn't offer the functionality I need. I really value your work, and I hope we will find an even better solution in the future.

Dear @amanosatosi, unfortunately Aegisub does not have an integrated audio-to-text function, which many people would find useful. It would save you from having to listen to something and type it out manually every time, which is very time-consuming. There are two commonly used variants: OpenAI's Whisper and the separate Faster-Whisper reimplementation, and in my experience Faster-Whisper does not transcribe this kind of audio information. This is disadvantageous for people who are deaf or hard of hearing. Hopefully, one day there will be a simple solution that everyone is happy with.

cutegitcat avatar Jul 29 '25 09:07 cutegitcat

@cutegitcat detecting audio was never Aegisub's job. It is used for making subtitles – all sorts and kinds of subtitles. Whisper's job is to make subtitle data from audio. That's why you should send your request to the Whisper team.

amanosatosi avatar Sep 27 '25 16:09 amanosatosi