Fixed TTS queuing mechanism and volume override resets
Fixes https://github.com/home-assistant/android/issues/4390.
Issues
TTS static reference
The TTS queue originally failed because of a single, statically stored reference to the TextToSpeech instance. Each new speakText call created a new instance, overriding the previous reference:
https://github.com/home-assistant/android/blob/50506f06ca6130b40c61bcf198302529388a9fae/common/src/main/java/io/homeassistant/companion/android/common/util/TextToSpeech.kt#L24-L40
Each utterance attempted to clean up after itself in UtteranceProgressListener#onDone, but it used the static reference to the TextToSpeech instance to do so. This caused it to shut down whatever was last queued:
https://github.com/home-assistant/android/blob/50506f06ca6130b40c61bcf198302529388a9fae/common/src/main/java/io/homeassistant/companion/android/common/util/TextToSpeech.kt#L53-L55
If more than one utterance was dispatched, each finished utterance would clear the last queued utterance before it could play. Any utterance that did finish playing couldn't clean up after itself because it had no reference to do so.
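The failure mode can be reproduced without Android at all. The following is a minimal sketch, not the original code: FakeTts and StaticTtsHolder are hypothetical stand-ins for the TextToSpeech instance and the static holder, and the listener callback is reduced to a plain onDone() call.

```kotlin
// Hypothetical stand-in for android.speech.tts.TextToSpeech; not the real API.
class FakeTts(val text: String) {
    var isShutDown = false
        private set

    fun shutdown() {
        isShutDown = true
    }
}

// Mirrors the problematic pattern: one shared, statically stored reference.
object StaticTtsHolder {
    var textToSpeech: FakeTts? = null

    fun speakText(text: String): FakeTts {
        val instance = FakeTts(text)
        textToSpeech = instance // overrides the reference to any earlier instance
        return instance
    }

    // Called when any utterance finishes; it can only see the latest instance.
    fun onDone() {
        textToSpeech?.shutdown()
        textToSpeech = null
    }
}

// Queues two utterances, then lets the first one finish.
// Returns (first.isShutDown, second.isShutDown).
fun simulateTwoUtterances(): Pair<Boolean, Boolean> {
    val first = StaticTtsHolder.speakText("first")
    val second = StaticTtsHolder.speakText("second")
    StaticTtsHolder.onDone() // the first utterance finishes playing
    return first.isShutDown to second.isShutDown
}
```

In this sketch the first instance is never shut down, while the second one is shut down before it gets a chance to play, which matches the behavior described above.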
Maximized volume reset
While fixing the reference issue, I noticed that the feature which maximizes the alarm stream volume (when alarm_stream_max is present in the data) reads the current volume level at queuing time and uses it as the reset volume level when the utterance finishes. If an alarm_stream_max utterance was queued while another was playing, the queued one would store the volume level already maximized by the previous utterance as its reset volume value:
https://github.com/home-assistant/android/blob/50506f06ca6130b40c61bcf198302529388a9fae/common/src/main/java/io/homeassistant/companion/android/common/util/TextToSpeech.kt#L33
This resulted in the alarm stream volume being permanently set to the maximum value.
Solution
To resolve the issues mentioned above, I am introducing a new TextToSpeechClient class. This class is responsible for queuing utterances and dispatching them at the correct time to a TextToSpeechEngine, which is solely responsible for playing the utterance. By default, I am providing the AndroidTextToSpeechEngine class, which uses Android's TextToSpeech to synthesize and play back text. This abstraction makes the code easier to test and opens the door to introducing other TTS engines in the future.
PlantUML source
@startuml
interface TextToSpeechEngine {
+ suspend initialize(): Result<Unit>
+ suspend play(Utterance): Result<Unit>
+ release()
}
note top: Base interface for engines that can synthesize and play back a message.
class AndroidTextToSpeechEngine {
- val initMutex: Mutex
- var textToSpeech: TextToSpeech?
- var lastVolumeOverridingUtterance: Utterance?
+ <<Create>> AndroidTextToSpeechEngine(appContext: Context)
+ suspend initialize(): Result<Unit>
+ suspend play(Utterance): Result<Unit>
+ release()
}
note bottom: Default implementation of a TTS engine that uses the platform's built-in `TextToSpeech` engine.
note right of AndroidTextToSpeechEngine::initMutex
Mutex is locked during initialization to ensure that only one init call is running at a time
and that subsequent #play() function calls also wait for init's result.
end note
note right of AndroidTextToSpeechEngine::lastVolumeOverridingUtterance
Stored reference of the last volume overriding utterance
so that it can be used to reset volume back in case of the playback being interrupted.
end note
class Utterance {
+ val id: String
+ val text: String
+ val streamVolumeAdjustment: StreamVolumeAdjustment
+ val audioAttributes: AudioAttributes
}
note top: Data model that contains the text to be synthesized,\naudio stream attributes that should be used,\nand a helper to adjust volume before and after playback.
class StreamVolumeAdjustment {
+ overrideVolume()
+ resetVolume()
}
note left: Utility object that interacts with AudioManager to override audio channel's volume.
class StreamVolumeAdjustment$None {
+ overrideVolume()
+ resetVolume()
}
note bottom: Implementation of the adjustment that doesn't produce any side effects.
class StreamVolumeAdjustment$Maximize {
- val maxVolume: Int
- var resetVolume: Int?
+ <<Create>> Maximize(audioManager: AudioManager, audioStreamId: Int)
+ overrideVolume()
+ resetVolume()
}
note left of StreamVolumeAdjustment$Maximize::resetVolume
Queried and stored right before #overrideVolume is called and used in a following #resetVolume call.
end note
StreamVolumeAdjustment <|-- StreamVolumeAdjustment$None: extends
StreamVolumeAdjustment <|-- StreamVolumeAdjustment$Maximize: extends
TextToSpeechEngine <|.. AndroidTextToSpeechEngine: implements
class TextToSpeechClient {
- val utteranceQueue: ArrayDeque<Utterance>
+ <<Create>> TextToSpeechClient(appContext: Context, engine:TextToSpeechEngine)
+ speakText(data: Map<String, String>)
+ stopTTS()
- play()
- handleError(message: String)
}
note top: Entry point for speech synthesis and playback, maintains a FIFO queue of utterances.
note left of TextToSpeechClient::speakText
Queues utterances and starts playback, if not yet started.
end note
note left of TextToSpeechClient::stopTTS
Clears the queue and stops the playback.
end note
note left of TextToSpeechClient::play
Initializes TTS engine, pops an utterance from the queue, and plays it.
Once the playback is finished, tries popping another utterance.
If the queue is empty, releases the engine.
end note
TextToSpeechEngine::play --> Utterance: uses
Utterance::streamVolumeAdjustment --> StreamVolumeAdjustment: uses
TextToSpeechClient::TextToSpeechClient --> TextToSpeechEngine: uses
@enduml
With tighter control over the queuing mechanism and visibility into when each message starts and finishes playing, the engine reads the current volume level right before playback begins. This ensures it always stores the correct reset volume level for when playback ends.
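As a rough illustration of reading the volume right before playback, here is a simplified sketch of the StreamVolumeAdjustment idea from the diagram. It is not the PR's implementation: VolumeControl and FakeVolumeControl are hypothetical replacements for AudioManager so that the example runs outside Android, and the stored value is named volumeBeforeOverride here for clarity.

```kotlin
// Hypothetical abstraction over a stream's volume; the real code talks to AudioManager.
interface VolumeControl {
    val maxVolume: Int
    var volume: Int
}

class FakeVolumeControl(
    override val maxVolume: Int,
    override var volume: Int,
) : VolumeControl

sealed class StreamVolumeAdjustment {
    abstract fun overrideVolume()
    abstract fun resetVolume()

    // Adjustment without side effects.
    object None : StreamVolumeAdjustment() {
        override fun overrideVolume() = Unit
        override fun resetVolume() = Unit
    }

    class Maximize(private val control: VolumeControl) : StreamVolumeAdjustment() {
        private var volumeBeforeOverride: Int? = null

        // The current level is read here, at playback time, not at queuing time.
        override fun overrideVolume() {
            volumeBeforeOverride = control.volume
            control.volume = control.maxVolume
        }

        override fun resetVolume() {
            volumeBeforeOverride?.let { control.volume = it }
            volumeBeforeOverride = null
        }
    }
}

// Two maximizing utterances are created up front (queued), then played one after another.
fun playTwoMaximizedUtterances(initialVolume: Int): Int {
    val control = FakeVolumeControl(maxVolume = 15, volume = initialVolume)
    val first = StreamVolumeAdjustment.Maximize(control)
    val second = StreamVolumeAdjustment.Maximize(control)
    first.overrideVolume()
    first.resetVolume()
    second.overrideVolume()
    second.resetVolume()
    return control.volume
}
```

Because each adjustment captures the level only when its own playback starts, the second utterance sees the restored level rather than the maximized one, and the stream ends at its initial volume.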
Decoupling the synthesis/playback engine from the queue allows a single TextToSpeech instance to be reused for all queued messages. This removes the latency of initializing a new instance for each utterance and avoids the potential increase in memory usage from holding multiple instances simultaneously. From my testing, the default TextToSpeech implementation shares memory across multiple instances as long as a single language is used, but this optimization might still benefit other engines in the future.
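The queue-plus-engine split could look roughly like the sketch below. It is a simplified, synchronous approximation under several assumptions: the engine interface in the diagram uses suspend functions, whereas this version uses plain functions so that it runs without a coroutine library; RecordingEngine, enqueue, and playQueued are hypothetical names introduced for illustration only.

```kotlin
data class Utterance(val id: String, val text: String)

// Synchronous stand-in for the engine interface shown in the diagram.
interface TextToSpeechEngine {
    fun initialize(): Result<Unit>
    fun play(utterance: Utterance): Result<Unit>
    fun release()
}

// Hypothetical engine that records what happens instead of producing audio.
class RecordingEngine : TextToSpeechEngine {
    val events = mutableListOf<String>()
    private var initialized = false

    override fun initialize(): Result<Unit> {
        if (!initialized) {
            initialized = true
            events += "init"
        }
        return Result.success(Unit)
    }

    override fun play(utterance: Utterance): Result<Unit> {
        events += "play:${utterance.text}"
        return Result.success(Unit)
    }

    override fun release() {
        initialized = false
        events += "release"
    }
}

class TextToSpeechClient(private val engine: TextToSpeechEngine) {
    private val utteranceQueue = ArrayDeque<Utterance>()
    private var nextId = 0

    fun enqueue(text: String) {
        utteranceQueue.addLast(Utterance(id = "utterance-${nextId++}", text = text))
    }

    // Initializes the engine once, plays utterances in FIFO order,
    // and releases the engine when the queue is empty.
    fun playQueued() {
        if (utteranceQueue.isEmpty()) return
        if (engine.initialize().isFailure) {
            utteranceQueue.clear()
            return
        }
        while (true) {
            val next = utteranceQueue.removeFirstOrNull() ?: break
            engine.play(next)
        }
        engine.release()
    }
}

fun demoQueue(): List<String> {
    val engine = RecordingEngine()
    val client = TextToSpeechClient(engine)
    client.enqueue("first")
    client.enqueue("second")
    client.playQueued()
    return engine.events
}
```

In this sketch the engine is initialized and released once for the whole queue, which is the reuse property described above; a fake engine like RecordingEngine is also the kind of test double the abstraction makes possible.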
The new architecture allows for much easier testing due to the clear separation of concerns and the elimination of static objects. However, I noticed that there are no unit or instrumentation test packages included in the project. I did not want to impose any testing infrastructure on the project, so please let me know if there's a right way to contribute some test coverage on new/existing features.
Future improvements
The changes introduced in this PR allow for the future introduction of other TTS engines. Additionally, I recommend looking into audio focus management to ensure that the utterances played by the Home Assistant app blend seamlessly with other media.
A priority queue for alarm_stream_max utterances could also be considered, as users likely send these at critical moments.