
Fixed TTS queuing mechanism and volume override resets

Open LukasPaczos opened this issue 7 months ago • 2 comments

Fixes https://github.com/home-assistant/android/issues/4390.

Issues

TTS static reference

The TTS queue failed because of a single, statically stored reference to the TextToSpeech instance. Each new speakText call created a new instance and overwrote the previous reference: https://github.com/home-assistant/android/blob/50506f06ca6130b40c61bcf198302529388a9fae/common/src/main/java/io/homeassistant/companion/android/common/util/TextToSpeech.kt#L24-L40

Each utterance attempted to clean up after itself in UtteranceProgressListener#onDone, but it used the static reference to the TextToSpeech instance to do so. This caused it to shut down whatever was last queued: https://github.com/home-assistant/android/blob/50506f06ca6130b40c61bcf198302529388a9fae/common/src/main/java/io/homeassistant/companion/android/common/util/TextToSpeech.kt#L53-L55

If more than one utterance was dispatched, each finished utterance would shut down the last queued utterance before it could play. Any utterance that did finish playing couldn't clean up after itself because it no longer had a reference to its own instance.
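
The failure mode can be reproduced off-device with a minimal sketch. It is not the code from the linked file: FakeTts, StaticTtsHolder, and speakTextOld are hypothetical names, and clearing the static reference after shutdown is an assumption made to illustrate the behavior described above.

```kotlin
// Hypothetical stand-in for android.speech.tts.TextToSpeech so the sketch runs off-device.
class FakeTts(val text: String) {
    var isShutDown = false
        private set

    fun shutdown() {
        isShutDown = true
    }
}

// Hypothetical holder modelling the single, statically stored reference.
object StaticTtsHolder {
    var current: FakeTts? = null
}

// Mimics the old flow: every call creates an instance and overwrites the shared reference.
// Returns the instance together with its onDone callback so the effect can be observed.
fun speakTextOld(text: String): Pair<FakeTts, () -> Unit> {
    val instance = FakeTts(text)
    StaticTtsHolder.current = instance
    val onDone: () -> Unit = {
        // Cleanup goes through the static reference, not through `instance`.
        StaticTtsHolder.current?.shutdown()
        StaticTtsHolder.current = null // assumption: the reference is cleared after shutdown
    }
    return instance to onDone
}
```

Calling speakTextOld twice and then invoking the first utterance's onDone shuts down the second instance, while the first instance is never shut down, even after the second onDone runs.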

Maximized volume reset

While fixing the reference issue, I noticed a second problem in the feature that maximizes the alarm stream volume (when alarm_stream_max is present in the data). It reads the current volume level at queuing time and uses it as the reset level when the utterance finishes. If an alarm_stream_max utterance was queued while another was playing, the queued one would store the volume already maximized by the previous utterance as its reset value: https://github.com/home-assistant/android/blob/50506f06ca6130b40c61bcf198302529388a9fae/common/src/main/java/io/homeassistant/companion/android/common/util/TextToSpeech.kt#L33

This resulted in the alarm stream volume being permanently set to the maximum value.
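
A simplified simulation of this sequence, assuming a starting volume of 3 and a maximum of 7. FakeAlarmStream and EnqueueTimeCapture are hypothetical stand-ins for the alarm stream and the old per-utterance state, not code from the project.

```kotlin
// Hypothetical stand-in for the alarm stream that AudioManager controls.
class FakeAlarmStream(var volume: Int, val maxVolume: Int)

// Old behavior as described above: the reset level is read when the utterance is queued.
class EnqueueTimeCapture(private val stream: FakeAlarmStream) {
    private val resetVolume = stream.volume // captured at queuing time

    fun start() {
        stream.volume = stream.maxVolume
    }

    fun finish() {
        stream.volume = resetVolume
    }
}

// Queues a second utterance while the first one is still playing.
fun volumeAfterOverlappingUtterances(): Int {
    val stream = FakeAlarmStream(volume = 3, maxVolume = 7)
    val first = EnqueueTimeCapture(stream)  // captures 3
    first.start()                           // volume is now 7
    val second = EnqueueTimeCapture(stream) // captures 7, the maximized level
    first.finish()                          // volume back to 3
    second.start()                          // volume is 7 again
    second.finish()                         // "resets" to 7
    return stream.volume
}
```

Under these assumptions the function returns 7 instead of the user's original level of 3.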

Solution

To resolve the issues mentioned above, I am introducing a new TextToSpeechClient class. This class is responsible for queuing utterances and dispatching them at the correct time to a TextToSpeechEngine, which is solely responsible for playing the utterance. By default, I am providing the AndroidTextToSpeechEngine class, which uses Android's TextToSpeech to synthesize and play back text. This abstraction makes the code easier to test and opens the door for the introduction of other TTS engines in the future.

[Figure: tts_queue_plantUML]

PlantUML source
@startuml
interface TextToSpeechEngine {
+ suspend initialize(): Result<Unit>
+ suspend play(Utterance): Result<Unit>
+ release()
}
note top: Base interface for engines that can synthesize and play back a message.


class AndroidTextToSpeechEngine {
- val initMutex: Mutex
- var textToSpeech: TextToSpeech?
- var lastVolumeOverridingUtterance: Utterance?
+ <<Create>> AndroidTextToSpeechEngine(appContext: Context)
+ suspend initialize(): Result<Unit>
+ suspend play(Utterance): Result<Unit>
+ release()
}
note bottom: Default implementation of a TTS engine that uses the platform's built-in `TextToSpeech` engine.
note right of AndroidTextToSpeechEngine::initMutex
  Mutex is locked during initialization to ensure that only one init call is running at a time
  and that subsequent #play() function calls also wait for init's result.
end note
note right of AndroidTextToSpeechEngine::lastVolumeOverridingUtterance
  Stored reference of the last volume overriding utterance
  so that it can be used to reset volume back in case of the playback being interrupted.
end note


class Utterance {
+ val id: String
+ val text: String
+ val streamVolumeAdjustment: StreamVolumeAdjustment
+ val audioAttributes: AudioAttributes
}
note top: Data model that contains the text to be synthesized,\naudio stream attributes that should be used,\nand a helper to adjust volume before and after playback.

  
class StreamVolumeAdjustment {
+ overrideVolume()
+ resetVolume()
}
note left: Utility object that interacts with AudioManager to override audio channel's volume.


class StreamVolumeAdjustment$None {
+ overrideVolume()
+ resetVolume()
}
note bottom: Implementation of the adjustment that doesn't produce any side effects.


class StreamVolumeAdjustment$Maximize {
- val maxVolume: Int
- var resetVolume: Int?
+ <<Create>> Maximize(audioManager: AudioManager, audioStreamId: Int)
+ overrideVolume()
+ resetVolume()
}
note left of StreamVolumeAdjustment$Maximize::resetVolume
  Queried and stored right before #overrideVolume is called and used in a following #resetVolume call.
end note

StreamVolumeAdjustment <|-- StreamVolumeAdjustment$None: extends
StreamVolumeAdjustment <|-- StreamVolumeAdjustment$Maximize: extends
TextToSpeechEngine <|.. AndroidTextToSpeechEngine: implements


class TextToSpeechClient {
- val utteranceQueue: ArrayDeque<Utterance>
+ <<Create>> TextToSpeechClient(appContext: Context, engine:TextToSpeechEngine)
+ speakText(data: Map<String, String>)
+ stopTTS()
- play()
- handleError(message: String)
}
note top: Entry point for speech synthesis and playback, maintains a FIFO queue of utterances.
note left of TextToSpeechClient::speakText
  Queues utterances and starts playback, if not yet started.
end note
note left of TextToSpeechClient::stopTTS
  Clears the queue and stops the playback.
end note
note left of TextToSpeechClient::play
  Initializes TTS engine, pops an utterance from the queue, and plays it.
  Once the playback is finished, tries popping another utterance.
  If the queue is empty, releases the engine.
end note


TextToSpeechEngine::play --> Utterance: uses
Utterance::streamVolumeAdjustment --> StreamVolumeAdjustment: uses
TextToSpeechClient::TextToSpeechClient --> TextToSpeechEngine: uses
@enduml

With tighter control over the queuing mechanism and visibility into when each message starts and finishes playing, the engine reads the current volume level right before playback begins. This ensures it always stores the correct volume level to restore when playback ends.
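
A minimal sketch of this idea, loosely following the StreamVolumeAdjustment shape from the diagram. VolumeControl and FakeVolumeControl are hypothetical abstractions that replace AudioManager so the example runs without Android, and the field is named storedResetVolume here (rather than resetVolume) to keep it distinct from the method. The actual implementation in the PR may differ.

```kotlin
// Hypothetical abstraction over AudioManager; not part of the PR.
interface VolumeControl {
    val maxVolume: Int
    var volume: Int
}

class FakeVolumeControl(override var volume: Int, override val maxVolume: Int) : VolumeControl

sealed class StreamVolumeAdjustment {
    abstract fun overrideVolume()
    abstract fun resetVolume()

    // No side effects: used when the utterance does not request a volume override.
    object None : StreamVolumeAdjustment() {
        override fun overrideVolume() {}
        override fun resetVolume() {}
    }

    class Maximize(private val control: VolumeControl) : StreamVolumeAdjustment() {
        private var storedResetVolume: Int? = null

        override fun overrideVolume() {
            // Read the current level right before overriding, not at queuing time.
            storedResetVolume = control.volume
            control.volume = control.maxVolume
        }

        override fun resetVolume() {
            storedResetVolume?.let { control.volume = it }
            storedResetVolume = null
        }
    }
}

// Plays adjustments one after another, as a queue-driven engine would.
fun playSequentially(adjustments: List<StreamVolumeAdjustment>) {
    for (adjustment in adjustments) {
        adjustment.overrideVolume()
        // Playback of the utterance would happen here.
        adjustment.resetVolume()
    }
}
```

Because each adjustment reads the volume only when its own playback starts, two Maximize adjustments created back to back both restore the original level.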

Decoupling the synthesis/playback engine from the queue allows for the reuse of a single TextToSpeech instance to play all queued messages. This removes the latency of initializing a new instance for each utterance and avoids potential memory usage increases associated with storing multiple instances simultaneously. From my testing, the default TextToSpeech implementation shares memory across multiple instances as long as a single language is used, but this optimization might still benefit other engines in the future.
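
The queue-and-reuse behavior can be sketched in a simplified, blocking form. This is an illustration only: the interface in the diagram uses suspend functions, whereas this version is synchronous so that it needs nothing beyond the Kotlin standard library. The Utterance model is reduced to two fields, stopTTS and error reporting are left out, and RecordingEngine is a hypothetical test double.

```kotlin
data class Utterance(val id: String, val text: String)

// Synchronous simplification of the engine interface from the diagram.
interface TextToSpeechEngine {
    fun initialize(): Result<Unit>
    fun play(utterance: Utterance): Result<Unit>
    fun release()
}

// Hypothetical test double that records calls; onPlay runs while an utterance is "playing".
class RecordingEngine(private val onPlay: (Utterance) -> Unit = {}) : TextToSpeechEngine {
    var initCount = 0
        private set
    var releaseCount = 0
        private set
    val played = mutableListOf<String>()

    override fun initialize(): Result<Unit> {
        initCount++
        return Result.success(Unit)
    }

    override fun play(utterance: Utterance): Result<Unit> {
        played += utterance.text
        onPlay(utterance)
        return Result.success(Unit)
    }

    override fun release() {
        releaseCount++
    }
}

class TextToSpeechClient(private val engine: TextToSpeechEngine) {
    private val utteranceQueue = ArrayDeque<Utterance>()
    private var isPlaying = false

    // Queues the utterance and starts playback if it is not already running.
    fun speakText(utterance: Utterance) {
        utteranceQueue.addLast(utterance)
        if (!isPlaying) play()
    }

    // Initializes the engine once, drains the queue in FIFO order, then releases the engine.
    private fun play() {
        isPlaying = true
        try {
            if (engine.initialize().isFailure) {
                utteranceQueue.clear()
                return
            }
            while (true) {
                val next = utteranceQueue.removeFirstOrNull() ?: break
                engine.play(next)
            }
        } finally {
            engine.release()
            isPlaying = false
        }
    }
}
```

If a second utterance arrives while the first is playing, it is appended to the queue and played by the same engine instance, so the engine is initialized and released once per burst of messages rather than once per utterance.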

The new architecture allows for much easier testing due to the clear separation of concerns and the elimination of static objects. However, I noticed that there are no unit or instrumentation test packages included in the project. I did not want to impose any testing infrastructure on the project, so please let me know if there's a right way to contribute some test coverage on new/existing features.

Future improvements

The changes introduced in this PR allow for the future introduction of other TTS engines. Additionally, I recommend looking into audio focus management to ensure that the utterances played by the Home Assistant app blend seamlessly with other media.

A priority queue for alarm_stream_max utterances could also be considered, as users likely send these at critical moments.
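
One possible shape for such a queue, sketched with java.util.PriorityQueue. PendingUtterance and PrioritizedUtteranceQueue are hypothetical names that are not part of this PR; the sequence number is an assumption added to preserve FIFO order among utterances of equal priority.

```kotlin
import java.util.PriorityQueue

// Hypothetical queue entry; `sequence` keeps insertion order within a priority level.
data class PendingUtterance(val text: String, val isAlarmStreamMax: Boolean, val sequence: Long)

class PrioritizedUtteranceQueue {
    private var nextSequence = 0L

    // alarm_stream_max utterances first, then older entries before newer ones.
    private val queue = PriorityQueue<PendingUtterance>(
        compareByDescending<PendingUtterance> { it.isAlarmStreamMax }.thenBy { it.sequence }
    )

    fun add(text: String, isAlarmStreamMax: Boolean) {
        queue.add(PendingUtterance(text, isAlarmStreamMax, nextSequence++))
    }

    // Returns the next utterance to play, or null if the queue is empty.
    fun poll(): PendingUtterance? = queue.poll()
}
```

With this ordering, an alarm_stream_max utterance added after several regular ones is still returned first, while regular utterances keep their relative order.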

LukasPaczos, Jun 26 '24 07:06