Add support for I2S microphone / I²S audio input
Describe the problem you have/What new integration you would like
It would be great to have ESPHome support for an I2S (I²S) microphone like the one on the LILYGO® TTGO T-Camera. This specific model came with a demo firmware that had a voice command to activate the camera. I don't know if there's any source available for this demo.
Please describe your use case for this integration and alternatives you've tried:
- Voice commands (as demonstrated by the demo firmware)
- Text-to-Speech
- (One-way) communication with visitors
Additional context
I originally commented on #599, but that appears to be a request for I2S audio output, while I'm requesting support for audio input.
I am also a +1 for I2S audio input. For my needs, I only need a rough plot of frequency vs. time; no actual processing of the audio... which I imagine is where most of the difficulty comes in.
I am still doing initial research for this project, but did manage to find this ESP32 project that works with an I2S microphone and comes with a very helpful library for extracting the dominant frequencies.
The architecture seems to make heavy use of FreeRTOS queues. I don't know what facilities ESPHome has for a similar architecture, so the above project may not be a trivial port.
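For illustration, the FreeRTOS pattern being described is roughly a producer task feeding fixed-size sample buffers through a queue to a consumer task. This is only a sketch with invented names and sizes, not code from the linked project:

#include <cstdint>
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include "freertos/task.h"

struct AudioChunk {
  int16_t samples[256];  // arbitrary chunk size for the sketch
};

static QueueHandle_t audio_queue;

void reader_task(void *arg) {
  AudioChunk chunk;
  for (;;) {
    // ... fill chunk.samples from the I2S driver here ...
    xQueueSend(audio_queue, &chunk, portMAX_DELAY);  // the queue copies the chunk
  }
}

void analysis_task(void *arg) {
  AudioChunk chunk;
  for (;;) {
    if (xQueueReceive(audio_queue, &chunk, portMAX_DELAY) == pdTRUE) {
      // ... run FFT / dominant-frequency analysis on chunk.samples ...
    }
  }
}

void start_audio_pipeline() {
  audio_queue = xQueueCreate(4, sizeof(AudioChunk));
  xTaskCreate(reader_task, "i2s_reader", 4096, nullptr, 5, nullptr);
  xTaskCreate(analysis_task, "analysis", 4096, nullptr, 4, nullptr);
}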
I'd be mostly interested in streaming audio via I2S. I've never attempted this, but a quick search turned up this very recent gist: https://gist.github.com/GrahamM/1d5ded26b23f808a80520e8c1510713a
There's also a YouTube channel "The Project" with lots of experiments using an INMP441 mic with an ESP32.
I've also bought the same board, and the demo's microphone usage looks interesting.
atomic14 also has relevant experiments: https://www.youtube.com/playlist?list=PL5vDt5AALlRfGVUv2x7riDMIOX34udtKD, https://github.com/atomic14/esp32_audio
Output is done: https://esphome.io/components/media_player/i2s_audio.html
If we could use ESPHome to make a smart speaker similar to the Google Nest Mini, the home DIY world would be much more complete.
Starting with a noise monitor use case would be ideal. Then audio baby monitor, then 2-way audio, then full smart speaker.
I would love to have this for use with the M5 Atom Echo smart speaker. It has a button and a mic; my use case would be to press the button and send the audio (until I release it) to HA as a voice command.
I've been waiting for this for two years... I made a doorbell with an ESP32-CAM, a capacitive touch button, and a DFPlayer for text-to-speech, but the missing mic support is so frustrating...
Another solution I tried was Jitsi Meet on a Raspberry Pi 4, but putting an RPi4 in a 3D-printed PLA box by the road in front of my house was not a good solution... rain, sun... and the latency over a Wi-Fi connection... too bad.
A year has passed since the first post... I think we'll have to wait at least one more... lol
I would also find this really useful. Just the raw audio, maybe over UDP or something. Able to be pulled into other things (Rhasspy, for example).
I have subscribed, as I'm also interested. It would be great to be able to use the MEMS microphone on the Enviro+ over I2S to read noise levels. I'm not so interested in raw sound input.
The ESP32 Muse from Raspiaudio arrived yesterday which also has a mic. Would be awesome to be able to make use of it!
More examples, if they're of any help:
- https://github.com/atomic14/esp32_wireless_microphone
- https://github.com/ikostoski/esp32-i2s-slm
- https://iotassistant.io/esp32/smart-door-bell-noise-meter-using-fft-esp32/
This is probably superseded by @ristomatti's links, but I figure it may still be of interest: a while back I managed to get a sound sensor working in ESPHome with a custom sensor. It was cobbled together from various sources, and I ended up abandoning it because I couldn't get it to run stably on my board (TTGO camera board) at the same time as the camera, but it seems to run okay without the camera. Someone more knowledgeable than me might well be able to fix that though (I hope!).
https://gist.github.com/krisnoble/6ffef6aa68c374b7f519bbe9593e0c4b
I am looking to take this task on. I have some prototypes working on some different mics. I just need a little help on the design of this.
I am going to start with a basic decibel meter but after that it gets a little more complex. What should the output of a "generic microphone" be?
Can triggers pass arrays of data around, or will we need to make a different audio component for each use case (decibel meter, udpstream, hermes)? What if you have a situation like the M5 Echo, where the I2S bus is shared between mic and speaker?
@mrhatman please come on Discord, there's our dev channel where our devs are lively answering such questions!
@mrhatman, indeed, implementing just a noise meter vs. implementing something more generic and flexible might become complicated very quickly. I'm not sure anyone could easily propose the best architecture/design; that's definitely one of the challenges for whoever takes on this task. I think it would be helpful if you provided your own vision, or a few alternatives varying in degree of flexibility/complexity, going from the most basic (just a single noise meter component) to the most flexible one, which might include multiple inputs/outputs.
Considering a flexible solution, one might draw some inspiration from libraries like the Teensy Audio Library or Arduino Audio Tools, which operate on an interconnected graph of audio nodes of different types.
For example, you can draw a diagram in a GUI design tool:
[image: audio node graph drawn in the design tool]
and export it to code:
#include <Audio.h>

AudioInputI2S      i2s1;      //xy=80,404
AudioFilterBiquad  biquad1;   //xy=355,180
AudioAnalyzeFFT256 fft256_1;  //xy=373,354
AudioAnalyzePeak   peak1;     //xy=385,593
AudioAnalyzeRMS    rms1;      //xy=392,447
AudioOutputI2S     i2s2;      //xy=681,314
AudioConnection    patchCord1(i2s1, 0, biquad1, 0);
AudioConnection    patchCord2(i2s1, 0, rms1, 0);
AudioConnection    patchCord3(i2s1, 0, fft256_1, 0);
AudioConnection    patchCord4(i2s1, 1, peak1, 0);
AudioConnection    patchCord5(biquad1, 0, i2s2, 0);
Applying this approach to ESPHome and noise meter task, a hypothetical config might look something like this:
audio:
  - platform: i2s_input
    id: i2s_1
    lrclk_pin: GPIO33
    dout_pin: GPIO22
    bclk_pin: GPIO19
    mode: mono
    sampling_rate: 48000
  - platform: iir_filter
    id: mic_eq
    b: [1, 2, 3]
    a: [3, 4, 5]
  - platform: noise_meter
    id: noise_meter_1
    freq_weighting: A
    time_weighting: fast
    Leq:
      name: LAeq_fast
    Lpeak:
      name: LApeak
    Lmax:
      name: LAmax_fast
  - platform: noise_meter
    id: noise_meter_2
    freq_weighting: C
    time_weighting: slow
    Leq:
      name: LCeq_slow
    Lpeak:
      name: LCpeak
  - connections:
      - source: i2s_1
        destination: mic_eq
      - source: mic_eq
        destination: noise_meter_1
      - source: mic_eq
        destination: noise_meter_2
Of course, I understand that this would require much more work than just implementing a single noise meter component, so maybe there could be other practical options, not as flexible as this approach, but flexible enough and simpler to implement.
PS. I'm not an ESPHome or audio expert, so don't take this too seriously.
I am looking to take this task on.
<3
I have some prototypes working on some different mics. I just need a little help on the design of this.
I am going to start with a basic decibel meter but after that it gets a little more complex. What should the output of a "generic microphone" be?
For use with ESPHome, I think the most value is in a component that works like what @stas-sl mocked up: a series of small, single-purpose components that are piped together as needed.
I would imagine that most people looking to use audio IN with ESPHome are trying to react to loud noises or some specific sound (glass breaking, dog barking, hands clapping), or possibly a few different words/spoken commands (which, really, are also just filters for certain waveforms, but distinctly more challenging!). I'll come back to the spoken word / specific sounds and bi-directional audio in a sec... those are more complicated!
The algorithms for filtering out specific frequency bands or energy levels could be extensions of the existing ESPHome filters, so it would then be pretty simple to have a basic "template number" that is incremented every time audio between X and Y Hz and above Z dB is detected. With some basic tuning around the dB level, building a "when I clap three times quickly, turn this GPIO on/off" configuration in ESPHome should be easy enough, which is probably enough for most use cases.
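To make that concrete, checking for energy in a single band doesn't even need a full FFT; a Goertzel filter at one target frequency is cheap enough to run per buffer. A minimal sketch (names, frequencies, and thresholds invented; nothing here is existing ESPHome code):

#include <cmath>
#include <cstdint>

// Power of `target_hz` in a block of n samples (Goertzel algorithm).
float goertzel_power(const int16_t *samples, int n, float target_hz, float sample_rate_hz) {
  constexpr float kPi = 3.14159265f;
  const float k = std::round(n * target_hz / sample_rate_hz);
  const float coeff = 2.0f * std::cos(2.0f * kPi * k / n);
  float s1 = 0.0f, s2 = 0.0f;
  for (int i = 0; i < n; i++) {
    const float s = samples[i] + coeff * s1 - s2;
    s2 = s1;
    s1 = s;
  }
  return s1 * s1 + s2 * s2 - coeff * s1 * s2;
}

// A "clap" is band power above a tuned threshold; counting three of
// these within a short window would drive the GPIO automation.
bool loud_in_band(const int16_t *samples, int n) {
  const float power = goertzel_power(samples, n, 4650.0f, 16000.0f);  // ~4.3-5.0 kHz band centre
  return power > 1e9f;  // threshold needs tuning per mic and gain
}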
As for voice / bi-di audio: there's a reason why Alexa/GoogleHome are "dumb pipes" that are good at detecting a wake word and piping all the audio directly to some very beefy computers.
There are entire frameworks and platforms, analogous to ESPHome, dedicated to the more complicated waveform/pattern-matching required for arbitrary command recognition. There's a whole ECOSYSTEM of training data / models / tuning for the models that then get plugged into those frameworks, and I don't think "add TensorFlow Lite and TinyML into ESPHome" is a trivial task... certainly a bit more work than "implement basic microphone support"!
Additionally, figuring out if/how to gracefully degrade the experience on ESP32 devices that don't have dedicated peripherals is tricky. The S3 version does have a dedicated co-processor for this stuff, but it only works with models that Espressif supplies... I think. At any rate, I remember that they supported "wake words" and you had to pay them to build a new model for a new wake word, as of the last time I looked into this, about a year ago. I don't know if the process of building the model was ever open-sourced. I also do not know what the plan/strategy would be for people that use the Arduino framework with ESPHome rather than the esp-idf framework. This might be one of those situations where you have to pick between Arduino-exclusive features (like web_server) and audio reactivity (only implemented with esp-idf).
Could you use TensorFlow just on the CPU with a model / wake word of your choosing? Yes. Will there be enough additional CPU to also handle MQTT, BTLE connections, OTA updates and all the other stuff that ESPHome is doing? Maybe. Will TensorFlow/TinyML be easy to shove into the "ESPHome will call you every once in a while" loop/model? I don't know.
Ultimately, scoping the work to a more basic "users can react to certain levels of energy in certain frequency bands" is achievable without expanding it to "and also get TinyML integrated with ESPHome", and it probably works better in the general ESPHome paradigm, where your "do_work()" function is called every once in a while by the main ESPHome loop, rather than being a dedicated loop that's constantly waiting for new data from the FFT or possibly an interrupt from the audio/ML co-processor.
Exposing (a portion of) the underlying API to get the raw samples should be possible for people wishing to do more complicated things like voice/word detection or encoding/streaming the audio to another computer but trying to add "first class" support for that might be tricky!
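As a sketch of how that "called every once in a while" model could host a basic meter: some driver task fills a shared buffer, and a polled component just summarises whatever has accumulated and publishes one number. The class below is hypothetical, not an existing component; only PollingComponent and publish_state() are real ESPHome API:

#include <cmath>
#include <cstddef>
#include <cstdint>
#include "esphome/core/component.h"
#include "esphome/components/sensor/sensor.h"

// Filled elsewhere by an I2S reader task (omitted in this sketch).
static int16_t g_samples[1024];
static volatile size_t g_sample_count = 0;

class SoundLevelSensor : public esphome::PollingComponent, public esphome::sensor::Sensor {
 public:
  SoundLevelSensor() : PollingComponent(1000 /* ms */) {}

  void update() override {
    const size_t n = g_sample_count;
    if (n == 0)
      return;
    double sum_sq = 0.0;
    for (size_t i = 0; i < n; i++)
      sum_sq += double(g_samples[i]) * g_samples[i];
    g_sample_count = 0;
    this->publish_state(std::sqrt(sum_sq / n));  // RMS in raw sample units
  }
};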
Can triggers pass arrays of data around, or will we need to make a different audio component for each use case (decibel meter, udpstream, hermes)? What if you have a situation like the M5 Echo, where the I2S bus is shared between mic and speaker?
More than a few ESPHome components use the "sensor on a bus" model, e.g. you feed an instance of the modbus component into various sensors. This might mean a slight refactor of the existing i2s_audio.
From:
# Example configuration entry
media_player:
  - platform: i2s_audio
    name: ESPHome I2S Media Player
    dac_type: external
    i2s_lrclk_pin: GPIO33
    i2s_dout_pin: GPIO22
    i2s_bclk_pin: GPIO19
    mode: mono
to:
i2s:
  - id: main_audio_bus
    i2s_lrclk_pin: GPIO33
    i2s_dout_pin: GPIO22
    i2s_bclk_pin: GPIO19

media_player:
  - platform: i2s_audio
    i2s_bus: main_audio_bus
    dac_type: external
    mode: mono

microphone:
  - id: ext_mic
    platform: i2s_microphone
    i2s_bus: main_audio_bus

sensor:
  - platform: template
    name: "Clap Count"
    filters:
      # Only care about sounds between X and Y Hz
      # Might also combine with OR: to react to multiple bands
      - audio_freq_filter_out:
          lower: 4.3 kHz
          upper: 5.0 kHz
      # More filters here to check the dB level and then to sum up
      # the number of times the dB threshold is crossed
Thoughts?
I really like the pipe layout for audio and modeling it off of Arduino audio libraries with sources and sinks, likely using callbacks to communicate buffers of data from component to component. You could create various audio sources:
- I2S
- ADC (internal or external)
- Bluetooth audio (connect to your phone to stream music around)
- Media Player
Different filters and passthrough operations:
- Frequency filter
- Pop rejection
- Levelizer
And finally you could have different audio sinks:
- I2S
- DAC
- Bluetooth speaker
- Hermes
- UDP audio stream
- Tensorflow
- Wakeword engine
- Clap detector
I2S would likely be both a source and a sink on some devices, but that should work.
Some examples:
# Existing Media Player
media_player:
  - id: media_player

i2s:
  - id: main_audio_bus
    i2s_lrclk_pin: GPIO33
    i2s_dout_pin: GPIO22
    i2s_bclk_pin: GPIO19
    dac_type: external
    mode: mono
    output_audio_source: media_player
# Decibel Meter
decibel_meter:
  - id: decibel_meter
    audio_source: main_audio_bus

i2s:
  - id: main_audio_bus
    i2s_lrclk_pin: GPIO33
    i2s_din_pin: GPIO22
    i2s_bclk_pin: GPIO19
    mode: mono
# Filter Example
decibel_meter:
  - id: decibel_meter
    audio_source: filtered_audio_bus

audio_filter:
  - id: filtered_audio_bus
    audio_source: main_audio_bus

i2s:
  - id: main_audio_bus
    i2s_lrclk_pin: GPIO33
    i2s_din_pin: GPIO22
    i2s_bclk_pin: GPIO19
    mode: mono
Thanks for looking at this. Make sure you link up with the ESPHome team, since I think Nabu Casa are prioritising audio projects in 2023.
Look at the i2c examples for inspiration; I think it's exactly analogous. They use i2c components to define the buses, then sensor components that reference those bus IDs.
I'm glad there is some kind of agreement at a higher level that it should be a graph of nodes piped together, to make it as flexible and extendable as possible. However, I'd like to discuss some smaller details, which seem important as well. I don't have a strong opinion on them; I just want to explain what makes sense to me and why.
- I propose moving all audio-related components under a separate config section/C++ namespace, the same way light/fan/display components have their own section in the config, instead of scattering components like media_player and microphone at the top level.
audio:
  - platform: i2s
  - platform: media_player
  - platform: microphone
  - platform: gain
  - platform: equalizer
  - platform: noise_meter
  ...
Well, actually, I'm not sure there should be a microphone component at all, as it is essentially just an i2s device/stream/bus.
- Specifying node connections. Besides the nodes themselves, you need to specify which outputs go into which inputs. There are probably several ways to do it: 1) as @mrhatman did, by specifying audio_source or output_audio_source per component, or 2) by specifying all the connections separately, after all nodes are declared, as I showed in my example above, inspired by how it is done in the Teensy library. Either option should work, but IMHO the second way is cleaner and easier to understand, especially if there is a larger graph and each component has multiple inputs/outputs. For example, the immediate question I have with the 1st option is where you should specify connections: in the child node specifying inputs, or in the parent node specifying outputs. I see both options, audio_source and output_audio_source, in the example, which looks a bit confusing to me. Unlike the Teensy library, Arduino Audio Tools doesn't have a single way of specifying connections: sometimes they pass an input stream to the constructor, sometimes an output stream, and sometimes a separate stream copier class is responsible for data propagation. I haven't seen bigger examples with more than 3-4 nodes, but I guess it might become hard to follow how the flow goes.
#include "AudioTools.h"

I2SStream in;
I2SStream out;
// copy filtered values
FilteredStream<int16_t, float> filtered(in, channels);  // defines the filter as a BaseConverter
StreamCopy copier(out, filtered);                       // copies sound into i2s

void loop() {
  copier.copy();
}

and:

#include "AudioTools.h"

SineWaveGenerator<int16_t> sine_wave(32000);         // subclass of SoundGenerator with max amplitude of 32000
GeneratedSoundStream<int16_t> in_stream(sine_wave);  // stream generated from sine wave
CsvStream<int16_t> out(Serial, to_channels);         // output to Serial
ChannelFormatConverterStreamT<int16_t> conv(out);
StreamCopy copier(conv, in_stream);                  // copies sound to out

void loop() {
  copier.copy();
}
I don't know the exact syntax for specifying connections separately in the config (I could propose a few), but I see benefits in declaring them outside of the nodes themselves.
- The media_player component definitely has to change a lot with this approach. As I understand it, it currently has 2 responsibilities: 1) loading/decoding media from a URL, and 2) streaming audio data to I2S. With the proposed approach, the 2nd part would be implemented as a separate node/component like I2SStream/I2SDevice, so media_player should only be responsible for loading the data and piping it downstream. As I mentioned in the 1st point, I would propose not giving media_player a separate top-level config section, but instead moving it under the audio section, the same way as many other possible audio sources.
- @mrhatman, I'm not sure I fully understand how you are going to use callbacks, but I would highly recommend studying how it is implemented in the libraries I mentioned (or maybe some others), so as not to reinvent the wheel. As I understand it, there should be some AudioStream base class with read/write methods, and each node should be aware of its immediate child nodes, so that after data is written to a parent node and processed, it is passed downstream to all children, either by calling child->write(data) directly for each child, or by queueing/scheduling it. Of course, that is my naive and simplistic understanding, and it has to be much more complicated than this.
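For what it's worth, the naive version of that description might look like this (class names invented for illustration, not an actual ESPHome API):

#include <cstddef>
#include <cstdint>
#include <vector>

// A minimal audio node: write() processes a buffer, then fans it out
// to every registered child node.
class AudioNode {
 public:
  virtual ~AudioNode() = default;

  void add_child(AudioNode *child) { children_.push_back(child); }

  virtual void write(const int16_t *data, size_t len) {
    this->process(data, len);
    for (AudioNode *child : children_)
      child->write(data, len);
  }

 protected:
  // Sinks (decibel meter, UDP stream, ...) override process() only;
  // a filter would override write() to forward a transformed buffer.
  virtual void process(const int16_t *data, size_t len) {}

 private:
  std::vector<AudioNode *> children_;
};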
This initial discussion about I2S audio has kind of ballooned into a discussion about how we want audio handling in ESPHome to work, so I agree with you @jamesmyatt, we might want to bring the ESPHome team into this discussion. I just wouldn't know where to start.
I am going to start prototyping 2 proofs of concept with the node architecture and see how people feel about the implementation:
- an I2S microphone that feeds into a basic decibel/RMS meter (the core computation is sketched below)
- a sine wave generator that feeds into an I2S speaker on the same bus as the mic
I feel these are the basic demos that will give a feel for how the audio handling will work.
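For reference, the core computation of the first demo is small: block RMS expressed in dBFS, plus a datasheet-sensitivity offset to approximate SPL. A sketch (the 94 dB / -26 dBFS figures are just an example, not from any specific mic):

#include <cmath>
#include <cstddef>
#include <cstdint>

// RMS of a sample block in dB relative to 16-bit full scale.
float rms_dbfs(const int16_t *samples, size_t n) {
  double sum_sq = 0.0;
  for (size_t i = 0; i < n; i++)
    sum_sq += double(samples[i]) * samples[i];
  const double rms = std::sqrt(sum_sq / double(n));
  return rms > 0.0 ? 20.0f * std::log10(float(rms / 32768.0)) : -120.0f;
}

// With a mic sensitivity of e.g. -26 dBFS at 94 dB SPL, an approximate
// sound pressure level is: 94.0f + rms_dbfs(buf, n) - (-26.0f)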
@stas-sl to respond to your thoughts:
- I think this makes sense; I like things being organized. Microphone will be replaced by "I2S in" or something like that; it is just an audio source.
- To clarify a bit here, an I2S bus in my example could be both an input and an output. "Output audio source" was poorly named; I think a bus setup like @jamesmyatt suggested would clarify that better.
- I agree, the existing media player will just be an audio source with some nice-to-haves (pausing, playing, etc.), and then we can forward it to any number of audio sinks like an I2S speaker, Bluetooth speaker, etc.
- The callbacks were basically the equivalent of read/write, as the data from the audio source might be intermittent.
https://github.com/syssi/esphome-zeroconf and https://github.com/esphome/esphome/pull/4202 could be used to announce media services on the network with mDNS.
First pass of the decibel meter using I2S audio
https://github.com/esphome/esphome/commit/d7fc67c5daaab408fe80d4b9d254fcfd306dc026
Not finished and needs a lot of work but it works for me at least.
Tested on an M5 Echo using this config:
audio_source:
  - platform: i2s_audio_source
    id: i2s_mic
    name: "i2s_mic"
    i2s_lrclk_pin: GPIO33
    i2s_din_pin: GPIO23
    i2s_bclk_pin: GPIO19

sensor:
  - platform: decibel_meter
    sensitivity: -23
    audio_source: i2s_mic
    name: decibels
If I should make a PR to track instead let me know.
If I should make a PR to track instead let me know.
Looks good. You'll want a PR once you're ready for review from @jesserockz so may as well open one now.
Could anyone please share a list of supported hardware?
I think you need to add fields like "update_interval".
Could anyone please share a list of supported hardware?
As of right now, the only tested microphone is the SPM1423 (PDM), but any mic that supports an I2S interface should work.
How do we want to handle audio specifications for the connections? Mono/stereo, bytes per sample, frequency, etc. The current example is hardcoded to 16 kHz, 16-bit, mono.
I think the easiest way would be to have the audio sources declare their configuration and let the receiving component handle the conversion if needed.
audio_source:
  - platform: i2s_audio_source
    id: i2s_mic
    name: "i2s_mic"
    i2s_lrclk_pin: GPIO33
    i2s_din_pin: GPIO23
    i2s_bclk_pin: GPIO19
    bits_per_sample: 16
    audio_frequency: 16000
    mode: mono

sensor:
  - platform: decibel_meter
    sensitivity: -23
    audio_source: i2s_mic
    name: decibels
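As a sketch of what "sources declare, sinks convert" could mean in code (struct and function names invented): the source exposes a format descriptor, and a sink that expects mono performs its own downmix when the declared channel count doesn't match:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical format descriptor a source would declare.
struct AudioFormat {
  uint32_t sample_rate;     // e.g. 16000
  uint8_t bits_per_sample;  // e.g. 16
  uint8_t channels;         // 1 = mono, 2 = stereo
};

// Example conversion a receiving component might apply:
// interleaved stereo -> mono by averaging the two channels.
std::vector<int16_t> stereo_to_mono(const std::vector<int16_t> &interleaved) {
  std::vector<int16_t> mono(interleaved.size() / 2);
  for (size_t i = 0; i < mono.size(); i++)
    mono[i] = int16_t((int32_t(interleaved[2 * i]) + int32_t(interleaved[2 * i + 1])) / 2);
  return mono;
}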