Add support for I2S microphone / I²S audio input
Describe the problem you have/What new integration you would like
It would be great to have ESPHome support for an I2S (I²S) microphone like the one on the LILYGO® TTGO T-Camera. This specific model came with a demo firmware that had a voice command to activate the camera. I don't know if there's any source available for this demo.
Please describe your use case for this integration and alternatives you've tried:
- Voice commands (as demonstrated by the demo firmware)
- Text-to-Speech
- (One-way) communication with visitors
Additional context
I originally commented on #599, but that appears to be a request for I2S audio output, while I'm requesting support for audio input.
I am also a +1 for I2S audio input. For my needs, I only need a rough plot of frequency vs. time; no actual processing of the audio... which I imagine is where most of the difficulty comes in.
I am still doing initial research for this project, but did manage to find this ESP32 project that works with an I2S microphone and comes with a very helpful library for extracting the dominant frequencies.
The architecture seems to make heavy use of FreeRTOS queues. I don't know what facilities ESPHome has for a similar architecture, so the above project may not be a trivial port.
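For illustration, the FreeRTOS pattern being described is roughly a producer task feeding fixed-size sample buffers through a queue to a consumer task. This is only a sketch with invented names and sizes, not code from the linked project:

#include <cstdint>
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include "freertos/task.h"

struct AudioChunk {
  int16_t samples[256];  // arbitrary chunk size for the sketch
};

static QueueHandle_t audio_queue;

void reader_task(void *arg) {
  AudioChunk chunk;
  for (;;) {
    // ... fill chunk.samples from the I2S driver here ...
    xQueueSend(audio_queue, &chunk, portMAX_DELAY);  // the queue copies the chunk
  }
}

void analysis_task(void *arg) {
  AudioChunk chunk;
  for (;;) {
    if (xQueueReceive(audio_queue, &chunk, portMAX_DELAY) == pdTRUE) {
      // ... run FFT / dominant-frequency analysis on chunk.samples ...
    }
  }
}

void start_audio_pipeline() {
  audio_queue = xQueueCreate(4, sizeof(AudioChunk));
  xTaskCreate(reader_task, "i2s_reader", 4096, nullptr, 5, nullptr);
  xTaskCreate(analysis_task, "analysis", 4096, nullptr, 4, nullptr);
}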
I'd be mostly interested in streaming audio via I2S. I've never attempted this, but a quick search turned up this very recent gist: https://gist.github.com/GrahamM/1d5ded26b23f808a80520e8c1510713a
There's also a YouTube channel "The Project" with lots of experiments using an INMP441 mic with an ESP32.
I've also bought the same board, and the demo's microphone usage looks interesting.
atomic14 also has relevant experiments: https://www.youtube.com/playlist?list=PL5vDt5AALlRfGVUv2x7riDMIOX34udtKD, https://github.com/atomic14/esp32_audio
Output is done: https://esphome.io/components/media_player/i2s_audio.html
If we could use ESPHome to make a smart speaker similar to the Google Nest Mini, the home DIY world would be much more complete.
Starting with a noise monitor use case would be ideal. Then audio baby monitor, then 2-way audio, then full smart speaker.
I would love to have this for use with the M5 Atom Echo smart speaker. It has a button and a mic; my use case would be to press the button and send the audio (until I release it) to HA as a voice command.
I've been waiting for this for two years... I made a doorbell with an ESP32-CAM, a capacitive touch button, and a DFPlayer for text-to-speech, but the missing mic support is so frustrating...
Another solution I tried was Jitsi Meet on a Raspberry Pi 4, but putting an RPi4 in a 3D-printed PLA box by the road in front of my house was not a good solution... rain, sun... and the latency over a Wi-Fi connection... too bad.
A year has passed since the first post... I think we'll have to wait at least one more... lol
I would also find this really useful. Just the raw audio, maybe over UDP or something. Able to be pulled into other things (Rhasspy, for example).
I have subscribed, as I'm also interested. It would be great to be able to use the MEMS microphone on the Enviro+ over I2S to read noise levels. I'm not so interested in raw sound input.
The ESP32 Muse from Raspiaudio arrived yesterday which also has a mic. Would be awesome to be able to make use of it!
More examples, if they're of any help:
- https://github.com/atomic14/esp32_wireless_microphone
- https://github.com/ikostoski/esp32-i2s-slm
- https://iotassistant.io/esp32/smart-door-bell-noise-meter-using-fft-esp32/
This is probably superseded by @ristomatti's links, but I figure it may still be of interest: a while back I managed to get a sound sensor working in ESPHome with a custom sensor. It was cobbled together from various sources, and I ended up abandoning it because I couldn't get it to run stably on my board (TTGO camera board) at the same time as the camera, but it seems to run okay without the camera. Someone more knowledgeable than me might well be able to fix that though (I hope!).
https://gist.github.com/krisnoble/6ffef6aa68c374b7f519bbe9593e0c4b
I am looking to take this task on. I have some prototypes working on some different mics. I just need a little help on the design of this.
I am going to start with a basic decibel meter but after that it gets a little more complex. What should the output of a "generic microphone" be?
Can triggers pass arrays of data around, or will we need to make a different audio component for each use case (decibel meter, udpstream, hermes)? What if you have a situation like the M5 Echo, where the I2S bus is shared between mic and speaker?
@mrhatman please come on Discord, there's our dev channel where our devs are lively answering such questions!
@mrhatman, indeed, implementing just a noise meter vs. implementing something more generic and flexible might become complicated very quickly. I'm not sure anyone could easily propose the best architecture/design; that's definitely one of the challenges for whoever takes on this task. I think it would be helpful if you provided your own vision, or a few alternatives varying in degree of flexibility/complexity, going from the most basic (just a single noise meter component) to the most flexible one, which might include multiple inputs/outputs.
Considering a flexible solution, one might draw some inspiration from libraries like the Teensy Audio Library or Arduino Audio Tools, which operate on an interconnected graph of audio nodes of different types.
For example, you can draw a diagram in a GUI design tool:
[image: audio node graph drawn in the design tool]
and export it to code:
#include <Audio.h>

AudioInputI2S      i2s1;      //xy=80,404
AudioFilterBiquad  biquad1;   //xy=355,180
AudioAnalyzeFFT256 fft256_1;  //xy=373,354
AudioAnalyzePeak   peak1;     //xy=385,593
AudioAnalyzeRMS    rms1;      //xy=392,447
AudioOutputI2S     i2s2;      //xy=681,314
AudioConnection    patchCord1(i2s1, 0, biquad1, 0);
AudioConnection    patchCord2(i2s1, 0, rms1, 0);
AudioConnection    patchCord3(i2s1, 0, fft256_1, 0);
AudioConnection    patchCord4(i2s1, 1, peak1, 0);
AudioConnection    patchCord5(biquad1, 0, i2s2, 0);
Applying this approach to ESPHome and noise meter task, a hypothetical config might look something like this:
audio:
  - platform: i2s_input
    id: i2s_1
    lrclk_pin: GPIO33
    dout_pin: GPIO22
    bclk_pin: GPIO19
    mode: mono
    sampling_rate: 48000
  - platform: iir_filter
    id: mic_eq
    b: [1, 2, 3]
    a: [3, 4, 5]
  - platform: noise_meter
    id: noise_meter_1
    freq_weighting: A
    time_weighting: fast
    Leq:
      name: LAeq_fast
    Lpeak:
      name: LApeak
    Lmax:
      name: LAmax_fast
  - platform: noise_meter
    id: noise_meter_2
    freq_weighting: C
    time_weighting: slow
    Leq:
      name: LCeq_slow
    Lpeak:
      name: LCpeak
  - connections:
      - source: i2s_1
        destination: mic_eq
      - source: mic_eq
        destination: noise_meter_1
      - source: mic_eq
        destination: noise_meter_2
Of course, I understand that this would require much more work than just implementing a single noise meter component, so maybe there could be other practical options, not as flexible as this approach, but flexible enough and simpler to implement.
PS. I'm not an ESPHome or audio expert, so don't take this too seriously.
I am looking to take this task on.
<3
I have some prototypes working on some different mics. I just need a little help on the design of this.
I am going to start with a basic decibel meter but after that it gets a little more complex. What should the output of a "generic microphone" be?
For use with ESPHome, I think the most value is in a component that works like what @stas-sl mocked up: a series of small, single-purpose components that are piped together as needed.
I would imagine that most people looking to use audio IN with ESPHome are trying to react to loud noises or some specific sound (glass breaking, dog barking, hands clapping), or possibly a few different words/spoken commands (which, really, are also just filters for certain waveforms, but distinctly more challenging!). I'll come back to the spoken word / specific sounds and bi-directional audio in a sec... those are more complicated!
The algorithms for filtering out specific frequency bands or energy levels could be extensions of the existing ESPHome filters, so it would then be pretty simple to have a basic "template number" that is incremented every time audio between X and Y Hz and above Z dB is detected. With some basic tuning around the dB level, building a "when I clap three times quickly, turn this GPIO on/off" configuration in ESPHome should be easy enough, which is probably enough for most use cases.
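To make that concrete, checking for energy in a single band doesn't even need a full FFT; a Goertzel filter at one target frequency is cheap enough to run per buffer. A minimal sketch (names, frequencies, and thresholds invented; nothing here is existing ESPHome code):

#include <cmath>
#include <cstdint>

// Power of `target_hz` in a block of n samples (Goertzel algorithm).
float goertzel_power(const int16_t *samples, int n, float target_hz, float sample_rate_hz) {
  constexpr float kPi = 3.14159265f;
  const float k = std::round(n * target_hz / sample_rate_hz);
  const float coeff = 2.0f * std::cos(2.0f * kPi * k / n);
  float s1 = 0.0f, s2 = 0.0f;
  for (int i = 0; i < n; i++) {
    const float s = samples[i] + coeff * s1 - s2;
    s2 = s1;
    s1 = s;
  }
  return s1 * s1 + s2 * s2 - coeff * s1 * s2;
}

// A "clap" is band power above a tuned threshold; counting three of
// these within a short window would drive the GPIO automation.
bool loud_in_band(const int16_t *samples, int n) {
  const float power = goertzel_power(samples, n, 4650.0f, 16000.0f);  // ~4.3-5.0 kHz band centre
  return power > 1e9f;  // threshold needs tuning per mic and gain
}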
As for voice / bi-di audio: there's a reason why Alexa/GoogleHome are "dumb pipes" that are good at detecting a wake word and piping all the audio directly to some very beefy computers.
There are entire frameworks and platforms, analogous to ESPHome, dedicated to the more complicated waveform/pattern-matching required for arbitrary command recognition. There's a whole ECOSYSTEM of training data / models / tuning for the models that then get plugged into those frameworks, and I don't think "add TensorFlow Lite and TinyML into ESPHome" is a trivial task... certainly a bit more work than "implement basic microphone support"!
Additionally, figuring out if/how to gracefully degrade the experience on ESP32 devices that don't have dedicated peripherals is tricky. The S3 version does have a dedicated co-processor for this stuff, but it only works with models that Espressif supplies... I think. At any rate, I remember that they supported "wake words" and you had to pay them to build a new model for a new wake word, as of the last time I looked into this, about a year ago. I don't know if the process of building the model was ever open-sourced. I also do not know what the plan/strategy would be for people that use the Arduino framework with ESPHome rather than the esp-idf framework. This might be one of those situations where you have to pick between Arduino-exclusive features (like web_server) and audio reactivity (only implemented with esp-idf).
Could you use TensorFlow just on the CPU with a model / wake word of your choosing? Yes. Will there be enough additional CPU to also handle MQTT, BTLE connections, OTA updates and all the other stuff that ESPHome is doing? Maybe. Will TensorFlow/TinyML be easy to shove into the "ESPHome will call you every once in a while" loop/model? I don't know.
Ultimately, scoping the work to a more basic "users can react to certain levels of energy in certain frequency bands" is achievable without expanding it to "and also get TinyML integrated with ESPHome", and it probably works better in the general ESPHome paradigm, where your "do_work()" function is called every once in a while by the main ESPHome loop, rather than being a dedicated loop that's constantly waiting for new data from the FFT or possibly an interrupt from the audio/ML co-processor.
Exposing (a portion of) the underlying API to get the raw samples should be possible for people wishing to do more complicated things like voice/word detection or encoding/streaming the audio to another computer but trying to add "first class" support for that might be tricky!
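As a sketch of how that "called every once in a while" model could host a basic meter: some driver task fills a shared buffer, and a polled component just summarises whatever has accumulated and publishes one number. The class below is hypothetical, not an existing component; only PollingComponent and publish_state() are real ESPHome API:

#include <cmath>
#include <cstddef>
#include <cstdint>
#include "esphome/core/component.h"
#include "esphome/components/sensor/sensor.h"

// Filled elsewhere by an I2S reader task (omitted in this sketch).
static int16_t g_samples[1024];
static volatile size_t g_sample_count = 0;

class SoundLevelSensor : public esphome::PollingComponent, public esphome::sensor::Sensor {
 public:
  SoundLevelSensor() : PollingComponent(1000 /* ms */) {}

  void update() override {
    const size_t n = g_sample_count;
    if (n == 0)
      return;
    double sum_sq = 0.0;
    for (size_t i = 0; i < n; i++)
      sum_sq += double(g_samples[i]) * g_samples[i];
    g_sample_count = 0;
    this->publish_state(std::sqrt(sum_sq / n));  // RMS in raw sample units
  }
};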
Can triggers pass arrays of data around, or will we need to make a different audio component for each use case (decibel meter, udpstream, hermes)? What if you have a situation like the M5 Echo, where the I2S bus is shared between mic and speaker?
More than a few ESPHome components use the "sensor on a bus" model, e.g. you feed an instance of the modbus component into various sensors. This might mean a slight refactor of the existing i2s_audio.
From:
# Example configuration entry
media_player:
  - platform: i2s_audio
    name: ESPHome I2S Media Player
    dac_type: external
    i2s_lrclk_pin: GPIO33
    i2s_dout_pin: GPIO22
    i2s_bclk_pin: GPIO19
    mode: mono
to:
i2s:
  - id: main_audio_bus
    i2s_lrclk_pin: GPIO33
    i2s_dout_pin: GPIO22
    i2s_bclk_pin: GPIO19

media_player:
  - platform: i2s_audio
    i2s_bus: main_audio_bus
    dac_type: external
    mode: mono

microphone:
  - id: ext_mic
    platform: i2s_microphone
    i2s_bus: main_audio_bus

sensor:
  - platform: template
    name: "Clap Count"
    filters:
      # Only care about sounds between X and Y Hz
      # Might also combine with OR: to react to multiple bands
      - audio_freq_filter_out:
          lower: 4.3 kHz
          upper: 5.0 kHz
      # More filters here to check the dB level and then to sum up
      # the number of times the dB threshold is crossed
Thoughts?
I really like the pipe layout for audio and modeling it off of Arduino audio libraries with sources and sinks, likely using callbacks to communicate buffers of data from component to component. You could create various audio sources:
- I2S
- ADC (internal or external)
- Bluetooth audio (connect to your phone to stream music around)
- Media Player
Different filters and passthrough operations:
- Frequency filter
- Pop rejection
- Levelizer
And finally you could have different audio sinks:
- I2S
- DAC
- Bluetooth speaker
- Hermes
- UDP audio stream
- Tensorflow
- Wakeword engine
- Clap detector
I2S would likely be both a source and a sink on some devices, but that should work.
Some examples:
# Existing Media Player
media_player:
  - id: media_player

i2s:
  - id: main_audio_bus
    i2s_lrclk_pin: GPIO33
    i2s_dout_pin: GPIO22
    i2s_bclk_pin: GPIO19
    dac_type: external
    mode: mono
    output_audio_source: media_player
# Decibel Meter
decibel_meter:
  - id: decibel_meter
    audio_source: main_audio_bus

i2s:
  - id: main_audio_bus
    i2s_lrclk_pin: GPIO33
    i2s_din_pin: GPIO22
    i2s_bclk_pin: GPIO19
    mode: mono
# Filter Example
decibel_meter:
  - id: decibel_meter
    audio_source: filtered_audio_bus

audio_filter:
  - id: filtered_audio_bus
    audio_source: main_audio_bus

i2s:
  - id: main_audio_bus
    i2s_lrclk_pin: GPIO33
    i2s_din_pin: GPIO22
    i2s_bclk_pin: GPIO19
    mode: mono
Thanks for looking at this. Make sure you link up with the ESPHome team, since I think Nabu Casa are prioritising audio projects in 2023.
Look at the i2c examples for inspiration; I think it's exactly analogous. They use i2c components to define the buses, then sensor components that reference those bus IDs.
I'm glad there is some kind of agreement at a higher level that it should be a graph of nodes piped together, to make it as flexible and extendable as possible. However, I'd like to discuss some smaller details, which seem important as well. I don't have a strong opinion on them; I just want to explain what makes sense to me and why.
- I propose moving all audio-related components under a separate config section/C++ namespace, the same way light/fan/display components have their own section in the config, instead of scattering components like media_player and microphone at the top level.
audio:
  - platform: i2s
  - platform: media_player
  - platform: microphone
  - platform: gain
  - platform: equalizer
  - platform: noise_meter
  ...
Well, actually, I'm not sure there should be a microphone component at all, as it is essentially just an i2s device/stream/bus.
- Specifying node connections. Besides the nodes themselves, you need to specify which outputs go into which inputs. There are probably several ways to do it: 1) as @mrhatman did, by specifying audio_source or output_audio_source per component, or 2) by specifying all the connections separately, after all nodes are declared, as I showed in my example above, inspired by how it is done in the Teensy library. Either option should work, but IMHO the second way is cleaner and easier to understand, especially if there is a larger graph and each component has multiple inputs/outputs. For example, the immediate question I have with the 1st option is where you should specify connections: in the child node specifying inputs, or in the parent node specifying outputs. I see both options, audio_source and output_audio_source, in the example, which looks a bit confusing to me. Unlike the Teensy library, Arduino Audio Tools doesn't have a single way of specifying connections: sometimes they pass an input stream to the constructor, sometimes an output stream, and sometimes a separate stream copier class is responsible for data propagation. I haven't seen bigger examples with more than 3-4 nodes, but I guess it might become hard to follow how the flow goes.
#include "AudioTools.h"

I2SStream in;
I2SStream out;
// copy filtered values
FilteredStream<int16_t, float> filtered(in, channels);  // defines the filter as a BaseConverter
StreamCopy copier(out, filtered);                       // copies sound into i2s

void loop() {
  copier.copy();
}

and:

#include "AudioTools.h"

SineWaveGenerator<int16_t> sine_wave(32000);         // subclass of SoundGenerator with max amplitude of 32000
GeneratedSoundStream<int16_t> in_stream(sine_wave);  // stream generated from sine wave
CsvStream<int16_t> out(Serial, to_channels);         // output to Serial
ChannelFormatConverterStreamT<int16_t> conv(out);
StreamCopy copier(conv, in_stream);                  // copies sound to out

void loop() {
  copier.copy();
}
I don't know the exact syntax for specifying connections separately in the config (I could propose a few), but I see benefits in declaring them outside of the nodes themselves.
- The media_player component definitely has to change a lot with this approach. As I understand it, it currently has 2 responsibilities: 1) loading/decoding media from a URL, and 2) streaming audio data to I2S. With the proposed approach, the 2nd part would be implemented as a separate node/component like I2SStream/I2SDevice, so media_player should only be responsible for loading the data and piping it downstream. As I mentioned in the 1st point, I would propose not giving media_player a separate top-level config section, but instead moving it under the audio section, the same way as many other possible audio sources.
- @mrhatman, I'm not sure I fully understand how you are going to use callbacks, but I would highly recommend studying how it is implemented in the libraries I mentioned (or maybe some others), so as not to reinvent the wheel. As I understand it, there should be some AudioStream base class with read/write methods, and each node should be aware of its immediate child nodes, so that after data is written to a parent node and processed, it is passed downstream to all children, either by calling child->write(data) directly for each child, or by queueing/scheduling it. Of course, that is my naive and simplistic understanding, and it has to be much more complicated than this.
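For what it's worth, the naive version of that description might look like this (class names invented for illustration, not an actual ESPHome API):

#include <cstddef>
#include <cstdint>
#include <vector>

// A minimal audio node: write() processes a buffer, then fans it out
// to every registered child node.
class AudioNode {
 public:
  virtual ~AudioNode() = default;

  void add_child(AudioNode *child) { children_.push_back(child); }

  virtual void write(const int16_t *data, size_t len) {
    this->process(data, len);
    for (AudioNode *child : children_)
      child->write(data, len);
  }

 protected:
  // Sinks (decibel meter, UDP stream, ...) override process() only;
  // a filter would override write() to forward a transformed buffer.
  virtual void process(const int16_t *data, size_t len) {}

 private:
  std::vector<AudioNode *> children_;
};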
This initial discussion about I2S audio has kind of ballooned into a discussion about how we want audio handling in ESPHome to work, so I agree with you @jamesmyatt, we might want to bring the ESPHome team into this discussion. I just wouldn't know where to start.
I am going to start prototyping 2 proofs of concept with the node architecture and see how people feel about the implementation:
- an I2S microphone that feeds into a basic decibel/RMS meter (the core computation is sketched below)
- a sine wave generator that feeds into an I2S speaker on the same bus as the mic
I feel these are the basic demos that will give a feel for how the audio handling will work.
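For reference, the core computation of the first demo is small: block RMS expressed in dBFS, plus a datasheet-sensitivity offset to approximate SPL. A sketch (the 94 dB / -26 dBFS figures are just an example, not from any specific mic):

#include <cmath>
#include <cstddef>
#include <cstdint>

// RMS of a sample block in dB relative to 16-bit full scale.
float rms_dbfs(const int16_t *samples, size_t n) {
  double sum_sq = 0.0;
  for (size_t i = 0; i < n; i++)
    sum_sq += double(samples[i]) * samples[i];
  const double rms = std::sqrt(sum_sq / double(n));
  return rms > 0.0 ? 20.0f * std::log10(float(rms / 32768.0)) : -120.0f;
}

// With a mic sensitivity of e.g. -26 dBFS at 94 dB SPL, an approximate
// sound pressure level is: 94.0f + rms_dbfs(buf, n) - (-26.0f)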
@stas-sl to respond to your thoughts:
- I think this makes sense; I like things being organized. Microphone will be replaced by "I2S in" or something like that; it is just an audio source.
- To clarify a bit here, an I2S bus in my example could be both an input and an output. "Output audio source" was poorly named; I think a bus setup like @jamesmyatt suggested would clarify that better.
- I agree, the existing media player will just be an audio source with some nice-to-haves (pausing, playing, etc.), and then we can forward it to any number of audio sinks like an I2S speaker, Bluetooth speaker, etc.
- The callbacks were basically the equivalent of read/write, as the data from the audio source might be intermittent.
https://github.com/syssi/esphome-zeroconf and https://github.com/esphome/esphome/pull/4202 could be used to announce media services on the network with mDNS.
First pass of the decibel meter using I2S audio
https://github.com/esphome/esphome/commit/d7fc67c5daaab408fe80d4b9d254fcfd306dc026
Not finished and needs a lot of work but it works for me at least.
Tested on an M5 Echo using this config:
audio_source:
  - platform: i2s_audio_source
    id: i2s_mic
    name: "i2s_mic"
    i2s_lrclk_pin: GPIO33
    i2s_din_pin: GPIO23
    i2s_bclk_pin: GPIO19

sensor:
  - platform: decibel_meter
    sensitivity: -23
    audio_source: i2s_mic
    name: decibels
If I should make a PR to track instead let me know.
If I should make a PR to track instead let me know.
Looks good. You'll want a PR once you're ready for review from @jesserockz so may as well open one now.
Could anyone please share a list of supported hardware?
I think you need to add fields like "update_interval".
Could anyone please share a list of supported hardware?
As of right now, the only tested microphone is the SPM1423 (PDM), but any mic that supports an I2S interface should work.
How do we want to handle audio specifications for the connections? Mono/stereo, bytes per sample, frequency, etc. The current example is hardcoded to 16 kHz, 16-bit, mono.
I think the easiest way would be to have the audio sources declare their configuration and let the receiving component handle the conversion if needed.
audio_source:
  - platform: i2s_audio_source
    id: i2s_mic
    name: "i2s_mic"
    i2s_lrclk_pin: GPIO33
    i2s_din_pin: GPIO23
    i2s_bclk_pin: GPIO19
    bits_per_sample: 16
    audio_frequency: 16000
    mode: mono

sensor:
  - platform: decibel_meter
    sensitivity: -23
    audio_source: i2s_mic
    name: decibels
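As a sketch of what "sources declare, sinks convert" could mean in code (struct and function names invented): the source exposes a format descriptor, and a sink that expects mono performs its own downmix when the declared channel count doesn't match:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical format descriptor a source would declare.
struct AudioFormat {
  uint32_t sample_rate;     // e.g. 16000
  uint8_t bits_per_sample;  // e.g. 16
  uint8_t channels;         // 1 = mono, 2 = stereo
};

// Example conversion a receiving component might apply:
// interleaved stereo -> mono by averaging the two channels.
std::vector<int16_t> stereo_to_mono(const std::vector<int16_t> &interleaved) {
  std::vector<int16_t> mono(interleaved.size() / 2);
  for (size_t i = 0; i < mono.size(); i++)
    mono[i] = int16_t((int32_t(interleaved[2 * i]) + int32_t(interleaved[2 * i + 1])) / 2);
  return mono;
}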