There are many cool projects you can do with a cheap MEMS microphone, an ESP32 and a Fast Fourier Transform (FFT). For me the most exciting one was to transform my classic door bell and smoke alarm into “smart” alarms! Also, using FFT I can monitor the ambient noise inside my apartment. If I’m away and someone rings the door bell or the smoke alarm goes off, I instantly get a notification on my phone.

## Fast Fourier Transform (FFT)

A Fast Fourier Transform algorithm allows us to convert a signal (in this case the sound) from the time domain to the frequency domain. It basically means that if we measure the sound over a period of time, we can calculate the frequencies that created it. Knowing that my door bell rings at 1kHz and my smoke alarm at 3kHz, if either of these frequencies is the dominant one for more than 500ms, I can publish an MQTT message to Home Assistant and send a notification to my phone.

There is a lot of theory behind FFT and you don’t have to understand it all to implement this example.

The most important parameters of FFT that you need to understand are:

- The sample rate or sampling frequency (**fs**)

  It is measured in Hz and it is the number of measurements per second, e.g. 48kHz. For an audio signal, this is usually the upper limit of your microphone as defined in the datasheet. The higher the sampling frequency, the higher the frequencies we can detect.

- The number of samples or block length (**BL**)

  This is the number of measurements we use for our calculation and it is always a power of two, e.g. 8, 16, 32, … 1024, 2048. The higher the number, the more precisely we can tell frequencies apart. However, more samples mean more computation, so it is up to you to set this number based on your computing power and accuracy needs.

- The measurement duration (**D**)

  This is the time required to take all the samples. If our sample rate is 48kHz, the microphone takes 48000 measurements in one second. But if we only need 1024 measurements, the duration is *D = BL/fs* = 1024/48000 = 21.3ms.

- The frequency resolution (**df**)

  This is the spacing between two frequency results and it is defined as *df = fs/BL* = 48000/1024 = 46.88Hz. In practice this means that it will be impossible to distinguish between a frequency of 4670Hz and 4680Hz, because the difference is less than the resolution.

- The Nyquist frequency (**fn**)

  Based on the Nyquist theorem, this is the maximum frequency that can be accurately determined by FFT and it is calculated as *fn = fs / 2*. So we need a sample rate of at least 48kHz to be able to detect a frequency of 24kHz (the range of human hearing is from 20Hz to 20kHz).

Every FFT implementation takes as input an array of *BL* values (BL = 1024 in our example above). It is up to us to make sure these values are sampled correctly (at a fixed sample rate)! The result is also an array of the same size as the input (1024 returned values). We call these values **bins**. The value of each bin represents the amplitude of a frequency in the measurement. When our door bell rings, the value of the 1kHz bin will be very high compared with the other bins.
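To see the bins in action without any hardware, here is a small desktop sketch. It is not the FFT library used later in this post: it uses a naive (slow) discrete Fourier transform, which produces the same bins as an FFT, and a synthetic sine wave instead of a microphone. All function names are invented for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Generate n samples of a sine wave of freq_hz, sampled at fs_hz.
std::vector<double> sineWave(double freq_hz, double fs_hz, std::size_t n)
{
    const double kPi = 3.14159265358979323846;
    std::vector<double> x(n);
    for (std::size_t t = 0; t < n; t++) {
        x[t] = std::sin(2.0 * kPi * freq_hz * t / fs_hz);
    }
    return x;
}

// Naive O(N^2) DFT: returns the magnitude of the first N/2 bins
// (the ones below the Nyquist frequency).
std::vector<double> dftMagnitudes(const std::vector<double> &x)
{
    const double kPi = 3.14159265358979323846;
    const std::size_t n = x.size();
    std::vector<double> mag(n / 2);
    for (std::size_t k = 0; k < n / 2; k++) {
        double re = 0.0, im = 0.0;
        for (std::size_t t = 0; t < n; t++) {
            double angle = 2.0 * kPi * k * t / n;
            re += x[t] * std::cos(angle);
            im -= x[t] * std::sin(angle);
        }
        mag[k] = std::sqrt(re * re + im * im);
    }
    return mag;
}

// Index of the bin with the largest magnitude, skipping bin 0 (DC).
std::size_t peakBin(const std::vector<double> &mag)
{
    std::size_t best = 1;
    for (std::size_t k = 2; k < mag.size(); k++) {
        if (mag[k] > mag[best]) {
            best = k;
        }
    }
    return best;
}
```

Feeding it a 1kHz tone, `peakBin(dftMagnitudes(sineWave(1000.0, 48000.0, 1024)))`, should report bin 21, which is the bin that contains 1kHz when fs = 48kHz and BL = 1024.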

Each bin spans a range equal to the frequency resolution (df). So bin[0] represents the frequencies from 0Hz to 46.88Hz, bin[1] represents 46.88Hz to 93.76Hz and so on. However, due to the Nyquist theorem, only the first half of the bins contain useful values (in our example, from 0Hz up to 24kHz at bin[512]), which is half of the sample rate.

As an example, if we want to get the amplitude for 1kHz for an audio signal sampled at 48kHz with 1024 samples, we will look at bin 21 (1000Hz/**df** = 1000Hz/46.88Hz = 21.33). Bin 21 actually covers frequencies from 984.38Hz to 1031.25Hz, hence the decimal value for the bin.
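The same lookup can be written as a tiny helper. This is only a sketch that follows the convention used in this post (bin k covers the range from k × df up to (k + 1) × df); the function names are hypothetical.

```cpp
#include <cmath>

// Index of the bin that contains freq_hz, assuming bin k covers
// [k * df, (k + 1) * df) as described in the text.
unsigned frequencyToBin(double freq_hz, double fs_hz, unsigned block_length)
{
    double df = fs_hz / block_length;
    return static_cast<unsigned>(std::floor(freq_hz / df));
}

// Lower edge of bin k, in Hz.
double binLowerEdgeHz(unsigned bin, double fs_hz, unsigned block_length)
{
    return bin * fs_hz / block_length;
}
```

With fs = 48kHz and BL = 1024, `frequencyToBin(1000.0, 48000.0, 1024)` is 21 and `binLowerEdgeHz(21, 48000.0, 1024)` is 984.375Hz.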

## Connecting INMP441 I2S microphone to ESP32

A good cheap microphone is the INMP441 and you can find it on AliExpress for about $3. The INMP441 is an omnidirectional digital MEMS microphone, a type of microphone that is used in most modern voice recognition devices like Google Home or Alexa. For this price, the specifications are quite good, with a flat frequency response from 60Hz to 15kHz.

The INMP441 has an I2S interface and can be connected directly to the ESP32 without any extra components. For this microphone, 6 wires are required and the connection can be done like this:

- INMP441 GND to ESP32 GND
- INMP441 VDD to ESP32 3.3V
- INMP441 SD to ESP32 GPIO32/D32
- INMP441 SCK to ESP32 GPIO14/D14
- INMP441 WS to ESP32 GPIO15/D15
- INMP441 L/R to ESP32 GND

Make sure to connect L/R to GND, otherwise the microphone will produce noise.

These pins can be changed and configured with `i2s_pin_config_t`.

## Audio spectrum analyser with Friture

Friture is a free real-time audio analyser for Linux, macOS and Windows. You can use it to check the exact pattern and frequency of your trigger sound: the door bell or fire alarm.

This screenshot was taken while playing my door bell sound. We can clearly identify the pattern and the peak frequencies just by looking at the image on the bottom left. You can see that the bell plays 1.5kHz for 150ms, then 1kHz for 50ms, again 1.5kHz for 150ms, followed by a long 800ms 1kHz sound.

In the top graph we can also see the peak frequencies more precisely.

Once you know the frequencies that make up your trigger sound, you can start thinking about the easiest way to detect it with an FFT library on the ESP32.

## Door bell and fire alarm detector using FFT on ESP32

As I described in the first part, the output of FFT is an array of values that correspond to the amplitudes of all the frequencies in the range. This means we can simply detect the peak frequency or get the amplitude of a frequency at any given time.

However, just comparing the peak frequency with the trigger frequency (1kHz for my door bell) is not enough. When music or the TV is playing, having 1kHz dominant for a fraction of a second is not unusual and we will get a false positive trigger.

The way I implemented the frequency matching over a longer period of time is using a FIFO queue of 1s and 0s. For each FFT, if the peak frequency is matching the required frequency (1kHz), I add a 1 to the queue, otherwise a 0. Knowing how long each FFT takes (e.g. 21.3ms), a queue of 00001111 means that in the past 170.4ms (8×21.3ms), 1kHz was the dominant frequency for the last 85.2ms.

My queue is basically a 32-bit integer. For each computation, I shift the bits to the left and add a 1 or 0 at the end, depending on whether the dominant frequency matches the trigger frequency. Each integer can only “store” one frequency for the past 32 × **D** milliseconds, but this is enough for my case. A combination of integers matching different frequencies can “detect” more complex sound patterns.

```cpp
bool detectFrequency(unsigned int *mem, unsigned int minMatch, double peak, unsigned int bin)
{
    /*
     * mem is a pointer to our "queue", a 32-bit int
     * minMatch is the minimum number of 1s to have in the queue to trigger the alarm
     * peak is the peak bin detected
     * bin is the first desired frequency bin
     *
     * returns true if dominant frequency is detected more than minMatch times over 32*D ms
     */

    // shift bits left
    *mem = *mem << 1;
    if (peak == bin) {
        // set last bit to 1 if peak bin matches desired bin
        *mem |= 1;
    }

    // how many bits are 1? return true if over threshold
    if (countSetBits(*mem) >= minMatch) {
        return true;
    }

    return false;
}
```
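As an illustration of the “combination of integers” idea, here is one possible sketch, not the code I run on my ESP32. It keeps one bit queue per tone and asks whether the first tone was seen in the older half of the history and the second tone in the newer half. The `PatternDetector` struct, the half/half split and the thresholds are all assumptions made up for this example.

```cpp
#include <cstdint>

// Count the bits set to 1.
unsigned countSetBits(uint32_t n)
{
    unsigned count = 0;
    while (n) {
        count += n & 1u;
        n >>= 1;
    }
    return count;
}

// Hypothetical two-tone detector: one 32-bit queue per tone,
// newest result in bit 0, oldest in bit 31.
struct PatternDetector {
    uint32_t first = 0;   // history of "peak matched the first tone"
    uint32_t second = 0;  // history of "peak matched the second tone"

    // Record one FFT result; returns true when the first tone shows up
    // at least minFirst times in the older 16 results AND the second tone
    // at least minSecond times in the newer 16 results.
    bool update(unsigned peakBin, unsigned firstBin, unsigned secondBin,
                unsigned minFirst, unsigned minSecond)
    {
        first = (first << 1) | (peakBin == firstBin ? 1u : 0u);
        second = (second << 1) | (peakBin == secondBin ? 1u : 0u);

        unsigned firstOld = countSetBits(first >> 16);       // bits 16..31
        unsigned secondNew = countSetBits(second & 0xFFFFu); // bits 0..15
        return firstOld >= minFirst && secondNew >= minSecond;
    }
};
```

Calling `update(peak, 68, 45, 3, 10)` after every FFT would, under these assumptions, only fire when bin 68 was dominant a little earlier and bin 45 has been dominant recently; the same tones in the opposite order would not match.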

The code I used for this on my ESP32 is an adaptation of the EspArduinoSensor project. The full code looks like this:

```cpp
#include <Arduino.h>
#include <driver/i2s.h>
#include "arduinoFFT.h"

// size of noise sample
#define SAMPLES 1024

const i2s_port_t I2S_PORT = I2S_NUM_0;
const int BLOCK_SIZE = SAMPLES;

#define OCTAVES 9

// our FFT data
static float real[SAMPLES];
static float imag[SAMPLES];
// the last argument is the sampling frequency; with SAMPLES here,
// MajorPeak() should report a bin index rather than a frequency in Hz
static arduinoFFT fft(real, imag, SAMPLES, SAMPLES);
static float energy[OCTAVES];
// A-weighting curve from 31.5 Hz ... 8000 Hz
static const float aweighting[] = {-39.4, -26.2, -16.1, -8.6, -3.2, 0.0, 1.2, 1.0, -1.1};

static unsigned int bell = 0;
static unsigned int fireAlarm = 0;
static unsigned long ts = millis();
static unsigned long last = micros();
static unsigned int sum = 0;
static unsigned int mn = 9999;
static unsigned int mx = 0;
static unsigned int cnt = 0;
static unsigned long lastTrigger[2] = {0, 0};

static void integerToFloat(int32_t *integer, float *vReal, float *vImag, uint16_t samples)
{
    for (uint16_t i = 0; i < samples; i++) {
        vReal[i] = (integer[i] >> 16) / 10.0;
        vImag[i] = 0.0;
    }
}

// calculates energy from Re and Im parts and places it back in the Re part (Im part is zeroed)
static void calculateEnergy(float *vReal, float *vImag, uint16_t samples)
{
    for (uint16_t i = 0; i < samples; i++) {
        vReal[i] = sq(vReal[i]) + sq(vImag[i]);
        vImag[i] = 0.0;
    }
}

// sums up energy in bins per octave
static void sumEnergy(const float *bins, float *energies, int bin_size, int num_octaves)
{
    // skip the first bin
    int bin = bin_size;
    for (int octave = 0; octave < num_octaves; octave++) {
        float sum = 0.0;
        for (int i = 0; i < bin_size; i++) {
            sum += bins[bin++];
        }
        energies[octave] = sum;
        bin_size *= 2;
    }
}

static float decibel(float v)
{
    return 10.0 * log(v) / log(10);
}

// converts energy to logarithmic, returns A-weighted sum
static float calculateLoudness(float *energies, const float *weights, int num_octaves, float scale)
{
    float sum = 0.0;
    for (int i = 0; i < num_octaves; i++) {
        float energy = scale * energies[i];
        sum += energy * pow(10, weights[i] / 10.0);
        energies[i] = decibel(energy);
    }
    return decibel(sum);
}

void setup(void)
{
    Serial.begin(115200);
    Serial.println("Configuring I2S...");
    esp_err_t err;

    // The I2S config as per the example
    const i2s_config_t i2s_config = {
        .mode = i2s_mode_t(I2S_MODE_MASTER | I2S_MODE_RX), // Receive, not transfer
        .sample_rate = 22627,
        .bits_per_sample = I2S_BITS_PER_SAMPLE_32BIT,
        .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT, // for old esp-idf versions use RIGHT
        .communication_format = i2s_comm_format_t(I2S_COMM_FORMAT_I2S | I2S_COMM_FORMAT_I2S_MSB),
        .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1, // Interrupt level 1
        .dma_buf_count = 8,                       // number of buffers
        .dma_buf_len = BLOCK_SIZE,                // samples per buffer
        .use_apll = true};

    // The pin config as per the setup
    const i2s_pin_config_t pin_config = {
        .bck_io_num = 14,   // BCLK
        .ws_io_num = 15,    // LRCL
        .data_out_num = -1, // not used (only for speakers)
        .data_in_num = 32   // DOUT
    };

    // Configuring the I2S driver and pins.
    // This function must be called before any I2S driver read/write operations.
    err = i2s_driver_install(I2S_PORT, &i2s_config, 0, NULL);
    if (err != ESP_OK) {
        Serial.printf("Failed installing driver: %d\n", err);
        while (true)
            ;
    }
    err = i2s_set_pin(I2S_PORT, &pin_config);
    if (err != ESP_OK) {
        Serial.printf("Failed setting pin: %d\n", err);
        while (true)
            ;
    }
    Serial.println("I2S driver installed.");
}

unsigned int countSetBits(unsigned int n)
{
    unsigned int count = 0;
    while (n) {
        count += n & 1;
        n >>= 1;
    }
    return count;
}

// detecting 2 frequencies. Set wide to true to match the previous and next bin as well
bool detectFrequency(unsigned int *mem, unsigned int minMatch, double peak, unsigned int bin1, unsigned int bin2, bool wide)
{
    *mem = *mem << 1;
    if (peak == bin1 || peak == bin2 || (wide && (peak == bin1 + 1 || peak == bin1 - 1 || peak == bin2 + 1 || peak == bin2 - 1))) {
        *mem |= 1;
    }

    if (countSetBits(*mem) >= minMatch) {
        return true;
    }

    return false;
}

void sendAlarm(unsigned int index, const char *topic, unsigned int timeout)
{
    // do not publish if the last trigger was less than timeout ms ago
    if (millis() - lastTrigger[index] < timeout) {
        return;
    }

    lastTrigger[index] = millis();
    // publish to mqtt
    // publish(topic, "1");
}

void sendMetrics(const char *topic, unsigned int mn, unsigned int mx, unsigned int avg)
{
    String payload = "{\"min\": ";
    payload += mn;
    payload += ", \"max\":";
    payload += mx;
    payload += ", \"value\":";
    payload += avg;
    payload += "}";

    Serial.println(payload);
    // publish to mqtt
    // publish(topic, (char *)payload.c_str());
}

void calculateMetrics(int val)
{
    cnt++;
    sum += val;

    if (val > mx) {
        mx = val;
    }

    if (val < mn) {
        mn = val;
    }
}

void loop(void)
{
    if (micros() - last < 45200) {
        // send mqtt metrics every 10s while waiting and no trigger is detected
        if (millis() - ts >= 10000 && bell == 0 && fireAlarm == 0) {
            sendMetrics("home/noise", mn, mx, sum / cnt);
            cnt = 0;
            sum = 0;
            mn = 9999;
            mx = 0;
            ts = millis();
        }
        return;
    }

    last = micros();

    static int32_t samples[BLOCK_SIZE];

    // Read multiple samples at once and calculate the sound pressure
    size_t num_bytes_read;
    esp_err_t err = i2s_read(I2S_PORT,
                             samples,
                             sizeof(samples), // size is given in bytes, not elements
                             &num_bytes_read,
                             portMAX_DELAY); // no timeout
    int samples_read = num_bytes_read / 8;   // unused

    // integer to float
    integerToFloat(samples, real, imag, SAMPLES);

    // apply flat top window, optimal for energy calculations
    fft.Windowing(FFT_WIN_TYP_FLT_TOP, FFT_FORWARD);
    fft.Compute(FFT_FORWARD);

    // calculate energy in each bin
    calculateEnergy(real, imag, SAMPLES);

    // sum up energy in bin for each octave
    sumEnergy(real, energy, 1, OCTAVES);

    // calculate loudness per octave + A weighted loudness
    float loudness = calculateLoudness(energy, aweighting, OCTAVES, 1.0);

    unsigned int peak = (int)floor(fft.MajorPeak());
    // Serial.println(peak);

    // detecting 1kHz and 1.5kHz
    if (detectFrequency(&bell, 15, peak, 45, 68, true)) {
        Serial.println("Detected bell");
        sendAlarm(0, "home/alarm/doorbell", 2000);
    }

    // detecting frequencies around 3kHz
    if (detectFrequency(&fireAlarm, 15, peak, 140, 141, true)) {
        Serial.println("Detected fire alarm");
        sendAlarm(1, "home/alarm/fire", 10000);
    }

    calculateMetrics(loudness);
}
```

To compile this code you need the modified Arduino FFT files (header and cpp) from the EspArduinoSensor project! Just place these in the same folder as the .ino file.

The WiFi and MQTT code is not included, but there are plenty of examples online that you can adapt to your case.

Now let’s look at the code a bit and identify the FFT parameters.

- Sample rate (fs) is 22627Hz (the `.sample_rate` field of `i2s_config`)
- Number of samples (BL) is 1024 (the `SAMPLES` define)
- Measurement duration (D) is BL/fs = 45.2ms
- Frequency resolution (df) is fs/BL = 22.1Hz
- Nyquist frequency (fn) is fs/2 = 11.3kHz

Note: Collecting 1024 samples takes 45.2ms, but the `i2s_read` function returns data instantly and the `loop` function is executed in around 4ms. This is because the I2S driver buffers the data at all times (in this code it is set to buffer 8×1024 samples via `dma_buf_count` and `dma_buf_len`). Without the wait at the top of `loop()` (the `micros() - last < 45200` check), we would calculate an FFT every 4ms for the past 45.2ms. This means 41.2ms of overlapping samples every time and, of course, a shorter time span for our 32-bit queue (32×4ms instead of 32×45.2ms).

So for my door bell that rings at exactly 1kHz, I need to check that the peak bin is equal to 45 (1kHz/df). For my fire alarm that rings at around 3kHz, the peak bin is 140 (140 × df ≈ 3.1kHz).

The threshold I use for my door bell is 15 matches, a bit below the number of FFT runs that fit in the long 1kHz tone (800ms/45.2ms ≈ 17.7). This is where you need to do some trial and error. The peak might fall in the bin before or after your trigger frequency, depending on how precise the frequency of the alarm is and on the microphone accuracy.
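As a starting point for that trial and error, the upper bound for the threshold can be estimated from the tone length. This is a rough sketch with a hypothetical helper name; it assumes one FFT per block of BL samples with no overlap.

```cpp
// Maximum number of back-to-back FFT runs that fit inside a tone of tone_ms,
// assuming one FFT per block of block_length samples and no overlap.
unsigned maxMatches(double tone_ms, double fs_hz, unsigned block_length)
{
    double d_ms = 1000.0 * block_length / fs_hz;  // measurement duration D
    return static_cast<unsigned>(tone_ms / d_ms); // whole blocks only
}
```

For the 800ms tone of my door bell, `maxMatches(800.0, 22627.0, 1024)` is 17, so a threshold of 15 leaves a small margin for blocks where the peak lands in a different bin.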

## ESP32 as decibel meter

Since we are running FFT continuously to detect the door bell, it would be a waste of resources not to record the ambient noise as well. I am using a simple function (`calculateMetrics`) to track the minimum, maximum and average noise levels, and every 10 seconds I send these metrics to MQTT/InfluxDB.

It is cool to see how quiet it is at night (around 8 dB) compared to daytime (yeah, I listen to loud music during the day!).

Hi,

What a great article. Currently, I’m building something similar myself and I would like to ask 2 questions:

1. On line 16 (the `arduinoFFT fft(real, imag, SAMPLES, SAMPLES)` declaration), shouldn’t the last parameter be the sampling frequency instead of SAMPLES?

2. Could you please explain in more detail the function integerToFloat? I don’t understand why the original data are shifted right by 16 bits and then divided by 10.

Thank you for your help!

Hi,

It’s a very didactic project, one of the best I’ve found on I2S.

But in line 229 (`int samples_read = num_bytes_read / 8;`), why divide by 8?

Is a single Sample 24 Bits?

Would it be correct to divide by 4?

4 bytes = 32 Bits = I2S_BITS_PER_SAMPLE_32BIT = Number of Samples

Can you help me understand?

Thanks.

Actually that variable is not even used, I don’t remember why it is there

This is great work! Any chance you could make this into an ESPHome module? It would make it significantly more accessible and usable for users.

Hi all. I want to recognize a siren to open a barrier for ambulances. Can you give me a hint on how to remake the code to recognize special signals?
