ESP32 + I2S microphone (INMP441) to implement speech recognition

Introduction.
Connect INMP441 (I2S microphone) to ESP32
Implementation of microphone processing class
Operation check
1. Build and upload
2. Serial monitor check

Introduction.

We are developing “Mia,” a talking cat-shaped robot that speaks in various dialects.

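Mia, the talking cat-shaped pet robot, speaks 47 regional dialects from across Japan (Osaka, Hakata, Kyoto, Okinawa, and more) with over 100 rich facial expressions. A little happiness and convenience for every day. Perfect as a birthday or anniversary gift.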

So far, Mia has only provided one-way audio output, but we would now like to start developing two-way dialogue.

The downstream path (speech synthesis → MQTT → ESP32 playback) is already implemented, so what remains for two-way dialogue is the upstream path: speech recognition → conversion to text by STT.

Connect INMP441 (I2S microphone) to ESP32

In this case, an INMP441 I2S microphone is used as the voice recognition microphone and connected to the ESP32.

Amazon.co.jp: 5-pack INMP441 omnidirectional microphone module, MEMS, high precision, low power, I2S interface, ESP32 compatible : Musical Instruments & Audio Equipment

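There are two main types of microphone used for voice recognition, PDM microphones and I2S microphones. The differences are described in the following article.

Differences between PDM microphones and I2S: why does PDM need a decimation filter?

A PDM microphone is a digital microphone that represents the audio signal as pulse density. The converted signal consists of a dense stream of 1s and 0s and also contains high-frequency components beyond the human audible range (20 Hz to 20 kHz). An I2S microphone, on the other hand, outputs audio data in PCM format...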

The ESP32 has a flexible GPIO matrix, and the I2S signal lines (SCK, WS, SD) can be assigned to any GPIO, so you are not required to use the exact GPIO pins listed below.

However, since SD is a data input, it is a reasonable choice to assign it to one of the ESP32’s input-only pins (GPIO 34 to 39). Because these pins are input-only, their input impedance is high, so little current flows from the signal source (the microphone) and the signal level is easy to hold. That said, for a digital signal such as I2S, as long as the microphone and the ESP32 are wired directly over a short distance, using an output-capable pin instead makes little difference in practice.

VDD → 3.3V: Operating voltage is 3.3V only, so do not connect to 5V
GND → GND
SD (serial data) → GPIO34
WS (word select) → GPIO25
SCK (serial clock) → GPIO33
L/R → GND (left channel): For stereo input, prepare two I2S microphones and tie L/R to GND on one and to VDD on the other to obtain the left and right channels. Stereo is only needed when a sense of space or presence matters. Here, what matters for voice recognition and dialogue is what was said, so left/right separation is unnecessary. In other words, mono is fine.

Since the purpose of this project was verification, we connected INMP441 to a development board we had on hand with jumper wires as shown below.

Implementation of microphone processing class

In microphone.cpp, implement a class that handles input from the I2S microphone.

In this case, we only want to quickly verify that a signal is arriving from the I2S microphone, so we will do the following: “Receive signal from I2S microphone → Display volume level and sample statistics in real time.”

Sampling rate (SAMPLE_RATE): 16 kHz

The number of times per second the sound wave is measured. Unit: Hz (samples per second). The higher the pitch, the faster the wave oscillates, so if the sampling rate is too low the waveform cannot be captured accurately and high-pitched sounds are not recorded well.

Relative to the human audible band (up to about 20 kHz):

16 kHz: well suited to speech recognition and call quality (also sufficient for Google STT)
8 kHz: Telephone sound quality (slightly inferior, but very low bandwidth) → sounds like radio or telephone, a little rough
44.1 kHz or 48 kHz: Music quality (more than this project needs)

→ Set to 16 kHz
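The comparison above follows from the Nyquist limit: a sampling rate of fs can only represent frequencies up to fs/2. The following is a minimal sketch of that arithmetic. The helper name nyquist_hz is made up for this illustration and is not part of the project code.

```cpp
#include <cstdint>

// Highest frequency (Hz) that a given sampling rate can represent (Nyquist limit).
// nyquist_hz is a hypothetical helper used only for this illustration.
constexpr std::uint32_t nyquist_hz(std::uint32_t sample_rate_hz) {
  return sample_rate_hz / 2;
}

// Compile-time checks for the three rates compared in the text.
static_assert(nyquist_hz(8000) == 4000, "8 kHz sampling keeps content up to 4 kHz");
static_assert(nyquist_hz(16000) == 8000, "16 kHz sampling keeps content up to 8 kHz");
static_assert(nyquist_hz(44100) == 22050, "44.1 kHz sampling covers the audible band");
```

At 16 kHz, content up to 8 kHz is kept, which is why it is usually considered enough for speech while 8 kHz sounds like a telephone line.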

Bit depth (bits_per_sample): 16bit

How finely the loudness (amplitude) of each sample is expressed as a number, that is, how many bits (precision) one sample uses: 256 steps (0 to 255) for 8-bit and 65,536 steps (-32,768 to +32,767) for 16-bit.
16-bit or 24-bit is typical (processed as 32-bit at most).
I2S often receives 32-bit-wide words, of which the upper 16 or 24 bits are used.

16-bit is sufficient for spoken dialogue.
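As a rough illustration of the two points above (the number of steps per bit depth, and taking the upper 16 bits of a 32-bit I2S word), here is a small sketch. The names levels and to_pcm16 are hypothetical helpers introduced for this example.

```cpp
#include <cstdint>

// Number of distinct amplitude steps for a given bit depth.
constexpr std::uint64_t levels(unsigned bits) {
  return std::uint64_t{1} << bits;
}
static_assert(levels(8) == 256, "8-bit: 256 steps");
static_assert(levels(16) == 65536, "16-bit: 65,536 steps");

// Keep the upper 16 bits of a 32-bit word received over I2S.
// Right-shifting a negative value is an arithmetic shift on mainstream
// compilers (and guaranteed since C++20), so the sign is preserved.
inline std::int16_t to_pcm16(std::int32_t raw) {
  return static_cast<std::int16_t>(raw >> 16);
}
```

For example, to_pcm16(0x12340000) yields 0x1234, and the lower 16 bits of the word are simply discarded.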

Buffer size (read unit): 512

BUFFER_SIZE = the size of samples[] (= the number of samples read at one time). If it is too small, samples arrive only a few at a time, which makes processing awkward. If it is too large, too much audio accumulates before it is processed, and the response lags behind what was spoken.

→Set to medium (512)
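To put a number on “medium,” the duration of audio covered by one read can be computed from the buffer size and the sampling rate. This is a minimal sketch. buffer_ms is a hypothetical helper name, not something from the project code.

```cpp
#include <cstdint>

// Milliseconds of audio covered by one read of `samples` samples
// at `sample_rate_hz` samples per second.
constexpr double buffer_ms(std::uint32_t samples, std::uint32_t sample_rate_hz) {
  return 1000.0 * samples / sample_rate_hz;
}

// The values chosen in this article: 512 samples at 16 kHz.
static_assert(buffer_ms(512, 16000) == 32.0, "512 samples at 16 kHz is 32 ms");
```

With these settings each read covers 32 ms of audio. A 64-sample buffer would cover only 4 ms per read, and a 4096-sample buffer would add 256 ms before the audio can be processed.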

DMA buffer setting (dma_buf_count / dma_buf_len)

The ESP32’s I2S peripheral uses DMA (Direct Memory Access) to handle the audio buffer efficiently: incoming audio is received and stored in memory automatically, without the CPU having to fetch each sample.
Microphone → DMA buffer → the ESP32 reads it in one go.

If the DMA buffers are full (dma_buf_count × dma_buf_len) and the ESP32 has not read them yet, newly arriving audio data overwrites unread data (= data is lost).

[INMP441]
   │
   ▼
+----------+   +----------+   +----------+
| Bucket 1 | → | Bucket 2 | → | Bucket 3 | → ... (dma_buf_count)
+----------+   +----------+   +----------+
     ↑
     └── The ESP32 goes through the "free buckets" in order
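The bucket picture can be expressed as a toy model. This is only a sketch for building intuition, assuming a fixed number of buckets and that data is lost whenever a bucket fills while none is free. It is not the ESP32 driver’s actual implementation, and BucketRing is a made-up name.

```cpp
#include <cstddef>

// Toy model of a ring of DMA buffers ("buckets").
// Illustrative only; not how the ESP32 I2S driver is implemented.
class BucketRing {
 public:
  explicit BucketRing(std::size_t bucket_count) : capacity_(bucket_count) {}

  // Called when the microphone side has finished filling one bucket.
  void on_bucket_filled() {
    if (filled_ == capacity_) {
      ++dropped_;  // no free bucket left: one bucket of audio is lost
    } else {
      ++filled_;
    }
  }

  // Called when the application reads one bucket.
  // Returns false if no filled bucket was waiting.
  bool read_bucket() {
    if (filled_ == 0) return false;
    --filled_;
    return true;
  }

  std::size_t filled() const { return filled_; }
  std::size_t dropped() const { return dropped_; }

 private:
  std::size_t capacity_;
  std::size_t filled_ = 0;
  std::size_t dropped_ = 0;
};
```

With 6 buckets, if 8 buckets’ worth of audio arrives before the application reads anything, 6 are waiting and 2 have been lost. Reading regularly keeps filled() below the bucket count and dropped() at zero.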

In that case it might seem safest to make both dma_buf_count and dma_buf_len as large as possible, but the data accumulates in RAM, and since the ESP32 has only 520 KB of RAM we do not want to waste memory. It is therefore important to find a good balance.

Of the 520 KB of RAM, at most about 300-350 KB can be used freely by applications (the rest goes to Wi-Fi and stacks). The audio buffers use contiguous memory, so if they are too large, memory runs out quickly.

Memory used = dma_buf_count × dma_buf_len × (size of sample)

Ruby

dma_buf_count = 6
dma_buf_len = 512
1 sample = 4 bytes (32-bit)
→ 6 × 512 × 4 = 12,288 bytes (about 12 KB)

# With 16 buffers × 2048
16 × 2048 × 4 = 131,072 bytes (128 KB)!
→ Risky! This may collide with other processing 💥

Therefore, dma_buf_len = 512 with dma_buf_count = 6 to 8 is a stable choice.
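The same arithmetic can be written as a small helper, which also shows how much audio the DMA buffers can hold before data starts to be lost. This is a sketch under the formula given above. DmaConfig, dma_bytes, and dma_headroom_ms are hypothetical names, not part of any ESP32 API.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical description of a DMA buffer setup, for illustration only.
struct DmaConfig {
  std::uint32_t buf_count;         // number of DMA buffers (dma_buf_count)
  std::uint32_t buf_len;           // samples per buffer (dma_buf_len)
  std::uint32_t bytes_per_sample;  // 4 for 32-bit samples
};

// Memory used = dma_buf_count x dma_buf_len x (size of sample).
constexpr std::size_t dma_bytes(DmaConfig c) {
  return static_cast<std::size_t>(c.buf_count) * c.buf_len * c.bytes_per_sample;
}

// Milliseconds of audio the buffers can hold at the given sampling rate.
constexpr double dma_headroom_ms(DmaConfig c, std::uint32_t sample_rate_hz) {
  return 1000.0 * c.buf_count * c.buf_len / sample_rate_hz;
}

static_assert(dma_bytes(DmaConfig{6, 512, 4}) == 12288, "about 12 KB");
static_assert(dma_bytes(DmaConfig{16, 2048, 4}) == 131072, "128 KB");
static_assert(dma_headroom_ms(DmaConfig{6, 512, 4}, 16000) == 192.0, "192 ms");
```

With 6 buffers of 512 samples at 16 kHz, the application has roughly 192 ms to read the data before unread audio is overwritten, at a cost of about 12 KB of RAM.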

C++

#include <Arduino.h>
#include <driver/i2s.h>

// Constants for the I2S configuration
#define I2S_MIC_SD_PIN 34    // Serial data
#define I2S_MIC_WS_PIN 25    // Word select (LR clock)
#define I2S_MIC_SCK_PIN 33   // Serial clock
#define I2S_PORT I2S_NUM_0
#define SAMPLE_RATE 16000
#define SAMPLE_BITS 32       // Receive 32-bit words; only the upper bits carry audio data
#define BUFFER_SIZE 512

// Buffer for the test
int32_t samples[BUFFER_SIZE];

void setup() {
  Serial.begin(115200);
  delay(1000);
  Serial.println("I2S microphone test started");

  // I2S configuration
  i2s_config_t i2s_config = {
    .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
    .sample_rate = SAMPLE_RATE,
    .bits_per_sample = (i2s_bits_per_sample_t)SAMPLE_BITS,
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,  // The INMP441 uses the left channel (L/R tied to GND)
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
    .dma_buf_count = 6,
    .dma_buf_len = BUFFER_SIZE,
    .use_apll = false,
    .tx_desc_auto_clear = false,
    .fixed_mclk = 0
  };

  i2s_pin_config_t pin_config = {
    .bck_io_num = I2S_MIC_SCK_PIN,
    .ws_io_num = I2S_MIC_WS_PIN,
    .data_out_num = I2S_PIN_NO_CHANGE,  // receive only, no data output pin
    .data_in_num = I2S_MIC_SD_PIN
  };

  // Install the I2S driver
  esp_err_t result = i2s_driver_install(I2S_PORT, &i2s_config, 0, NULL);
  if (result != ESP_OK) {
    Serial.printf("Failed to install the I2S driver: %d\n", result);
    return;
  }

  // Configure the I2S pins
  result = i2s_set_pin(I2S_PORT, &pin_config);
  if (result != ESP_OK) {
    Serial.printf("Failed to configure the I2S pins: %d\n", result);
    i2s_driver_uninstall(I2S_PORT);
    return;
  }

  Serial.println("I2S initialization complete");
}

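void loop() {
  // Clear the buffer
  memset(samples, 0, sizeof(samples));

  // Read data from I2S (blocks until the buffer is filled)
  size_t bytes_read = 0;
  esp_err_t result = i2s_read(I2S_PORT, samples, sizeof(samples), &bytes_read, portMAX_DELAY);

  if (result != ESP_OK) {
    Serial.printf("I2S read error: %d\n", result);
    delay(1000);
    return;
  }

  // Number of samples read
  int samples_read = bytes_read / sizeof(int32_t);
  Serial.printf("Samples read: %d\n", samples_read);

  // Sample statistics
  int32_t min_sample = INT32_MAX;
  int32_t max_sample = INT32_MIN;
  float avg_sample = 0;

  for (int i = 0; i < samples_read; i++) {
    // Keep the upper 16 bits of each 32-bit word
    int32_t sample = samples[i] >> 16;

    if (sample < min_sample) min_sample = sample;
    if (sample > max_sample) max_sample = sample;
    avg_sample += sample;

    // Print only the first 10 samples (for checking)
    if (i < 10) {
      Serial.printf("Sample[%d]: %d\n", i, (int)sample);
    }
  }

  // Peak-to-peak range; kept in 32 bits because it can exceed the int16_t range
  int32_t range = (samples_read > 0) ? (max_sample - min_sample) : 0;

  if (samples_read > 0) {
    avg_sample /= samples_read;
    Serial.printf("Stats: min=%d, max=%d, avg=%.2f, range=%d\n",
                  (int)min_sample, (int)max_sample, avg_sample, (int)range);
  }

  // Volume level display (a simple one): one '#' per 100 units of range
  int32_t level = range;
  Serial.print("Volume level: ");
  for (int i = 0; i < level / 100; i++) {
    Serial.print("#");
  }
  Serial.println();

  delay(500); // Wait 0.5 seconds
}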

The code above checks the following:

Check item	Purpose	If OK	If NG
Do the sample values change?	Confirms that wiring, I2S settings, and capture are working	✅ Wiring and settings are OK	❌ Wiring error, I2S setting error, all samples 0, etc.
Does the # bar respond to sound?	Visualizes the volume response	✅ Audio input from the microphone is detected	❌ The # bar does not grow when there is sound
Do min/max/avg change?	Checks volume and signal strength	✅ Amplitude changes are detected normally	❌ Wide range despite silence, or 0 all the time

Sample value: the instantaneous intensity (amplitude) of the sound obtained in one sampling

All 0 or all -1 → wiring error or I2S setting error

Do min/max/avg get smaller when the microphone is in silence?

When there is no sound, the microphone output is only fine noise near 0.
When there is sound, the amplitude increases and the min/max range widens.

Expected numerical change

Condition	Minimum	Maximum	Average	Range (max - min)
Long silence	-5 to +5	-3 to +4	≈ 0	5-10
Normal voice	-1000	+1200	≈ ± several hundred	Thousands
Loud voice	-2000	+3000	≈ ±1000	Thousands to tens of thousands

If the “value changes” before and after the sound is produced, it is evidence that the microphone is picking up sound normally.
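The range column in the table maps directly onto the # bar printed by the test code, which draws one # per 100 units of range. The following host-side sketch reproduces that calculation so it can be tried without hardware. peak_to_peak and level_bar are hypothetical helper names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Peak-to-peak range (max - min) of a block of 16-bit samples.
// Returns 0 for an empty block.
inline int peak_to_peak(const std::vector<std::int16_t>& samples) {
  if (samples.empty()) return 0;
  auto mm = std::minmax_element(samples.begin(), samples.end());
  return static_cast<int>(*mm.second) - static_cast<int>(*mm.first);
}

// One '#' per 100 units of range, the same scale as the level bar
// in the test code. Negative input is treated as 0.
inline std::string level_bar(int range) {
  return std::string(static_cast<std::size_t>(std::max(range, 0) / 100), '#');
}
```

A block whose minimum is -1760 and maximum is 485 has a range of 2245 and produces a bar of 22 marks, while a silent block with a range of 5 to 10 produces an empty bar.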

Specifying the build target file in platformio.ini

platformio.ini

INI

[env:microphone_test]
platform = espressif32 @ 6.7.0
board = esp32dev
framework = arduino
board_upload.flash_size = 4MB
build_flags =
    ${env.build_flags}
    -DCORE_DEBUG_LEVEL=5
build_src_filter = -<*> +<microphone.cpp>
extra_scripts =
board_build.partitions = default.csv

The build_src_filter parameter specifies which source files are built.
-<*> first excludes every file, and +<microphone.cpp> then includes only that file.

By default, PlatformIO builds all .cpp, .c, and .S files in the src folder, but with this setting only the test code is built.

Operation check

Build and upload

By specifying the environment when running the build command (in this case, microphone_test), only this file is built and uploaded.

ShellScript

pio run -e microphone_test -t upload

Serial monitor check

The -e microphone_test option specifies the environment to use for monitoring.

ShellScript

pio run -e microphone_test -t monitor

The serial monitor displays the following information:

Number of samples read
Value of the first 10 samples
Statistics (minimum, maximum, average, range)
Visual indication of volume level (indicated by # symbol)

ShellScript


Sample[9]: 0
Stats: min=-1760, max=485, avg=30.92, range=2245
Volume level: ######################
Samples read: 512
Sample[0]: 182
Sample[1]: 213
Sample[2]: 0
Sample[3]: 0
Sample[4]: 0
Sample[5]: 0
Sample[6]: 0
Sample[7]: 0
Sample[8]: 0
Sample[9]: 0
Stats: min=-1239, max=310, avg=26.49, range=1549
Volume level: ###############
Samples read: 512
Sample[0]: -952
Sample[1]: -1261
Sample[2]: -1376
Sample[3]: -1060
Sample[4]: -772
Sample[5]: -759
Sample[6]: -415
Sample[7]: -147
Sample[8]: -161
Sample[9]: -81
Stats: min=-1376, max=260, avg=35.48, range=1636
Volume level: ################

At first, every sample read as 0 even though the wiring looked correct. As it turned out, the development board itself was not functioning properly. After swapping the board, it worked.

Now that we have confirmed that the INMP441 works, the next step is to implement the part that sends audio to the server.