Difference between PDM microphones and I2S: Why does PDM require a decimation filter?

Introduction.
PDM: Pulse density representation of speech signals
I2S microphone: Transmits audio data in PCM format
Specific examples of I2S and PDM microphones
1. I2S microphone: 44.1 kHz / 16-bit quantization example
2. PDM microphone: 3 MHz / 1 bit example
3 signal lines required for I2S protocol: BCLK/SCK, WS, and DATA

Introduction.

While implementing two-way voice communication, we learned that there are two kinds of digital microphone used for voice input, the PDM microphone and the I2S microphone. This article summarizes the differences between them.

PDM: Pulse density representation of speech signals

PDM (Pulse Density Modulation) microphones are digital microphones that express audio signals in terms of pulse density.

Simple hardware configuration: PDM is often used for low-cost microphones with simple wiring.
Processing complexity: A PDM signal cannot be used as audio as-is. It must first pass through a decimation filter, which low-pass filters the bitstream to remove high-frequency components and lowers the sample rate to a normal audio rate. This step has to be done on the receiving side, often in software, which increases the processing burden.

The audio signal is output directly as a “1-bit pulse train” in which the density of 1s follows the instantaneous level of the waveform: the higher the level, the denser the 1s, and the lower the level, the denser the 0s. Silence appears as a roughly even mix of 1s and 0s. Because the bits toggle at a very high rate, the stream also contains high-frequency components far above the human audible range (roughly 20 Hz to 20 kHz).

If they are left in, these high-frequency components show up as noise and distortion when the audio is played through speakers, so they must be removed to recover a smooth audio waveform that humans can listen to.
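To make the idea of pulse density concrete, here is a minimal sketch of a first-order sigma-delta modulator, one common way to turn amplitude values into a PDM bitstream. The function names `pdm_encode` and `ones_density` are made up for this illustration, and the modulator inside an actual microphone may well be different (typically higher order).

```python
def pdm_encode(samples):
    """First-order sigma-delta modulator (illustrative assumption, not a
    specific microphone's design): floats in [-1, 1] -> list of 0/1 bits."""
    bits = []
    integrator = 0.0
    feedback = 0.0  # value represented by the previous output bit (+1 or -1)
    for x in samples:
        integrator += x - feedback          # accumulate the error so far
        bit = 1 if integrator >= 0 else 0   # 1-bit quantizer
        feedback = 1.0 if bit else -1.0
        bits.append(bit)
    return bits


def ones_density(bits):
    """Fraction of 1s in a bitstream."""
    return sum(bits) / len(bits)


# Constant input levels: the density of 1s follows the level.
for level in (-0.5, 0.0, 0.5):
    stream = pdm_encode([level] * 1000)
    print(level, "".join(map(str, stream[:16])), ones_density(stream))
```

With a constant input, the fraction of 1s settles near (level + 1) / 2, so a level of 0 gives an alternating stream and a positive level gives mostly 1s.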

https://jp.sameskydevices.com/blog/pdm-vs-i2s-comparing-digital-interfaces-in-mems-microphones

I2S microphone: Transmits audio data in PCM format

I2S (Inter-IC Sound) microphones are digital microphones that transmit audio data using the I2S protocol. I2S is a standard interface for audio and is widely used in audio equipment.

Simple software processing: I2S microphones output audio data already in PCM (pulse code modulation) format because the conversion is done inside the microphone, so the receiving side needs no special filter processing and can handle the data as-is.
High sound quality: I2S is less susceptible to noise and tends to have better sound quality than PDM.
Example of external connection: I2S microphones such as INMP441 can be easily connected to ESP32.

I2S transmits voice data in PCM (Pulse Code Modulation) format; PCM is a direct conversion of voice waveforms into digital data. In this conversion process, the analog signal is sampled at regular intervals, quantized, and recorded as a digital value.

Analog audio sampling

Convert analog audio to digital data at a constant sampling rate (e.g., 44.1 kHz)

Quantization (bit depth)

Representation of audio amplitude as a digital value such as 16-bit or 24-bit

Transmission with I2S protocol

The generated digital data is sent using a serial communication method in which the left and right channels are transmitted alternately.
Three lines, the clock signal (SCK), the word select signal (WS), and the data signal (DATA), are used so that the audio data is sent with the correct timing. The role of each of these three signal lines is described later in this article.
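The sampling, quantization, and left/right interleaving steps above can be sketched roughly as follows. This is only an illustration: the sine test tone, the signed 16-bit range, and the function names `sample_and_quantize` and `interleave` are assumptions made for the example, not something taken from a particular microphone.

```python
import math

SAMPLE_RATE = 44_100  # Hz, the example sampling rate from the text
BIT_DEPTH = 16        # bits per sample


def sample_and_quantize(freq_hz, n_samples, sample_rate=SAMPLE_RATE, bits=BIT_DEPTH):
    """Sample a full-scale sine tone and quantize it to signed integers."""
    full_scale = 2 ** (bits - 1) - 1  # 32767 for signed 16-bit
    pcm = []
    for n in range(n_samples):
        t = n / sample_rate                                # sampling instant
        amplitude = math.sin(2 * math.pi * freq_hz * t)    # value in [-1, 1]
        pcm.append(round(amplitude * full_scale))          # quantization
    return pcm


def interleave(left, right):
    """Alternate left and right samples: L, R, L, R, ..."""
    frames = []
    for l, r in zip(left, right):
        frames.extend([l, r])
    return frames


left = sample_and_quantize(11_025, 4)   # a tone at one quarter of the sample rate
right = sample_and_quantize(11_025, 4)
print(left)
print(interleave(left, right))
```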

Specific examples of I2S and PDM microphones

Analog signal (continuous waveform)

Suppose that the analog audio waveform is as follows.

Analog waveform:  ↗︎   ↘︎   ↗︎   ↘︎   ↗︎
Amplitude (illustrative): 0.2  0.5  0.8  0.5  0.2

I2S microphone: 44.1 kHz / 16-bit quantization example

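Quantization range: 0 to 65,535 (unsigned, to keep the example simple; actual 16-bit PCM is usually signed, from -32,768 to 32,767)
Amplitude of 0.2 → 13,107 (65,535 × 0.2)
Amplitude of 0.5 → 32,767 (65,535 × 0.5 = 32,767.5, fraction dropped)
Amplitude of 0.8 → 52,428 (65,535 × 0.8)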

Data after digitization (16 bits):

13,107  32,767  52,428  32,767  13,107
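The numbers above can be reproduced with a few lines of Python. This sketch follows the simplified unsigned mapping used in this example; the helper name `quantize_unsigned16` is hypothetical.

```python
def quantize_unsigned16(amplitude):
    """Map a normalized amplitude in [0, 1] to 0..65535, dropping the fraction."""
    return int(65535 * amplitude)


waveform = [0.2, 0.5, 0.8, 0.5, 0.2]
codes = [quantize_unsigned16(a) for a in waveform]
print(codes)  # [13107, 32767, 52428, 32767, 13107]

# Each value is what actually gets stored or transmitted: a 16-bit word.
for code in codes:
    print(format(code, "016b"))
```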

After quantization, the audio data is a “series of discrete digital values”. This is the form in which audio is stored in music data and audio files.

Because the output can be handled as PCM data as-is, the receiving side does not need to perform decimation or filtering, and the data can be converted back to an analog signal for playback through speakers as-is.

PCM represents speech in two dimensions: “amplitude” and “time”.

PDM microphone: 3 MHz / 1 bit example

PDM signal example: 110101111011111011010

The higher the signal level, the denser the 1s; the lower the level, the denser the 0s. PDM expresses sound through the “density of 1s and 0s” rather than through quantized values. Because each sample carries only 1 bit, the sampling frequency must be extremely high in order to represent sound at high resolution. CD-quality PCM is sampled at 44.1 kHz, whereas PDM runs at high sampling rates such as 2 to 3 MHz.
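As a quick check of these figures, the following sketch counts the density of 1s in the example stream and compares the data rates of the two formats. The constant names are chosen for this illustration, and the rates are simply the example values from the text.

```python
pdm_bits = "110101111011111011010"  # example stream from the text

ones = pdm_bits.count("1")
density = ones / len(pdm_bits)
print(ones, len(pdm_bits), round(density, 3))  # 15 21 0.714

PDM_CLOCK_HZ = 3_000_000  # 1-bit samples per second
PCM_RATE_HZ = 44_100      # multi-bit samples per second
PCM_BITS = 16

oversampling_ratio = PDM_CLOCK_HZ / PCM_RATE_HZ  # PDM bits per PCM sample period
pdm_bitrate = PDM_CLOCK_HZ * 1                   # bits per second, one channel
pcm_bitrate = PCM_RATE_HZ * PCM_BITS             # bits per second, one channel
print(round(oversampling_ratio, 1), pdm_bitrate, pcm_bitrate)
```

In this example roughly 68 PDM bits arrive in the time of one 44.1 kHz PCM sample, which is what allows a density to be measured at all.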

In other words, PDM is a one-dimensional method. A single sample cannot represent the instantaneous amplitude, so the magnitude of the sound is represented only indirectly, through density along the time axis, and the amplitude becomes known only after accumulating a large number of bits over time. PDM can represent sound only in the single dimension of “time”.

Because of the high sampling rate, the 1 and 0 pulses switch at high speed, and this high-speed switching is itself a high-frequency component that is inaudible to the human ear. The signal therefore contains both the audio components (low frequency) and unwanted high-frequency components at the same time.

The high-frequency components must be removed afterwards using a decimation filter, that is, a low-pass filter followed by a reduction of the sample rate.
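The simplest possible decimation filter averages each block of bits and emits one PCM value per block. The sketch below does exactly that (a moving-average low-pass filter combined with downsampling). It is a rough illustration only: the function name `decimate_pdm` and the factor of 64 are assumptions, and practical implementations usually use better filters, for example CIC stages followed by FIR filters.

```python
def decimate_pdm(bits, factor=64):
    """Average each block of `factor` PDM bits into one PCM sample in [-1, 1].

    Averaging acts as a crude low-pass filter; keeping one value per block
    reduces the sample rate by `factor`.
    """
    pcm = []
    for start in range(0, len(bits) - factor + 1, factor):
        block = bits[start:start + factor]
        density = sum(block) / factor       # fraction of 1s in this block
        pcm.append(2.0 * density - 1.0)     # map density 0..1 to amplitude -1..+1
    return pcm


# 256 bits at 75% density followed by 256 bits at 50% density.
stream = [1, 1, 1, 0] * 64 + [1, 0] * 128
print(decimate_pdm(stream))  # [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
```

With a 3 MHz PDM clock and a factor of 64, the output rate would be about 46.9 kHz, which is in the range of ordinary audio sample rates.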

3 signal lines required for I2S protocol: BCLK/SCK, WS, and DATA

BCLK / SCK: The clock signal that determines the timing of data transfer between a master (e.g. ESP32) and a slave (e.g. INMP441 microphone). Only the name differs depending on the manufacturer; BCLK and SCK are the same signal.
WS: The word select signal, which switches between the left and right channels. Stereo audio carries different audio data for the left and right earphones or speakers, and that difference is what creates a sense of depth and spaciousness. Also referred to as L/R (Left/Right).
DATA: The line on which the actual audio data is sent.

One bit is transferred per BCLK cycle, and the receiver reads it on the rising edge of BCLK. Without BCLK, the timing at which the INMP441 outputs data is undefined and the ESP32 cannot receive audio data correctly.

The left channel (L1, L2, L3…) is sent when WS is “0”, and the right channel (R1, R2, R3…) is sent when WS is “1”.

BCLK:  |‾|_|‾|_|‾|_|‾|_|‾|_|‾|_|‾|_|
WS:    |________|‾‾‾‾‾‾‾‾|________|‾‾‾‾‾‾‾‾|
DATA:  L1 L2 L3  R1 R2 R3  L4 L5 L6  R4 R5 R6
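The interplay of the three lines can be modeled in a few lines of Python, with one list entry per BCLK cycle. This is a simplified sketch: `i2s_serialize` and `i2s_deserialize` are hypothetical helpers, the word length of 16 bits is an assumption, and the model ignores the one-clock delay that the I2S specification places between a WS change and the first data bit.

```python
def i2s_serialize(left, right, bits=16):
    """Return (ws, data): one WS level and one DATA bit per BCLK cycle, MSB first."""
    ws, data = [], []
    for l, r in zip(left, right):
        for channel, word in ((0, l), (1, r)):   # WS=0 -> left, WS=1 -> right
            word &= (1 << bits) - 1              # two's complement bit pattern
            for i in reversed(range(bits)):
                ws.append(channel)
                data.append((word >> i) & 1)
    return ws, data


def i2s_deserialize(ws, data, bits=16):
    """Rebuild signed left/right samples from the per-cycle WS and DATA values."""
    left, right = [], []
    for start in range(0, len(data), bits):
        word = 0
        for bit in data[start:start + bits]:
            word = (word << 1) | bit             # shift in one bit per BCLK cycle
        if word >= 1 << (bits - 1):              # restore the sign
            word -= 1 << bits
        (left if ws[start] == 0 else right).append(word)
    return left, right


ws, data = i2s_serialize([1000, -2], [-1000, 3])
print(len(data))                  # 64 BCLK cycles for two stereo frames
print(i2s_deserialize(ws, data))  # ([1000, -2], [-1000, 3])
```

The receiver can only split the bitstream into words because it shares BCLK and WS with the sender, which is why all three lines are required.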