
UEC Int’l Mini-Conference No.53

Sammo: Incorporating Linear RNNs into Streaming Zipformer Encoder


Wen Shen TEO∗ and Yasuhiro MINAMI
Department of Computer and Network Engineering
The University of Electro-Communications, Tokyo, Japan

               Keywords: streaming speech encoding, linear recurrent neural network, mamba-2, lightweight ASR

1  Introduction

Modern Automatic Speech Recognition (ASR) systems use Conformer [1] models, which combine Convolutional Neural Networks (CNNs) and Multi-Head Self-Attention (MHSA) for effective local and global dependency modeling. However, real-time streaming ASR requires chunkwise processing, which limits MHSA's attention range. Conventional methods use key-value caching to store past context, increasing memory consumption.

This work introduces Sammo, a novel streaming ASR encoder that eliminates large key-value caches. Inspired by Conformers, Sammo integrates a linear Recurrent Neural Network (RNN), Mamba-2 [2], alongside CNNs and MHSA. Sammo leverages (1) CNNs for local dependencies, (2) MHSA for mid-range dependencies within chunks, and (3) Mamba-2 for long-range dependencies across historical chunks.

Evaluated by modifying Zipformer [3], an open-source streaming Conformer, Sammo demonstrates superior accuracy to Zipformer with significantly reduced memory on the Corpus of Spontaneous Japanese (CSJ) [4]. This work showcases the effectiveness of integrating linear RNNs such as Mamba-2 for memory-efficient real-time ASR.

2  Method

We introduce two modifications to incorporate Mamba-2 within the proposed Sammo concept.

Substitution of NLA with Mamba-2: To maintain comparability across model sizes, we substituted the NLA module in Zipformer with Mamba-2.

Elimination of the MHSA key-value cache: We removed all past (left) context from the MHSA modules within the Zipformer architecture, preventing inter-chunk interactions of input frames through MHSA.

3  Results and Discussion

[Figure 1: CSJ testset CERs across chunk widths.]

Fig. 1 shows that Sammo consistently outperforms Zipformer at a 1.28 s chunk width, with performance converging at wider chunks. This suggests that Mamba-2 in Sammo mitigates chunk-boundary discontinuities better than MHSA's left-context augmentation. Furthermore, Sammo achieves these results with a slightly smaller model (71.68M vs. 71.78M parameters), demonstrating its efficiency in terms of model complexity.

4  Conclusions

This work introduces Sammo, a streaming ASR encoder integrating Mamba-2, a linear RNN, for efficient long-range dependency modeling. Sammo replaces MHSA key-value caches with Mamba-2's compact hidden state, reducing memory while maintaining performance. Experiments on CSJ show that Sammo outperforms Zipformer at comparable cache sizes and remains competitive at larger Zipformer caches.

References

[1] A. Gulati et al., "Conformer: Convolution-augmented transformer for speech recognition," Interspeech 2020, pp. 5036–5040, 2020.
[2] T. Dao et al., "Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality," 41st ICML, 2024.
[3] Z. Yao et al., "Zipformer: A faster and better encoder for automatic speech recognition," The Twelfth International Conference on Learning Representations, 2023.
[4] K. Maekawa et al., "Spontaneous speech corpus of Japanese," LREC 2000.

∗ The author is supported by the AiQuSci MEXT Scholarship.
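As an illustration of the second modification, removing all left context restricts MHSA to a chunk-local (block-diagonal) attention pattern. The sketch below builds such a mask with NumPy; the frame count and chunk size are illustrative assumptions and make no claim to match Zipformer's actual implementation.

```python
import numpy as np

T, chunk = 8, 4  # illustrative: 8 frames, chunks of 4 (not Sammo's real sizes)

# Frame i may attend to frame j only when both fall in the same chunk,
# i.e. no cached left context from earlier chunks is consulted.
idx = np.arange(T)
mask = (idx[:, None] // chunk) == (idx[None, :] // chunk)  # True = allowed

print(mask.astype(int))  # two 4x4 blocks of ones on the diagonal
```

With the mask in this form, any frames outside the current chunk are simply invisible to MHSA, so nothing needs to be cached between chunks.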
   31   32   33   34   35   36   37   38   39   40   41
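The memory argument — a fixed-size recurrent state versus a key-value cache that grows with history — can be sketched with a generic linear recurrence. All dimensions below are illustrative assumptions, and the update `h = A @ h + B @ x` is only the simplest linear-RNN form; the actual Mamba-2 update is input-dependent and more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # feature dimension (illustrative)
n_state = 16   # linear-RNN state size (illustrative)
chunk_len = 4  # frames per streaming chunk

# (a) MHSA-style key-value cache: must retain every past frame.
kv_cache = np.zeros((0, d))

# (b) Linear-RNN-style recurrence: one fixed-size hidden state.
A = np.eye(n_state) * 0.9
B = rng.standard_normal((n_state, d)) * 0.1
h = np.zeros(n_state)

for _ in range(10):  # ten streaming chunks
    x = rng.standard_normal((chunk_len, d))
    kv_cache = np.concatenate([kv_cache, x])  # cache grows with every chunk
    for frame in x:
        h = A @ h + B @ frame                 # state size never changes

print(kv_cache.shape)  # (40, 8): grows linearly with audio length
print(h.shape)         # (16,): constant regardless of history
```

The contrast is the point: the cache in (a) scales with the amount of retained context, while the state in (b) stays constant, which is what lets Sammo drop the MHSA key-value cache.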