
UEC Int’l Mini-Conference No.53

Sammo: Incorporating Linear RNNs into Streaming Zipformer Encoder


Wen Shen TEO∗ and Yasuhiro MINAMI
Department of Computer and Network Engineering
The University of Electro-Communications, Tokyo, Japan

               Keywords: streaming speech encoding, linear recurrent neural network, mamba-2, lightweight ASR

1  Introduction

Modern Automatic Speech Recognition (ASR) systems use Conformer [1] models, which combine Convolutional Neural Networks (CNNs) and Multi-Head Self-Attention (MHSA) for effective local and global dependency modeling. However, real-time streaming ASR requires chunkwise processing, which limits MHSA's attention range. Conventional methods use key-value caching to store past context, increasing memory consumption.

This work introduces Sammo, a novel streaming ASR encoder that eliminates large key-value caches. Inspired by Conformers, Sammo integrates a linear Recurrent Neural Network (RNN), Mamba-2 [2], alongside CNNs and MHSA. Sammo leverages (1) CNNs for local dependencies, (2) MHSA for mid-range dependencies within chunks, and (3) Mamba-2 for long-range dependencies across historical chunks.

Evaluated by modifying Zipformer [3], an open-source streaming Conformer, Sammo demonstrates superior accuracy to Zipformer with significantly reduced memory on the Corpus of Spontaneous Japanese (CSJ) [4]. This work showcases the effectiveness of integrating linear RNNs such as Mamba-2 for memory-efficient real-time ASR.

2  Method

We introduce two modifications to incorporate Mamba-2 within the proposed Sammo concept.

Substitution of NLA with Mamba-2: To maintain comparability across model sizes, we substituted the NLA module in Zipformer with Mamba-2.

Elimination of the MHSA key-value cache: We removed all past (left) context from the MHSA modules within the Zipformer architecture, preventing inter-chunk interactions of input frames through MHSA.

3  Results and Discussion

[Figure 1: CSJ testset CERs across chunk widths.]

Fig. 1 shows that Sammo consistently outperforms Zipformer at a 1.28 s chunk width, with performance converging at wider chunks. This suggests that Mamba-2 in Sammo mitigates chunk-boundary discontinuities better than MHSA's left-context augmentation. Furthermore, Sammo achieves these results with a slightly smaller model (71.68M vs. 71.78M parameters), demonstrating its efficiency in terms of model complexity.

4  Conclusions

This work introduces Sammo, a streaming ASR encoder integrating Mamba-2, a linear RNN, for efficient long-range dependency modeling. Sammo replaces MHSA key-value caches with Mamba-2's compact hidden state, reducing memory while maintaining performance. Experiments on CSJ show that Sammo outperforms Zipformer at comparable cache sizes and remains competitive at larger Zipformer caches.

References

[1] A. Gulati et al., "Conformer: Convolution-augmented transformer for speech recognition," Interspeech 2020, pp. 5036–5040, 2020.
[2] T. Dao et al., "Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality," 41st ICML, 2024.
[3] Z. Yao et al., "Zipformer: A faster and better encoder for automatic speech recognition," The Twelfth International Conference on Learning Representations, 2023.
[4] K. Maekawa et al., "Spontaneous speech corpus of Japanese," LREC 2000.

∗ The author is supported by the AiQuSci MEXT Scholarship.
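As an illustration of the second modification, removing all left context restricts MHSA to a chunk-local (block-diagonal) attention pattern. The sketch below builds such a mask with NumPy; the frame count and chunk size are illustrative assumptions and make no claim to match Zipformer's actual implementation.

```python
import numpy as np

T, chunk = 8, 4  # illustrative: 8 frames, chunks of 4 (not Sammo's real sizes)

# Frame i may attend to frame j only when both fall in the same chunk,
# i.e. no cached left context from earlier chunks is consulted.
idx = np.arange(T)
mask = (idx[:, None] // chunk) == (idx[None, :] // chunk)  # True = allowed

print(mask.astype(int))  # two 4x4 blocks of ones on the diagonal
```

With the mask in this form, any frames outside the current chunk are simply invisible to MHSA, so nothing needs to be cached between chunks.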
   31   32   33   34   35   36   37   38   39   40   41
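The memory argument — a fixed-size recurrent state versus a key-value cache that grows with history — can be sketched with a generic linear recurrence. All dimensions below are illustrative assumptions, and the update `h = A @ h + B @ x` is only the simplest linear-RNN form; the actual Mamba-2 update is input-dependent and more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # feature dimension (illustrative)
n_state = 16   # linear-RNN state size (illustrative)
chunk_len = 4  # frames per streaming chunk

# (a) MHSA-style key-value cache: must retain every past frame.
kv_cache = np.zeros((0, d))

# (b) Linear-RNN-style recurrence: one fixed-size hidden state.
A = np.eye(n_state) * 0.9
B = rng.standard_normal((n_state, d)) * 0.1
h = np.zeros(n_state)

for _ in range(10):  # ten streaming chunks
    x = rng.standard_normal((chunk_len, d))
    kv_cache = np.concatenate([kv_cache, x])  # cache grows with every chunk
    for frame in x:
        h = A @ h + B @ frame                 # state size never changes

print(kv_cache.shape)  # (40, 8): grows linearly with audio length
print(h.shape)         # (16,): constant regardless of history
```

The contrast is the point: the cache in (a) scales with the amount of retained context, while the state in (b) stays constant, which is what lets Sammo drop the MHSA key-value cache.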