Page 66 - 2024S
UEC Int’l Mini-Conference No.52
Beyond Word Count: Exploring Approximated Target Lengths for
CIF-RNNT
Wen Shen TEO, Yasuhiro MINAMI
∗
Department of Computer and Network Engineering
The University of Electro-Communications
Tokyo, Japan
Keywords: streaming speech recognition, self-information, decoding speed, word segmentation
Abstract
Our previous work proposed the CIF-RNNT architecture, a combination of Continuous Integrate-and-Fire
(CIF) and RNN-Transducers (RNN-T) that compresses speech into units equivalent to linguistic
words to achieve efficient decoding. This work extends that research by investigating the
impact of different target length definitions, approximated from self-information and token count. Our
results on English and Japanese datasets show that approximated target length types based on
self-information outperform simpler approaches, and CIF-RNNT models even surpass topline models on
the Japanese dataset at smaller chunk sizes. Furthermore, our comparisons demonstrate an inherent
ability of CIF-RNNT to produce output tokens in groups of words, regardless of the target length type.
These results showcase the potential of the CIF-RNNT architecture for efficient and accurate speech
recognition.
∗ The author is supported by (AiQuSci) MEXT Scholarship.