
UEC Int’l Mini-Conference No.54

Japanese Sign Language Translation from YouTube Videos by Transformer Model Using Text Pre-processing

Rattapoom KEDTIWERASAK∗ and Hiroki TAKAHASHI
Department of Informatics
The University of Electro-Communications, Tokyo, Japan

               Keywords: Sign Language Translation, Transformer, Japanese Sign Language, Text Pre-processing
1  Introduction
Sign Language Translation (SLT) is the task of automatically translating sign language videos into spoken language sentences. Recent SLT systems process the visual-gestural input with Convolutional Neural Networks (CNNs) for feature extraction and Transformer-based architectures for end-to-end translation [1]. By breaking down communication barriers, SLT enables accessible interpretation services in areas such as healthcare and education.
2  Related Work
Japanese Sign Language (JSL) has its own grammar and vocabulary, which differ from those of spoken Japanese and of other sign languages. JSL translation is challenging because annotated datasets are scarce compared with widely studied sign languages, as shown in Table 1. In addition, data collected from social media platforms such as YouTube requires extensive pre-processing of both videos and captions. Inspired by the work of Camgoz et al. [1], our approach adopts the Transformer architecture for JSL translation and adds text pre-processing techniques.
Table 1: Number of videos and hours per sign language in the YouTube-SL-25 dataset [2].

    Sign Language   #videos   #hours
    American         16,724    1,394
    Indian            3,023      209
    Polish            1,698      137
    German            1,024      108
    Brazilian           846      101
    British           1,026       74
    Hungarian         1,687       70
    Japanese          1,075       62
3  Method
To address the challenges of JSL translation, we propose an end-to-end Transformer-based architecture with text pre-processing. In data preparation, the YouTube JSL dataset is processed by separating the videos from their corresponding caption texts using the caption time boundaries. The dataset is split into training, validation, and test sets of 32,191, 4,023, and 4,025 sentences, respectively.
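As a rough sketch of this segmentation step (the Caption record and the frame-rate handling below are illustrative assumptions, not the paper's exact pipeline):

    from dataclasses import dataclass

    @dataclass
    class Caption:
        start: float  # caption start time in seconds
        end: float    # caption end time in seconds
        text: str     # spoken-Japanese caption text

    def segment_by_captions(captions, fps=30.0):
        """Pair each caption with the video frame range it spans."""
        pairs = []
        for cap in captions:
            first = int(cap.start * fps)   # first frame of the signing clip
            last = int(cap.end * fps)      # last frame (exclusive)
            pairs.append(((first, last), cap.text))
        return pairs

    # A caption from 1.2 s to 3.5 s becomes frames 36..105 at 30 fps.
    pairs = segment_by_captions([Caption(1.2, 3.5, "おはようございます")])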
The videos are resized to 224×224 pixels and processed through EfficientNetB0 [3], which serves as the spatial feature extractor.
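A minimal sketch of this feature-extraction step, assuming the torchvision implementation of EfficientNetB0 with ImageNet weights; the paper does not state the exact framework or weights used:

    # Per-frame spatial features from EfficientNetB0 (a torchvision-based
    # sketch; the actual training setup in the paper may differ).
    import torch
    from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights

    backbone = efficientnet_b0(weights=EfficientNet_B0_Weights.IMAGENET1K_V1)
    backbone.classifier = torch.nn.Identity()  # keep the 1280-d pooled features
    backbone.eval()

    frames = torch.randn(16, 3, 224, 224)      # a clip of 16 resized RGB frames
    with torch.no_grad():
        feats = backbone(frames)               # shape: (16, 1280), one vector per frame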
The captions are normalized with Neologdn, which applies Japanese regular-expression-based normalization rules, and tokenized with the Sudachi Japanese morphological analyzer [4].
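For concreteness, this step might look like the following sketch using the neologdn and SudachiPy packages, with Mode-C, Sudachi's coarsest split, matching the best setting in Table 2; the surrounding pipeline details are assumptions:

    # Caption pre-processing sketch: neologdn normalization followed by
    # Sudachi Mode-C tokenization (requires a Sudachi dictionary such as
    # sudachidict_core to be installed).
    import neologdn
    from sudachipy import dictionary, tokenizer

    sudachi = dictionary.Dictionary().create()
    MODE_C = tokenizer.Tokenizer.SplitMode.C   # coarsest segmentation mode

    def preprocess_caption(text):
        text = neologdn.normalize(text)        # unify widths, repeated chars, etc.
        return [m.surface() for m in sudachi.tokenize(text, MODE_C)]

    print(preprocess_caption("こんにちは、世界！"))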
We employ a Transformer model for training, which takes the sequence of features as input and generates the translated output.
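A minimal PyTorch encoder-decoder sketch of such a model, assuming 1280-dimensional visual features on the encoder side and subword token IDs on the decoder side; all hyperparameters here are illustrative, not the paper's:

    # Feature-to-text Transformer sketch (illustrative dimensions).
    import torch
    import torch.nn as nn

    class SLTTransformer(nn.Module):
        def __init__(self, feat_dim=1280, vocab_size=8000, d_model=512):
            super().__init__()
            self.proj = nn.Linear(feat_dim, d_model)        # project visual features
            self.embed = nn.Embedding(vocab_size, d_model)  # target token embeddings
            self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
            self.out = nn.Linear(d_model, vocab_size)       # per-token vocabulary logits

        def forward(self, feats, tgt_tokens):
            # feats: (batch, frames, feat_dim); tgt_tokens: (batch, tgt_len)
            mask = self.transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
            dec = self.transformer(self.proj(feats), self.embed(tgt_tokens), tgt_mask=mask)
            return self.out(dec)                            # (batch, tgt_len, vocab_size)

    model = SLTTransformer()
    logits = model(torch.randn(2, 16, 1280), torch.randint(0, 8000, (2, 10)))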
To measure translation performance, we use the BiLingual Evaluation Understudy (BLEU) score, a standard metric for machine translation.
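BLEU can be computed with an off-the-shelf library; the snippet below uses sacreBLEU as one possibility, since the paper does not state which implementation was used:

    # Corpus-level BLEU sketch with sacreBLEU; the toy strings and the
    # evaluation setup are assumptions, not the paper's data.
    from sacrebleu.metrics import BLEU

    bleu = BLEU()  # BLEU-4 by default; tokenize="ja-mecab" is an option
                   # for raw Japanese if the [ja] extra is installed
    hyps = ["今日 は 天気 が いい"]        # system outputs (space-separated tokens)
    refs = [["今日 は いい 天気 です"]]     # one reference stream, parallel to hyps
    result = bleu.corpus_score(hyps, refs)
    print(result.score)                     # the BLEU-4 value reported as B-4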
4  Results
Table 2 reports BLEU (B) scores for combinations of the text pre-processing techniques. The best performance was achieved by combining Text Normalization (TN), whitespace handling (w), and Sudachi Mode-C tokenization, reaching a BLEU-4 score of 4.54.

Table 2: BLEU scores with text pre-processing.

    TN   w    B-1 (↑)   B-2 (↑)   B-3 (↑)   B-4 (↑)
              17.63      8.21      4.98      3.43
    ✓         17.11      7.99      4.86      3.36
         ✓    15.45      7.59      4.77      3.42
    ✓    ✓    17.50      9.15      6.09      4.54

5  Discussion
The results emphasize the impact of effective text pre-processing on JSL translation performance. YouTube captions require normalization to ensure quality input. Combining text normalization, whitespace handling, and Sudachi Mode-C tokenization yielded the best BLEU score.

6  Conclusions
This work proposed a Transformer-based model for JSL translation from YouTube videos, enhanced by specialized text pre-processing. EfficientNetB0 was used to extract spatial features, and Sudachi tokenization improved caption handling. The BLEU score improvements confirm the value of pre-processing in low-resource sign language settings. Future efforts will focus on expanding the dataset and improving temporal alignment between video and text.

References
[1] N. C. Camgoz, O. Koller, S. Hadfield, and R. Bowden, “Sign language transformers: Joint end-to-end sign language recognition and translation,” 2020.
[2] S. Gueuwou, X. Du, G. Shakhnarovich, K. Livescu, and A. H. Liu, “SHuBERT: Self-supervised sign language representation learning via multi-stream cluster prediction,” 2025.
[3] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” 2020.
[4] K. Takaoka, S. Hisamoto, N. Kawahara, M. Sakamoto, Y. Uchida, and Y. Matsumoto, “Sudachi: A Japanese tokenizer for business,” 2018.
                 ∗ The author is supported by SESS MEXT Scholarship.