Japanese Sign Language Translation from YouTube Videos by
Transformer Model Using Text Pre-processing
Rattapoom KEDTIWERASAK∗ and Hiroki TAKAHASHI
Department of Informatics
The University of Electro-Communications, Tokyo, Japan
Keywords: Sign Language Translation, Transformer, Japanese Sign Language, Text Pre-processing
1 Introduction
Sign Language Translation (SLT) is the task of automatically translating sign language videos into spoken language sentences. Recent SLT systems process the visual-gestural input with Convolutional Neural Networks (CNNs) for feature extraction and Transformer-based architectures for end-to-end translation [1]. By breaking down communication barriers, SLT enables accessible interpretation services in areas such as healthcare and education.
2 Related Work
Japanese Sign Language (JSL) has its own grammar and vocabulary, which differ from those of spoken Japanese and of other sign languages. JSL translation is challenging because annotated datasets are scarce compared to widely studied sign languages, as shown in Table 1. In addition, data from social media platforms such as YouTube requires extensive pre-processing of both videos and captions. Inspired by the work of Camgoz et al. [1], our approach adopts the Transformer architecture for JSL translation and combines it with text pre-processing techniques.
Table 1: Number of videos and hours per sign language in the YouTube-SL-25 dataset [2].

Sign Language   #videos   #hours
American         16,724    1,394
Indian            3,023      209
Polish            1,698      137
German            1,024      108
Brazilian           846      101
British           1,026       74
Hungarian         1,687       70
Japanese          1,075       62
3 Method
To address the challenges of JSL translation, we propose an end-to-end Transformer-based architecture combined with text pre-processing. In data preparation, the YouTube JSL dataset is processed by separating the videos and their corresponding caption texts at the caption time boundaries. The dataset is split into training, validation, and testing sets of 32,191, 4,023, and 4,025 sentences, respectively. The videos are resized to 224×224 pixels and passed through EfficientNetB0 [3], which serves as the spatial feature extractor.
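As a minimal sketch of this step, the following code extracts per-frame spatial features with torchvision's EfficientNetB0. The tensor shapes and the use of the pooled 1280-dimensional features before the classifier are our assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: per-frame spatial features from EfficientNetB0 (torchvision).
# Assumption: frames are already decoded as 224x224 RGB tensors.
import torch
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights

weights = EfficientNet_B0_Weights.DEFAULT
backbone = efficientnet_b0(weights=weights)
backbone.classifier = torch.nn.Identity()  # drop the ImageNet head, keep pooled features
backbone.eval()

preprocess = weights.transforms()  # resizing/normalization expected by the weights

@torch.no_grad()
def extract_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W) video frames -> (T, 1280) feature sequence."""
    return backbone(preprocess(frames))

clip = torch.rand(16, 3, 224, 224)  # a dummy 16-frame clip
features = extract_features(clip)   # shape: (16, 1280)
```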
The captions are normalized using Neologdn, which applies Japanese regular-expression-based normalization rules, and tokenized with the Sudachi Japanese morphological analyzer [4].
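A minimal sketch of this caption pipeline with the neologdn and SudachiPy packages follows; the paper names the tools but not their exact configuration, so the dictionary and split mode used here are assumptions (Mode C produces the coarsest units).

```python
# Minimal sketch: caption normalization with neologdn and tokenization with
# SudachiPy. The default dictionary and split mode C are assumptions.
import neologdn
from sudachipy import dictionary, tokenizer

sudachi = dictionary.Dictionary().create()
MODE_C = tokenizer.Tokenizer.SplitMode.C

def preprocess_caption(text: str) -> list[str]:
    normalized = neologdn.normalize(text)  # e.g. unify full-/half-width characters
    return [m.surface() for m in sudachi.tokenize(normalized, MODE_C)]

print(preprocess_caption("今日は　いい天気ですね！"))
```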
We employ a Transformer model for training, which takes the sequence of visual features as input and generates the translated sentence as output.
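The paper does not report the model's hyper-parameters, so the sketch below only shows the general shape of such a feature-to-text Transformer using torch.nn.Transformer; all dimensions, layer counts, and the vocabulary size are illustrative assumptions (positional encodings are omitted for brevity).

```python
# Minimal sketch of a feature-to-text Transformer for SLT. All sizes are
# illustrative assumptions; positional encodings are omitted for brevity.
import torch
import torch.nn as nn

class SLTTransformer(nn.Module):
    def __init__(self, feat_dim=1280, d_model=512, vocab_size=8000):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)  # video features -> model dim
        self.embed = nn.Embedding(vocab_size, d_model)  # target token embeddings
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, tgt_tokens):
        # feats: (B, T, feat_dim) frame features; tgt_tokens: (B, L) token ids
        src = self.input_proj(feats)
        tgt = self.embed(tgt_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(out)  # (B, L, vocab_size) next-token logits

model = SLTTransformer()
logits = model(torch.rand(2, 16, 1280), torch.randint(0, 8000, (2, 10)))
```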
To measure translation performance, we use the BiLingual Evaluation Understudy (BLEU) score, a standard metric for machine translation.
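For reference, corpus-level BLEU-1 through BLEU-4 can be computed as below with NLTK; the paper does not state which BLEU implementation it used, so both the library and the smoothing choice are assumptions.

```python
# Minimal sketch: corpus-level BLEU-1..4 over tokenized hypotheses and
# references with NLTK. Library and smoothing choice are assumptions.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["今日", "は", "晴れ", "です"]]]  # one list of references per sentence
hypotheses = [["今日", "は", "晴れ"]]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    score = corpus_bleu(references, hypotheses,
                        weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")
```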
4 Results
Table 2 reports BLEU (B) scores for combinations of the text pre-processing techniques. The best performance was achieved by combining Text Normalization (TN), whitespace handling (w), and Sudachi Mode-C tokenization, with a BLEU-4 score of 4.54.

Table 2: BLEU scores with text pre-processing.

TN   w    B-1 (↑)   B-2 (↑)   B-3 (↑)   B-4 (↑)
          17.63     8.21      4.98      3.43
✓         17.11     7.99      4.86      3.36
     ✓    15.45     7.59      4.77      3.42
✓    ✓    17.50     9.15      6.09      4.54

5 Discussion
The results emphasize the impact of effective text pre-processing on JSL translation performance. YouTube captions require normalization to ensure quality input, and combining text normalization, whitespace handling, and Sudachi Mode-C tokenization yielded the best BLEU score.

6 Conclusions
This work proposed a Transformer-based model for JSL translation from YouTube videos, enhanced by specialized text pre-processing. EfficientNetB0 was used to extract spatial features, and Sudachi tokenization improved caption handling. The BLEU score improvements confirm the value of pre-processing in low-resource sign language settings. Future efforts will focus on expanding the dataset and improving the temporal alignment between video and text.

References
[1] N. C. Camgoz, O. Koller, S. Hadfield, and R. Bowden, “Sign language transformers: Joint end-to-end sign language recognition and translation,” 2020.
[2] S. Gueuwou, X. Du, G. Shakhnarovich, K. Livescu, and A. H. Liu, “SHuBERT: Self-supervised sign language representation learning via multi-stream cluster prediction,” 2025.
[3] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” 2020.
[4] K. Takaoka, S. Hisamoto, N. Kawahara, M. Sakamoto, Y. Uchida, and Y. Matsumoto, “Sudachi: A Japanese tokenizer for business,” 2018.
∗ The author is supported by the SESS MEXT Scholarship.