Page 80 - 2025S
UEC Int’l Mini-Conference No.54
Detection Model for Audio Deepfakes Using Self-Supervised Learning
to Prevent Identity Spoofing Attacks
Maria Vianney MEDINA *1 and Minami YASUHIRO 2
1 UEC Exchange Study Program (JUSST Program)
2 Yasuhiro Minami’s Department
The University of Electro-Communications, Tokyo, Japan
Keywords: Voice deepfake, Self-Supervised Learning (SSL), Spoofing attacks, Synthetic audio,
Machine learning.
Abstract
The advancement of generative speech models has enabled the creation of highly convincing synthetic
audio, known as voice deepfakes. These artificially generated utterances pose significant risks to security
systems, biometric authentication, and the credibility of digital communications. This paper presents
a method for detecting voice deepfakes by leveraging HuBERT, a self-supervised speech representation
model. The approach involves extracting latent acoustic embeddings from raw audio using HuBERT,
followed by classification through a neural network. The model is trained to distinguish bona fide speech
from spoofed audio, exploiting subtle inconsistencies introduced by generative processes. The system is
evaluated on benchmark datasets and compared against traditional handcrafted features such as MFCC
and CQCC, as well as alternative neural back-ends. Experimental results are expected to demonstrate
the effectiveness of HuBERT representations in detecting various types of audio spoofing attacks and
highlight the potential of self-supervised learning for secure and generalizable deepfake detection.
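
The two-stage approach described above (frame-level embeddings, then a classifier) can be illustrated with a minimal sketch. It is not the authors' implementation. The HuBERT front-end is not run here: the hypothetical helper `synthetic_frames` generates Gaussian stand-in embeddings. A single logistic unit stands in for the neural back-end, which the abstract does not specify. Mean pooling over time and the label convention (1 = spoofed, 0 = bona fide) are also assumptions made for illustration.

```python
import math
import random


def mean_pool(frames):
    """Average frame-level embeddings (T x D) into one D-dimensional utterance vector."""
    n_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, epochs=200, lr=0.5):
    """Fit a single logistic unit by batch gradient descent on cross-entropy."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, target in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - target  # gradient of cross-entropy w.r.t. the logit
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def spoof_probability(w, b, frames):
    """Score one utterance: pool its frames, then apply the trained unit."""
    x = mean_pool(frames)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def synthetic_frames(label, rng, n_frames=20, dim=8):
    """Hypothetical stand-in for HuBERT output (assumption, not real features):
    Gaussian frames whose mean depends on the label."""
    shift = 0.5 if label == 1 else -0.5
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n_frames)]


def demo(seed=0):
    """Train on 60 synthetic utterances and return accuracy on 20 held-out ones."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(80)]  # assumed convention: 1 = spoofed, 0 = bona fide
    utterances = [synthetic_frames(t, rng) for t in labels]
    X = [mean_pool(u) for u in utterances]
    w, b = train_logistic(X[:60], labels[:60])
    held_out = list(zip(utterances[60:], labels[60:]))
    correct = sum((spoof_probability(w, b, u) >= 0.5) == (t == 1) for u, t in held_out)
    return correct / len(held_out)


if __name__ == "__main__":
    print("held-out accuracy on synthetic data:", demo())
```

In a real system the stand-in frames would be replaced by embeddings extracted from raw audio with a pretrained HuBERT model, and the logistic unit by whichever neural back-end is under evaluation. Accuracy on this synthetic data says nothing about performance on actual spoofing benchmarks.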
* The author is supported by a JASSO Scholarship.