
UEC Int'l Mini-Conference No.54

Detection Model for Audio Deepfakes Using Self-Supervised Learning
to Prevent Identity Spoofing Attacks

¹Medina Castro Maria Vianney, ²Prof. Minami Yasuhiro
¹Instituto Politécnico Nacional, Mexico; UEC Exchange Study Program JUSST,
Department of Computer and Network Engineering
²Graduate School of Informatics and Engineering
The University of Electro-Communications, Japan


INTRODUCTION

Voice cloning techniques based on artificial intelligence have advanced significantly, enabling the generation of synthetic audio that is nearly indistinguishable from real human voices. These artificial voices, known as audio deepfakes, pose a growing threat to digital security and the authenticity of communications, as they can be used to carry out identity spoofing, manipulate conversations, or spread misinformation.

METHODOLOGY



Fig. 3 Visualization of the frequency spectrum over time for a real audio signal (a) and a fake one (b).
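The frequency-spectrum-over-time view in Fig. 3 is a standard magnitude spectrogram. As a minimal sketch (numpy only; the paper does not specify which tool produced the figure, so the frame length and hop size below are illustrative choices):

```python
import numpy as np

def spectrogram(signal, frame_len=512, hop=256):
    """Magnitude spectrogram via a windowed short-time FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One-sided FFT: rows are frequency bins, columns are time frames,
    # matching the layout of Fig. 3.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Toy input: a 440 Hz tone sampled at 16 kHz for one second.
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
```

For a pure tone, the energy concentrates in a single horizontal band; for a fake-vs-real comparison as in Fig. 3, differences typically show up in high-frequency detail and spectral texture.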
EXPERIMENTAL RESULTS AND CONCLUSIONS

The system distinguished between genuine and synthetic audio with high accuracy. Representations extracted with HuBERT were used for classification. The results achieved an F1-score of up to 0.905 with a low error rate.
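The F1-score summarizes precision and recall in a single number and can be computed directly from confusion-matrix counts. A minimal sketch (the counts below are illustrative only, not the setting that produced the reported 0.905):

```python
def f1_from_confusion(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall,
    which simplifies to 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# Illustrative counts: 1405 fakes caught, 3 real clips
# wrongly flagged, 8 fakes missed.
score = f1_from_confusion(tp=1405, fp=3, fn=8)
```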
Fig. 1 Process of extracting features from an audio signal using HuBERT.

The processing flow of an audio signal is shown, up to obtaining embeddings using a CNN encoder and a transformer.
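HuBERT emits one embedding per audio frame, so a pooling step is usually needed to turn the variable-length sequence into a single vector for a classifier. A minimal sketch of that step, assuming mean pooling and HuBERT-base's 768-dimensional frames (the paper does not state which pooling it uses):

```python
import numpy as np

def pool_embeddings(frame_embeddings):
    """Collapse frame-level SSL features (T x D) into one
    utterance-level vector by mean pooling over time."""
    frame_embeddings = np.asarray(frame_embeddings)
    return frame_embeddings.mean(axis=0)

# HuBERT-base produces 768-dim frames at roughly 50 Hz;
# simulate ~2 s of audio with random stand-in features.
frames = np.random.default_rng(0).normal(size=(100, 768))
utt_vec = pool_embeddings(frames)
```

The resulting fixed-size vector is what a downstream real-vs-fake classifier consumes, regardless of the clip's duration.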



Fig. 4 Representation of the embeddings.
Fig. 5 Classifier confusion matrix (1410, 3; 8, 1405).

                                                                         Table 1. Results for SSL feature selection







Fig. 2 Process of classifying audio characteristics to determine their authenticity.
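The classification step in Fig. 2 maps each pooled embedding to a real/fake label. The paper's classifier is not specified in this excerpt, so the sketch below uses a simple nearest-centroid stand-in (numpy only) to illustrate the decision step on toy 2-D embeddings:

```python
import numpy as np

def fit_centroids(X, y):
    """Store one mean embedding per class (0 = real, 1 = fake)."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(centroids, x):
    """Assign the label of the nearest class centroid (Euclidean)."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Toy 2-D embeddings standing in for pooled HuBERT features:
# class 0 clusters near (0, 0), class 1 near (1, 1).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(1.0, 0.1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
model = fit_centroids(X, y)
```

A real deployment would replace the centroid rule with a trained classifier, but the interface (embedding in, label out) is the same.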
As future work, the model will be evaluated in various environments, different languages, and more databases.

References
[1] Abdeldayem, M. (2024). The Fake-or-Real (FoR) Dataset (deepfake audio) [Data set].