
UEC Int'l Mini-Conference No.52

Optimal Feature Extractor for Video Anomaly Detection in Mobile Devices

Jonathan FLORES*1, Gibran BENITEZ2, and Hiroki TAKAHASHI2
1 UEC Exchange Study Program (JUSST Program)
2 Department of Informatics
The University of Electro-Communications, Tokyo, Japan






                                                       Abstract

Video Anomaly Detection (VAD) is essential for enhancing video surveillance, particularly in complex real-world environments using mobile devices, where computational efficiency is crucial. In this paper, we evaluate five feature extractors: Inflated 3D ConvNets (I3D), 3D Convolutional Neural Networks (C3D), Unified Transformer (UniFormer-S and UniFormer-B), and Temporal Shift Module (TSM). Each is integrated into a VAD architecture based on Bidirectional Encoder Representations from Transformers (BERT) with Multiple Instance Learning (MIL). UniFormer-S stands out, processing 4.64 clips per second at a computational cost of 28.717 GFLOPs on edge devices such as the Jetson Orin NX (8 GB RAM, 20 W power). On the UCF-Crime dataset, UniFormer-S combined with BERT + MIL achieves an AUC of 79.74%, highlighting its balance of performance and efficiency. These findings underscore the potential of UniFormer-S on edge devices for effective VAD in complex, real-world mobile environments.

            Keywords: Video anomaly detection, Edge device, Feature extractor, Mobile devices

1    Introduction

In recent years, Video Anomaly Detection (VAD) has emerged as a crucial field for improving video surveillance, especially in complex real-world environments where computational efficiency is essential for mobile devices. The main goal of VAD is to determine when an anomaly occurs (an event that deviates from normal behavior) and to distinguish it from normal events. A weakly supervised learning paradigm is generally applied in this field, using video-level labels during training and frame-level annotations during testing. A notable proposal within this paradigm is Multiple Instance Learning (MIL), proposed by Sultani et al. [1]. This method classifies video segments into positive and negative bags to distinguish between normal and anomalous behaviors using a pre-trained feature extractor. By emphasizing instances with the highest scoring discrepancies, MIL ensures accurate and reliable anomaly detection.

Although MIL is a prominent example, other methodologies in the field also demonstrate notable precision [1-4, 7, 10]. However, their implementation in real-world environments presents significant challenges, as they require demanding hardware and high energy consumption, complicating their integration into mobile devices. A common component of most VAD methodologies is the feature extractor, which is key to determining whether a model can correctly detect anomalies and also influences the model's resource consumption. Therefore, for integration into a mobile device, it is necessary to choose the best feature extractor.

To address these challenges, we evaluated five popular feature extractors used in VAD: Inflated 3D ConvNets (I3D), 3D Convolutional Neural Networks (C3D), Unified Transformer

* The author is supported by JASSO Scholarship.
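To make the MIL idea concrete, the following is a minimal sketch of the bag-level ranking objective in the spirit of Sultani et al. [1]: a hinge loss that pushes the top-scoring segment of an anomalous video above the top-scoring segment of a normal video, plus the temporal-smoothness and sparsity terms applied to the anomalous bag. The function name and hyperparameter values are illustrative, not taken from the paper.

```python
import numpy as np

def mil_ranking_loss(anomalous_scores, normal_scores,
                     lambda_smooth=8e-5, lambda_sparse=8e-5):
    """Hinge ranking loss over the highest-scoring instance of each bag,
    with smoothness and sparsity regularizers on the anomalous bag.
    Scores are per-segment anomaly predictions in [0, 1]."""
    a = np.asarray(anomalous_scores, dtype=float)
    n = np.asarray(normal_scores, dtype=float)
    # Rank the most anomalous segment above the most suspicious normal one.
    hinge = max(0.0, 1.0 - a.max() + n.max())
    # Temporal smoothness: adjacent segments should score similarly.
    smooth = lambda_smooth * float(np.sum(np.diff(a) ** 2))
    # Sparsity: anomalies should occupy few segments of the video.
    sparse = lambda_sparse * float(np.sum(a))
    return hinge + smooth + sparse
```

With well-separated bags the hinge term vanishes, e.g. `mil_ranking_loss([1.0, 0.0], [0.0, 0.0], 0.0, 0.0)` is 0.0, while overlapping score ranges are penalized.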