(UniFormer-S and UniFormer-B), and Temporal Shift Module (TSM). These were integrated into a VAD architecture using Bidirectional Encoder Representations from Transformers (BERT) with Multiple Instance Learning (MIL), an improvement of the original MIL model [1] proposed by Weijun et al. [2].

UniFormer-S stood out, processing 4.64 clips per second with a computational demand of 28.717 GFLOPs on edge devices such as the Jetson Orin NX (8 GB RAM, 20 W power). On the UCF-Crime dataset, UniFormer-S combined with BERT + MIL achieved an AUC of 79.74%, highlighting its balance between performance and efficiency.

These findings underscore the potential of UniFormer-S and edge devices for effective VAD implementation in complex, real-world mobile environments.

2    Video Anomaly Detection

The study of Video Anomaly Detection (VAD) has gained popularity due to its extensive application in security settings, where efficiently detecting anomalous events has become a common requirement. Recent research has focused on weakly supervised learning.

2.1   Weakly Supervised Anomaly Detection

Weakly supervised learning uses limited labels to train deep learning models. In VAD, this involves training with video-level labels, where each video is assigned a binary label indicating whether it is normal or anomalous. During testing, the videos are labeled with temporal annotations indicating the exact moments where anomalies occur, providing a detailed evaluation of the model's performance.
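To make this evaluation protocol concrete, the sketch below compares per-frame anomaly scores against frame-level temporal annotations using ROC-AUC, the metric reported above for UCF-Crime. The arrays are hypothetical placeholders, not results from the paper.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    # Hypothetical frame-level ground truth for one test video
    # (1 = anomalous frame, 0 = normal) and per-frame anomaly
    # scores produced by some trained wVAD model.
    frame_labels = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])
    frame_scores = np.array([0.1, 0.2, 0.15, 0.8, 0.9,
                             0.85, 0.7, 0.3, 0.2, 0.1])

    # Frame-level ROC-AUC, the figure of merit commonly reported on UCF-Crime.
    print(f"frame-level AUC: {roc_auc_score(frame_labels, frame_scores):.4f}")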
Since the goal of VAD, and of weakly supervised VAD (wVAD) in particular, is to determine when an anomaly occurs in a video, the video is first divided into clips or snippets, and each is processed by a feature extractor to obtain informative features. Depending on the model, these features may or may not be post-processed. Finally, the resulting features are analyzed by an anomaly classification model using loss functions designed for the methodology.
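This generic pipeline can be summarized in a few lines. The sketch below is a minimal illustration only; feature_extractor and classifier are hypothetical pretrained modules, not components named in the works cited here.

    import torch

    def detect_anomalies(video: torch.Tensor, feature_extractor, classifier,
                         clip_len: int = 16) -> torch.Tensor:
        """Generic wVAD inference sketch for a video of shape (T, C, H, W)."""
        # 1. Divide the video into non-overlapping fixed-length clips (snippets).
        usable = video.shape[0] // clip_len * clip_len
        clips = video[:usable].reshape(-1, clip_len, *video.shape[1:])

        with torch.no_grad():
            # 2. Extract one informative feature vector per clip.
            feats = feature_extractor(clips)        # (N, D)

            # 3. (Optional) post-process features here, e.g. resample them
            #    to a fixed number of segments, as MIL-style methods do.

            # 4. Score each clip with the anomaly classification model.
            scores = classifier(feats).squeeze(-1)  # (N,) anomaly scores
        return scores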
A clear example of such a loss-driven design is Multiple Instance Learning (MIL) [1], where the authors post-process the features to fix a constant number of features per video. Normal and anomalous videos are grouped into sets called bags, with positive bags containing features from anomalous videos and negative bags containing features from normal videos. MIL uses a custom ranking loss function to differentiate between normal and anomalous behaviors.
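A minimal sketch of this ranking objective, following the hinge form described in [1]: the top-scoring segment of a positive bag should outrank the top-scoring segment of a negative bag, with smoothness and sparsity regularizers. The regularizer weights below are illustrative defaults, not necessarily those of the original paper.

    import torch

    def mil_ranking_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor,
                         lam_smooth: float = 8e-5, lam_sparse: float = 8e-5):
        """MIL ranking loss in the spirit of [1].

        pos_scores: segment scores of one anomalous video (positive bag), (S,)
        neg_scores: segment scores of one normal video (negative bag), (S,)
        """
        # Hinge ranking term between the maximum-scoring segments of the bags.
        hinge = torch.relu(1.0 - pos_scores.max() + neg_scores.max())

        # Temporal smoothness: adjacent segments should score similarly.
        smooth = ((pos_scores[1:] - pos_scores[:-1]) ** 2).sum()

        # Sparsity: anomalies are rare, so positive-bag scores should be sparse.
        sparse = pos_scores.sum()

        return hinge + lam_smooth * smooth + lam_sparse * sparse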
Recent models, such as [4], use transformer-based architectures, fine-tuning a Unified transformer (UniFormer) network [6] and integrating the extractor and classifier into an end-to-end method. Others, like [2], enhance models like MIL by integrating feature vectors with Bidirectional Encoder Representations from Transformers (BERT) [8]: snippets pre-calculated by a feature extractor are processed into segments and then passed to the BERT architecture to generate a unified feature vector, improving anomaly detection. However, these approaches present challenges for real-world integration due to their high computational requirements and complexity.
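A rough sketch of this aggregation step, using a standard transformer encoder in place of the exact BERT configuration of [2]; module names and sizes are assumptions for illustration.

    import torch
    import torch.nn as nn

    class SnippetAggregator(nn.Module):
        """Illustrative BERT-style aggregation of pre-computed snippet
        features into one unified video-level vector (sizes are assumed,
        positional encodings omitted for brevity)."""

        def __init__(self, feat_dim: int = 1024, n_heads: int = 8,
                     n_layers: int = 2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.cls = nn.Parameter(torch.zeros(1, 1, feat_dim))  # [CLS]-like token
            self.head = nn.Linear(feat_dim, 1)                    # anomaly score

        def forward(self, snippet_feats: torch.Tensor) -> torch.Tensor:
            # snippet_feats: (B, S, D) segment features from a frozen extractor.
            b = snippet_feats.shape[0]
            tokens = torch.cat([self.cls.expand(b, -1, -1), snippet_feats], dim=1)
            encoded = self.encoder(tokens)
            unified = encoded[:, 0]  # unified video-level feature vector
            return torch.sigmoid(self.head(unified)).squeeze(-1)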
While each architecture addresses different challenges, most share a common dependency on feature extractors. These models require pre-calculated features before analyzing anomalies, highlighting the importance of selecting the best feature extractor to enhance the viability of practical applications.

2.2   Feature Extractors and Their Role in VAD

In VAD, feature extractors derived from Video Action Recognition (VAR) are used to identify and classify human actions for anomaly detection. Commonly used feature extractors in VAD include:

3D Convolutional Neural Networks (C3D) [9]: These capture temporal information through 3D convolution and pooling, preserving temporal context and making them effective for tasks involving motion dynamics. Sultani et al. [1] used 3D ConvNets to extract spatiotemporal information from raw RGB video clips.