Page 35 - 2024S

28                                                                UEC Int’l Mini-Conference No.52

Figure 1: Block diagram of the BERT + MIL architecture. Pink arrows indicate the training phase,
blue arrows represent the inference or testing phase, and black arrows show the common process.
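
The training path of Figure 1 combines two loss terms, described in Section 2.3: a binary cross-entropy (BCE) loss on the BERT classification embedding and the MIL ranking loss. As a minimal sketch of how these terms might be added together, the Python below assumes the hinge form of the ranking loss commonly used with MIL (the highest-scoring snippet of an abnormal video should outrank the highest-scoring snippet of a normal video) and leaves out any regularization terms. The function names and the `bce_weight` parameter are hypothetical and are not taken from [2].

```python
import math


def mil_ranking_loss(abnormal_scores, normal_scores, margin=1.0):
    """Hinge ranking loss between the top-scoring snippet of each bag.

    Assumption: a plain max-vs-max hinge with no smoothness or sparsity terms.
    """
    return max(0.0, margin - max(abnormal_scores) + max(normal_scores))


def bce_loss(prob, label, eps=1e-7):
    """Binary cross-entropy for a single video-level probability."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def combined_loss(abnormal_scores, normal_scores,
                  abnormal_cls, normal_cls, bce_weight=1.0):
    """Ranking loss on snippet scores plus weighted BCE on video-level scores.

    abnormal_cls / normal_cls stand in for the probability derived from the
    classification embedding of each video; bce_weight is a hypothetical knob.
    """
    ranking = mil_ranking_loss(abnormal_scores, normal_scores)
    classification = bce_loss(abnormal_cls, 1) + bce_loss(normal_cls, 0)
    return ranking + bce_weight * classification


if __name__ == "__main__":
    # One abnormal and one normal video, two snippets each (toy numbers).
    print(combined_loss([0.1, 0.9], [0.2, 0.3], abnormal_cls=0.8, normal_cls=0.1))
```

In this sketch the ranking term only looks at snippet-level scores, while the BCE term supervises the whole video, which is the role Section 2.3 attributes to the classification embedding.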



Inflated 3D ConvNets (I3D) [5]: I3D extends 2D ConvNets into 3D, adding a temporal dimension to capture spatiotemporal information. Recent research [2, 3] has replaced C3D with I3D, extracting information from the Mixed_5c layer.

Unified Transformer (UniFormer) [6]: UniFormer combines 3D convolution and spatiotemporal self-attention, capturing global and local dependencies. Karim et al. [4] use UniFormer to extract features from 32-frame clips, optimizing anomaly detection by comparing features of normal and abnormal videos.

Temporal Shift Module (TSM) [11]: TSM shifts portions of the feature map channels in time, allowing 2D networks to capture temporal dependencies without increasing computational complexity. Öztürk et al. [10] proposed ADNet, using features from I3D and TSM to enhance accuracy in detecting anomalous segments.

Given the variety of feature extractors, the optimal model for anomaly detection in real-world environments remains an open question. This study evaluates several extractors to determine which performs best in practical applications.

2.3   BERT + MIL

To determine the most suitable feature extractor for integration in real-world scenarios using a mobile device, we selected an anomaly detector proposed by Weijun et al. [2]. This detector can utilize various feature extractors and allows for implementations with only MIL and with BERT + MIL, improving VAD analysis in a single shot.

Weijun et al. [2] observed that previous research often overlooked full video classification. They proposed incorporating full video classification supervision into the MIL [1] framework (Figure 1) by aggregating features from video snippets into a unified classification embedding (y_cls).

Using BERT (Bidirectional Encoder Representations from Transformers) [8], a model that captures contextual relationships bidirectionally, they combined snippet features into a classification embedding to enhance anomaly detection. They applied binary cross-entropy (BCE) loss to the BERT embedding in MIL, along with the MIL ranking loss.

This approach improved wVAD methodologies by integrating full video classification, increasing model accuracy and performance. It also provided flexibility to incorporate any feature extractor, allowing evaluation of different extractors to determine the most effective one for weakly supervised video anomaly detection.

3    Experimental Setup

As described in [2] and shown in Figure 1, the methodology is divided into two processes: training and inference. During training, N raw video clips of 16 consecutive non-overlapping frames with a central crop of 224x224 pixels in RGB format are extracted. The features from each clip are then