(UniFormer-S and UniFormer-B), and Temporal Shift Module (TSM). These were integrated into a VAD architecture using Bidirectional Encoder Representations from Transformers (BERT) with Multiple Instance Learning (MIL), proposed by Weijun et al. [2] as an improvement on the original MIL model [1]. UniFormer-S stood out, processing 4.64 clips per second with a computational demand of 28.717 GFLOPs on edge devices like the Jetson Orin NX (8 GB RAM, 20 W power). On the UCF-Crime dataset, UniFormer-S combined with BERT + MIL achieved an AUC of 79.74%, highlighting its balance between performance and efficiency.

These findings underscore the potential of UniFormer-S and edge devices for effective VAD implementation in complex, real-world mobile environments.

2 Video Anomaly Detection

The study of Video Anomaly Detection (VAD) has gained popularity due to its extensive application in various security settings, where efficiently detecting anomalous events has become a common requirement. Recent research has focused on weakly supervised learning.

2.1 Weakly Supervised Anomaly Detection

Weakly supervised learning uses limited labels to train deep learning models. In VAD, this involves training with video-level labeling, where each video is assigned a binary label indicating whether it is normal or anomalous. During testing, the videos are labeled with temporal annotations indicating the exact moments at which anomalies occur, providing a detailed evaluation of the model's performance.
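To make this protocol concrete, the following is a minimal sketch of frame-level evaluation under this setup, assuming per-frame anomaly scores and ground-truth temporal annotations; the function and variable names are illustrative and not taken from the cited implementations.

# Sketch of the weakly supervised evaluation protocol: training uses
# only video-level labels, while testing compares per-frame anomaly
# scores against temporal ground truth. Names are illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(per_video_scores, per_video_ranges):
    """per_video_scores: list of 1-D score arrays, one per test video.
    per_video_ranges: list of lists of (start, end) anomalous frame ranges."""
    all_scores, all_labels = [], []
    for scores, ranges in zip(per_video_scores, per_video_ranges):
        labels = np.zeros(len(scores), dtype=int)
        for start, end in ranges:            # mark annotated anomalous frames
            labels[start:end] = 1
        all_scores.append(scores)
        all_labels.append(labels)
    # Concatenate all test videos and compute one frame-level AUC,
    # the standard metric on UCF-Crime.
    return roc_auc_score(np.concatenate(all_labels), np.concatenate(all_scores))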
Since the goal of VAD, or weakly supervised VAD (wVAD), is to determine when an anomaly occurs in a video, the video is first divided into clips or snippets, and each is processed by feature extractors to obtain informative features. Depending on the model, these features may or may not be post-processed. Finally, the resulting features are analyzed by an anomaly classification model using loss functions designed for the methodology. A clear example is Multiple Instance Learning (MIL) [1], where the authors post-process the features to fix a constant number of features per video. Normal and anomalous videos are grouped into sets called bags, with positive bags containing features from anomalous videos and negative bags containing features from normal videos. MIL uses a custom ranking loss function to differentiate between normal and anomalous behaviors.
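A sketch of this ranking objective follows, including the temporal smoothness and sparsity terms used in [1]; the weight values shown here are illustrative.

# Sketch of the MIL ranking objective of Sultani et al. [1]: each bag
# holds scores for a fixed number of segments per video, and the
# highest-scoring segment of an anomalous (positive) bag is pushed
# above the highest-scoring segment of a normal (negative) bag.
# The smoothness/sparsity weights below are illustrative.
import torch

def mil_ranking_loss(pos_scores, neg_scores,
                     lambda_smooth=8e-5, lambda_sparse=8e-5):
    """pos_scores, neg_scores: tensors of shape (num_segments,) in [0, 1]."""
    hinge = torch.relu(1.0 - pos_scores.max() + neg_scores.max())
    # Temporal smoothness: adjacent segments of the anomalous video
    # should receive similar scores.
    smooth = ((pos_scores[1:] - pos_scores[:-1]) ** 2).sum()
    # Sparsity: anomalies are assumed to occupy only a few segments.
    sparse = pos_scores.sum()
    return hinge + lambda_smooth * smooth + lambda_sparse * sparse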
Recent models, such as [4], use transformer-based architectures, fine-tuning a Unified Transformer (UniFormer) network [6] and integrating the extractor and classifier into an end-to-end method. Others, like [2], enhance models like MIL by integrating feature vectors with Bidirectional Encoder Representations from Transformers (BERT) [8], where snippets pre-calculated by a feature extractor are processed into segments and then passed to the BERT architecture to generate a unified feature vector, improving anomaly detection. However, these approaches present challenges for real-world integration due to their high computational requirements and complexity.
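This aggregation step can be sketched as follows, using PyTorch's generic transformer encoder as a simplified stand-in for BERT; dimensions, layer counts, and class names are illustrative rather than those of [2].

# Simplified sketch of BERT-style aggregation as in [2]: pre-computed
# snippet features are treated as a token sequence and pooled into one
# unified video-level vector via a [CLS]-like token. A generic
# transformer encoder stands in for BERT here.
import torch
import torch.nn as nn

class SnippetAggregator(nn.Module):
    def __init__(self, feat_dim=1024, num_segments=32, num_layers=2, num_heads=8):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, feat_dim))           # [CLS]-like token
        self.pos = nn.Parameter(torch.zeros(1, num_segments + 1, feat_dim))
        layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, feats):                # feats: (batch, num_segments, feat_dim)
        cls = self.cls.expand(feats.size(0), -1, -1)
        x = torch.cat([cls, feats], dim=1) + self.pos   # add positional embeddings
        x = self.encoder(x)
        return self.head(x[:, 0])            # anomaly score from the unified vector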
While each architecture addresses different challenges, most share a common dependency on feature extractors. These models require pre-calculated features before analyzing anomalies, highlighting the importance of selecting the best feature extractor to enhance the viability of practical applications.

2.2 Feature Extractors and Their Role in VAD

In VAD, feature extractors derived from Video Action Recognition (VAR) are used to identify and classify human actions for anomaly detection. Commonly used feature extractors in VAD include:

3D Convolutional Neural Networks (C3D) [9]: These capture temporal information through 3D convolution and pooling, preserving temporal context and making them effective for tasks involving motion dynamics. Sultani et al. [1] used 3D ConvNets to extract spatiotemporal information from raw RGB video clips; a minimal extraction sketch is shown below.
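Since pretrained C3D weights are not bundled with torchvision, the sketch below uses torchvision's r3d_18 video model as a stand-in for a 3D ConvNet; the 16-frame snippet length matches common C3D usage, and all other choices are illustrative.

# Sketch of snippet-level feature extraction with a 3D CNN. C3D itself
# ships without torchvision weights, so r3d_18 is used as a stand-in;
# non-overlapping 16-frame clips match common C3D usage.
import torch
from torchvision.models.video import r3d_18

model = r3d_18(weights="DEFAULT")
model.fc = torch.nn.Identity()           # drop the classifier, keep features
model.eval()

def extract_clip_features(video):
    """video: float tensor (3, num_frames, H, W), normalized RGB."""
    clips = video.split(16, dim=1)        # non-overlapping 16-frame snippets
    feats = []
    with torch.no_grad():
        for clip in clips:
            if clip.size(1) == 16:        # skip a trailing partial clip
                feats.append(model(clip.unsqueeze(0)))   # (1, 512) feature
    return torch.cat(feats)               # (num_clips, 512)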