Page 33 - 2024S
26 UEC Int’l Mini-Conference No.52
Optimal Feature Extractor for Video Anomaly Detection in Mobile Devices
Jonathan FLORES*1, Gibran BENITEZ2, and Hiroki TAKAHASHI2
1 UEC Exchange Study Program (JUSST Program)
2 Department of Informatics
The University of Electro-Communications, Tokyo, Japan
Abstract
Video Anomaly Detection (VAD) is essential for enhancing video surveillance, particularly in
complex real-world environments using mobile devices where computational efficiency is crucial.
In this paper, we evaluated five feature extractors: Inflated 3D ConvNets (I3D), 3D Convolutional
Neural Networks (C3D), Unified Transformer (UniFormer-S and UniFormer-B), and Temporal Shift
Module (TSM). These were integrated into a VAD architecture using Bidirectional Encoder
Representations from Transformers (BERT) with Multiple Instance Learning (MIL). UniFormer-S
stood out, processing 4.64 clips per second with a computational demand of 28.717 GFLOPs on
edge devices like the Jetson Orin NX (8GB RAM, 20W power). On the UCF-Crime dataset,
UniFormer-S combined with BERT + MIL achieved an AUC of 79.74%, highlighting its balance of
performance and efficiency. These findings underscored the potential of UniFormer-S in edge
devices for effective VAD implementation in complex, real-world mobile environments.
Keywords: Video anomaly detection, Edge device, Feature extractor, Mobile devices
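As a rough illustration of the MIL objective named in the abstract, the sketch below computes a hinge ranking loss between the highest-scoring segment of an anomalous (positive) bag and that of a normal (negative) bag. It is a minimal sketch in the spirit of the MIL formulation of Sultani et al. [1], not the BERT + MIL architecture evaluated in this paper. The function name `mil_ranking_loss`, the margin of 1.0, and the two optional regularization weights are assumptions chosen for illustration.

```python
from typing import Sequence


def mil_ranking_loss(
    pos_scores: Sequence[float],
    neg_scores: Sequence[float],
    margin: float = 1.0,           # assumed margin, illustrative only
    smooth_weight: float = 0.0,    # assumed weight for temporal smoothness
    sparsity_weight: float = 0.0,  # assumed weight for score sparsity
) -> float:
    """Hinge ranking loss between the top-scoring segment of each bag.

    pos_scores: per-segment anomaly scores of a video labeled anomalous.
    neg_scores: per-segment anomaly scores of a video labeled normal.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both bags need at least one segment score")

    # Only the highest-scoring segment of each bag enters the ranking term:
    # the top anomalous segment should outscore the top normal segment.
    hinge = max(0.0, margin - max(pos_scores) + max(neg_scores))

    # Optional regularizers on the anomalous bag: adjacent segments should
    # score similarly, and only a few segments should score high.
    smooth = sum((a - b) ** 2 for a, b in zip(pos_scores, pos_scores[1:]))
    sparsity = sum(pos_scores)

    return hinge + smooth_weight * smooth + sparsity_weight * sparsity


# Hypothetical scores for four segments of one anomalous and one normal video.
anomalous_bag = [0.1, 0.2, 0.9, 0.3]
normal_bag = [0.1, 0.2, 0.1, 0.05]
loss = mil_ranking_loss(anomalous_bag, normal_bag)
```

Because only video-level labels are needed to form the two bags, a loss of this kind fits the weakly supervised setting: no segment inside the anomalous video has to be annotated during training.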
* The author is supported by JASSO Scholarship.

1 Introduction

In recent years, Video Anomaly Detection (VAD) has emerged as a crucial field for improving
video surveillance, especially in complex real-world environments where computational
efficiency is essential for mobile devices.

The main goal of VAD is to determine when an anomaly occurs (an event that deviates from
normal behavior) and distinguish it from normal events. Generally, a learning paradigm called
weakly supervised is applied in this field, which uses video-level labels during training and
frame-level annotations during testing. A notable proposal within this paradigm is Multiple
Instance Learning (MIL), proposed by Sultani et al. [1]. This method classifies video segments
into positive and negative bags to distinguish between normal and anomalous behaviors using
a pre-trained feature extractor. By emphasizing instances with the highest scoring
discrepancies, MIL ensures accurate and reliable anomaly detection.

Although MIL is a prominent example, there are other methodologies in the field that also
demonstrate notable precision [1–4,7,10]. However, their implementation in real-world
environments presents significant challenges, as they require demanding hardware and high
energy consumption, complicating their integration into mobile devices. A common component
in most VAD methodologies is the feature extractor, which is key to determining if a model
can correctly detect anomalies and also influences the model’s resource consumption.
Therefore, for integration into a mobile device, it is necessary to choose the best feature
extractor.

To address these challenges, we evaluated five popular feature extractors used in VAD:
Inflated 3D ConvNets (I3D), 3D Convolutional Neural Networks (C3D), Unified Transformer