
SOV-STG-VLA: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor

Junwen Chen∗, Yingcheng Wang, and Keiji Yanai

Department of Informatics
The University of Electro-Communications, Tokyo, Japan

∗The author is supported by the (Program name) MEXT Scholarship.

Keywords: Human-object Interaction Detection, Transformer

1  Introduction

Recent transformer-based human-object interaction detection (HOID) methods leverage DETR and VLM priors but suffer from long training and complex optimization due to the entangled object detection and HOI recognition. Moreover, the ambiguous query embeddings and the gap between verb labels and HOI labels remain unaddressed.

To this end, we propose SOV-STG-VLA with three key components: SOV decoding to disentangle the two tasks, Specific Target Guided (STG) denoising for efficient training, and a Vision Language Advisor (VLA) to integrate VLM knowledge. The VLA decoder enhances the interaction representation with a Verb-HOI Bridge, achieving SOTA performance in one-sixth of the training epochs.





Figure 1: The overview of SOV-STG-VLA.

2  Method

As shown in Fig. 1, our SOV-STG consists of STG label prior initialization, subject-object detection, and verb recognition. The STG-trained label embeddings initialize the label queries Q_ov. The subject and object decoders refine the anchor boxes B_s and B_o, while the verb boxes B_v are generated via the adaptive shifted minimum bounding rectangle (ASMBR). SOV-STG-VLA extends SOV-STG by enriching the verb embeddings E_v with the VLA, which integrates global context from the feature extractor, a pretrained VLM, and the spatial information of the verb boxes. The V-HOI Bridge links the HOI and verb label predictions for enhanced interaction learning.
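To make the disentangled decoding flow concrete, the following PyTorch sketch wires the components together. The module names, signatures, and query dimensions are illustrative assumptions, not the paper's released implementation; the asmbr helper is sketched in the ASMBR example further below.

```python
import torch
import torch.nn as nn

class SOVDecoding(nn.Module):
    """Illustrative disentangled SOV decoding flow (hypothetical wiring)."""

    def __init__(self, so_decoder, verb_decoder, hidden_dim=256, num_queries=64):
        super().__init__()
        self.so_decoder = so_decoder      # subject-object decoder stack
        self.verb_decoder = verb_decoder  # verb decoder with S-O attention
        # Label queries Q_ov, initialized from STG-trained label embeddings.
        self.q_ov = nn.Embedding(num_queries, hidden_dim)

    def forward(self, memory, anchors):
        # 1) Subject-object detection: refine the anchor boxes B_s and B_o.
        emb_s, emb_o, box_s, box_o = self.so_decoder(self.q_ov.weight, memory, anchors)
        # 2) Verb boxes B_v derived from the refined boxes (see the ASMBR sketch).
        box_v = asmbr(box_s, box_o)
        # 3) Verb recognition conditioned on both embeddings and B_v.
        emb_v = self.verb_decoder(emb_s, emb_o, box_v, memory)
        return box_s, box_o, box_v, emb_v
```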
Subject and Object Detection.  We leverage a hierarchical backbone and a deformable transformer encoder [2] as the feature extractor.

Verb Decoder with S-O attention module.  We introduce S-O attention to fuse the subject and object embeddings in a multi-layer manner.
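As one way to realize this fusion, the sketch below lets a verb query cross-attend to the subject and object embeddings in each layer; the exact layer design (residual form, normalization) is an assumption for illustration, not the paper's definition.

```python
import torch
import torch.nn as nn

class SOAttentionLayer(nn.Module):
    """One illustrative S-O fusion layer: the verb query attends to the
    subject and object embeddings, and the results are merged residually."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_o = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_v, emb_s, emb_o):
        # Fuse subject-side and object-side context into the verb embedding.
        fused_s, _ = self.attn_s(q_v, emb_s, emb_s)
        fused_o, _ = self.attn_o(q_v, emb_o, emb_o)
        return self.norm(q_v + fused_s + fused_o)
```

Stacking several such layers gives the multi-layer fusion described above.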
Verb box represented by ASMBR.  To constrain the verb feature extraction with positional information in the verb decoder, as shown in Fig. 1, we introduce a novel verb box, the Adaptive Shifted Minimum Bounding Rectangle (ASMBR), as the representation of the interaction region.
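A minimal sketch of such a verb box is given below: the minimum bounding rectangle (MBR) of the subject and object boxes, with its center adaptively shifted. The specific shift rule here is an assumption for illustration only; see the full paper for the exact ASMBR definition.

```python
import torch

def asmbr(box_s: torch.Tensor, box_o: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Illustrative verb box in (cx, cy, w, h) format: the MBR of the subject
    and object boxes, with its center pulled toward the midpoint of the two
    box centers (the shift rule is a hypothetical stand-in, not the paper's)."""
    def to_xyxy(b):
        cx, cy, w, h = b.unbind(-1)
        return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

    s, o = to_xyxy(box_s), to_xyxy(box_o)
    x1 = torch.minimum(s[..., 0], o[..., 0])
    y1 = torch.minimum(s[..., 1], o[..., 1])
    x2 = torch.maximum(s[..., 2], o[..., 2])
    y2 = torch.maximum(s[..., 3], o[..., 3])
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Adaptive shift: move the MBR center toward the subject-object midpoint.
    mid_x = (box_s[..., 0] + box_o[..., 0]) / 2
    mid_y = (box_s[..., 1] + box_o[..., 1]) / 2
    cx = cx + alpha * (mid_x - cx)
    cy = cy + alpha * (mid_y - cy)
    return torch.stack([cx, cy, w, h], dim=-1)
```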
Vision Advisor.  We leverage the visual encoder and Q-Former of BLIP-2 [1] to extract the image-level features.
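A minimal sketch of this feature extraction, assuming the HuggingFace transformers port of BLIP-2 and the blip2-opt-2.7b checkpoint (both assumptions, not the paper's released pipeline):

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2Model

# Frozen BLIP-2 vision encoder + Q-Former as the image-level feature source.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")
model.eval()

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    qformer_out = model.get_qformer_features(pixel_values=inputs.pixel_values)
# (1, num_query_tokens, hidden) features that would feed the VLA decoder.
image_feats = qformer_out.last_hidden_state
```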
Verb-HOI Bridge.  For our Verb-HOI Bridge (which also serves as the language advisor), we predict the HOI classes in a two-step manner to fill the gap between verb recognition and HOI recognition.
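One simple two-step scheme, sketched below under assumed tensor layouts: verb and object scores are predicted first, then composed into HOI-class scores through a fixed verb-object-to-HOI index map. The bridge in the paper also injects language-advisor knowledge, which is omitted here.

```python
import torch

def predict_hoi(verb_logits, obj_logits, verb_obj_to_hoi):
    """Two-step HOI prediction sketch (hypothetical tensor layout):
    score verbs and objects, then gather the valid HOI triplet classes."""
    verb_p = verb_logits.sigmoid()    # (N, num_verbs), multi-label verbs
    obj_p = obj_logits.softmax(-1)    # (N, num_objects)
    # Pairwise verb-object scores, then index the valid HOI combinations.
    pair = verb_p.unsqueeze(-1) * obj_p.unsqueeze(-2)  # (N, num_verbs, num_objects)
    v_idx, o_idx = verb_obj_to_hoi                     # index tensors of valid pairs
    return pair[..., v_idx, o_idx]                     # (N, num_hoi_classes)
```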
Figure 2: Comparison of the training convergence curves of the state-of-the-art methods on the HICO-DET dataset.

3  Results

In Fig. 2, we illustrate the training convergence of our method and of recent transformer-based methods. The results show that our STG strategy effectively accelerates the training convergence before the learning rate drops and ultimately improves the final performance.

4  Conclusions

We propose SOV-STG-VLA, a novel one-stage HOID framework with SOV decoding for target-specific processing and STG denoising for efficient training. The VLA aligns VLM knowledge with the verb embeddings, while the V-HOI Bridge enhances verb and HOI prediction. Our framework achieves SOTA performance with fast convergence and shows potential for broader vision-language tasks.

References

[1] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.

[2] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2020.