UEC Int’l Mini-Conference No.54
HOI-R1: Exploring the Potential of Multimodal Large Language
Models for Human-Object Interaction Detection
Junwen CHEN∗, Peilin Xiong, and Keiji YANAI
Department of Informatics
The University of Electro-Communications, Tokyo, Japan
Keywords: Human-Object Interaction Detection, Multimodal Large Language Model, Reinforcement Learning
Figure 1: Overview of our HOI-R1 framework.

Figure 2: Training convergence of HOI-R1 with Qwen2.5-VL-3B-Instruct on HICO-DET. The mAP of the Full category in the Default Setting is shown.
1 Introduction

Given an image, Human-Object Interaction Detection (HOID) methods predict a set of HOI instances represented as {B_h, B_o, Object Class, Interaction Class}. The bounding boxes B_h and B_o of Human-Object (HO) pairs are usually detected by an off-the-shelf object detector. Recently, Multimodal Large Language Models (MLLMs) [1] have shown great potential in understanding and generating complex visual and textual information.

Despite these advances, MLLMs remain underexplored for structured HOID tasks, where traditional HOID paradigms struggle with architectural complexity and annotation scarcity. We first explore their potential in HOID tasks; our contributions are summarized as follows:

• We introduce HOI-R1, the first MLLM framework that solves HOID end-to-end via natural language, eliminating object detectors.

• We introduce SFT with thinking distillation to extend the HOI knowledge, and a reinforcement learning (RL) [2] paradigm that aligns the MLLM on HOID with our HOI reward functions to further enhance performance.

• Compared with the baseline, HOI-R1 improves performance by a large margin and shows promising potential for further application in real-world scenarios.

2 Method

Figure 1 illustrates the framework of our HOI-R1. The input consists of two modalities: image and text. The question text consists of three parts: the task instruction provides basic information about the task, the reasoning guidance provides hints for the reasoning process, and the format example regularizes the output. First, a teacher MLLM is used to generate reasoning steps for Supervised Fine-Tuning (SFT). Then, in the Reinforcement Learning (RL) stage, the student MLLM, acting as the policy model, is trained with four reward signals.

3 Results

As shown in Figure 2, with HOI knowledge distillation the MLLM shows a significant performance boost. We then introduce RL for further alignment with four reward functions, including format rewards for output structure, object/interaction label accuracy rewards, and a one-to-one matching HOI IoU reward; the performance improves with only 100 training steps.

4 Conclusions

We present HOI-R1, the first pure MLLM framework for HOID tasks, which eliminates the need for object detectors. With our proposed SFT and RL paradigm, HOI-R1 achieves a significant performance boost on the HICO-DET dataset. Our results demonstrate the potential of MLLMs in structured tasks like HOID, paving the way for future research in this direction.

References

[1] Shuai Bai, et al. “Qwen2.5-VL technical report.” arXiv preprint arXiv:2502.13923, 2025.

[2] Zhihong Shao, et al. “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.” arXiv preprint arXiv:2402.03300, 2024.

∗ The author is supported by the (Program name) MEXT Scholarship.
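As an illustration of the structured output format {B_h, B_o, Object Class, Interaction Class} and the one-to-one matching HOI IoU reward, consider the following minimal sketch. The class and function names, the pair-IoU definition (minimum of human-box and object-box IoU, as commonly used in HOID evaluation), and the greedy matching are our assumptions for illustration, not the authors' exact implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class HOIInstance:
    """One HOI triplet: human box, object box, and labels (illustrative schema)."""
    b_h: Box
    b_o: Box
    object_class: str
    interaction_class: str

def iou(a: Box, b: Box) -> float:
    """Standard intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def hoi_iou_reward(preds: List[HOIInstance], gts: List[HOIInstance]) -> float:
    """Greedy one-to-one matching: each ground-truth instance is matched to at
    most one unused prediction with the same labels; the reward is the mean
    pair IoU (min of human-box and object-box IoU) over ground truths."""
    used = set()
    total = 0.0
    for gt in gts:
        best, best_iou = None, 0.0
        for i, p in enumerate(preds):
            if i in used:
                continue
            if (p.object_class, p.interaction_class) != (gt.object_class, gt.interaction_class):
                continue
            pair_iou = min(iou(p.b_h, gt.b_h), iou(p.b_o, gt.b_o))
            if pair_iou > best_iou:
                best, best_iou = i, pair_iou
        if best is not None:
            used.add(best)
            total += best_iou
    return total / len(gts) if gts else 0.0
```

A reward of this shape gives the policy a dense localization signal even when labels are produced as free-form text, since boxes only need to be parsed from the model's answer before scoring.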
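The three-part question text described in the Method section (task instruction, reasoning guidance, format example) could be assembled as in the sketch below. The wording of each part is our own placeholder, not the paper's actual prompt.

```python
def build_question_text(objects, interactions):
    """Assemble the three-part question text: task instruction,
    reasoning guidance, and a format example (illustrative wording)."""
    task_instruction = (
        "Detect all human-object interactions in the image. "
        f"Object classes: {', '.join(objects)}. "
        f"Interaction classes: {', '.join(interactions)}."
    )
    reasoning_guidance = (
        "Think step by step: first locate each person, then find the objects "
        "they interact with, and finally name each interaction."
    )
    format_example = (
        "Answer in the form: "
        '{"b_h": [x1, y1, x2, y2], "b_o": [x1, y1, x2, y2], '
        '"object": "...", "interaction": "..."}'
    )
    return "\n".join([task_instruction, reasoning_guidance, format_example])
```

The format example doubles as the target of the format reward: an answer that does not match this structure can be scored zero before any label or box comparison.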