UEC Int’l Mini-Conference No.54
HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection

Junwen CHEN∗, Peilin Xiong, and Keiji YANAI
Department of Informatics
The University of Electro-Communications, Tokyo, Japan

Keywords: Human-Object Interaction Detection, Multimodal Large Language Model, Reinforcement Learning

1  Introduction

Given an image, Human-Object Interaction Detection (HOID) methods predict a set of HOI instances represented as {B_h, B_o, Object Class, Interaction Class}. The bounding boxes B_h and B_o of Human-Object (HO) pairs are usually detected by an off-the-shelf object detector. Recently, multimodal large language models (MLLMs) [1] have shown great potential in understanding and generating complex visual and textual information.

Despite these advances, MLLMs remain underexplored for structured HOID tasks, where traditional HOID paradigms struggle with architectural complexity and annotation scarcity. We present a first exploration of their potential for HOID tasks. Our contributions are summarized as follows:

  • We introduce HOI-R1, the first MLLM framework that solves HOID end-to-end via natural language, eliminating object detectors.

  • We introduce supervised fine-tuning (SFT) with thinking distillation to extend the model's HOI knowledge, and a reinforcement learning (RL) [2] paradigm that aligns the MLLM on HOID with our HOI reward functions to further enhance performance.

  • Compared with the baseline, HOI-R1 improves performance by a large margin and shows promising potential for application in real-world scenarios.

2  Method

Figure 1 illustrates the HOI-R1 framework. The input consists of two modalities: image and text. The question text consists of three parts: the task instruction, which gives basic information about the task; the reasoning guidance, which provides hints for the reasoning process; and the format example, which regularizes the output. First, a teacher MLLM is used to generate reasoning steps for Supervised Fine-tuning (SFT). Then, in the Reinforcement Learning (RL) stage, the student MLLM, acting as the policy model, is trained with four reward signals.

Figure 1: Overview of our HOI-R1 framework.

3  Results

As shown in Figure 2, the MLLM shows a significant performance boost with HOI knowledge distillation. We then introduce RL for further alignment with four reward functions, covering output format, object and interaction label accuracy, and a one-to-one matching HOI IoU reward. With RL, performance improves after only 100 training steps.

Figure 2: Training convergence of HOI-R1 with Qwen2.5-VL-3B-Instruct on HICO-DET. The mAP of the Full category under the Default setting is shown.

4  Conclusions

We present HOI-R1, the first pure MLLM framework for HOID tasks, which eliminates the need for object detectors. With our proposed SFT and RL paradigm, HOI-R1 achieves a significant performance boost on the HICO-DET dataset. Our results demonstrate the potential of MLLMs in structured tasks like HOID, paving the way for future research in this direction.

References

[1] Shuai Bai, et al. “Qwen2.5-VL technical report.” arXiv preprint arXiv:2502.13923, 2025.
[2] Zhihong Shao, et al. “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.” arXiv preprint arXiv:2402.03300, 2024.

∗ The author is supported by (Program name) MEXT Scholarship.
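
Illustrative sketch (not part of the original abstract): the format reward mentioned in the Method and Results sections could be implemented roughly as follows. The abstract does not specify the output layout, so the <think>/<answer> tags, the JSON encoding, and the key names human_box, object_box, object, and interaction are all assumptions made only for illustration.

```python
import json
import re

# ASSUMPTION: the completion holds reasoning in <think>...</think> followed by
# a JSON list of HOI instances in <answer>...</answer>. The source abstract
# does not state the real format; this layout is hypothetical.
_LAYOUT = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)

# ASSUMPTION: hypothetical key names for one HOI instance.
REQUIRED_KEYS = {"human_box", "object_box", "object", "interaction"}


def parse_answer(completion):
    """Return the list of HOI dicts, or None if the assumed layout is violated."""
    match = _LAYOUT.fullmatch(completion)
    if match is None:
        return None
    try:
        instances = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    if not isinstance(instances, list):
        return None
    for inst in instances:
        # Every instance must be a dict carrying all required keys.
        if not isinstance(inst, dict) or not REQUIRED_KEYS.issubset(inst):
            return None
    return instances


def format_reward(completion):
    """Binary reward: 1.0 if the completion is well-formed, else 0.0."""
    return 1.0 if parse_answer(completion) is not None else 0.0
```

A binary score is only one possible design; a graded score that gives partial credit for correct tags with malformed JSON would be an equally plausible choice.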

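
Illustrative sketch (not part of the original abstract): a one-to-one matching HOI IoU reward, as named in the Results section, might look like the following. The abstract gives no formula, so several choices here are assumptions: boxes are (x1, y1, x2, y2), a human-object pair is scored by the minimum of its human IoU and object IoU, matching is greedy rather than optimal, and the total is normalized by the larger of the two set sizes so that extra or missing predictions lower the reward.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pair_iou(pred, gt):
    """ASSUMPTION: a pair is only as good as its worse-localized box."""
    return min(
        box_iou(pred["human_box"], gt["human_box"]),
        box_iou(pred["object_box"], gt["object_box"]),
    )


def hoi_iou_reward(preds, gts):
    """Greedy one-to-one matching of predicted and ground-truth HO pairs."""
    if not preds or not gts:
        return 0.0
    # Score every (prediction, ground truth) combination, best first.
    candidates = sorted(
        (
            (pair_iou(p, g), i, j)
            for i, p in enumerate(preds)
            for j, g in enumerate(gts)
        ),
        reverse=True,
    )
    used_preds, used_gts, total = set(), set(), 0.0
    for score, i, j in candidates:
        # Each prediction and each ground truth may be matched at most once.
        if i in used_preds or j in used_gts:
            continue
        used_preds.add(i)
        used_gts.add(j)
        total += score
    # ASSUMPTION: normalizing by the larger set penalizes duplicates and misses.
    return total / max(len(preds), len(gts))
```

Greedy matching keeps the sketch dependency-free; an optimal assignment (for example the Hungarian algorithm) would be a natural replacement if exact matching matters.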




