
SOV-STG-VLA: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor

Junwen Chen∗, Yingcheng Wang, and Keiji Yanai

Department of Informatics
The University of Electro-Communications, Tokyo, Japan

∗The author is supported by the (Program name) MEXT Scholarship.

Keywords: Human-object Interaction Detection, Transformer

1  Introduction

Recent transformer-based human-object interaction detection (HOID) methods leverage DETR and VLM priors but suffer from long training and complex optimization due to the entangled object detection and HOI recognition. Moreover, the ambiguous query embeddings and the gap between verb labels and HOI labels remain unaddressed.

To this end, we propose SOV-STG-VLA with three key components: SOV decoding to disentangle the two tasks, Specific Target Guided (STG) denoising for efficient training, and a Vision Language Advisor (VLA) to integrate VLM knowledge. The VLA decoder enhances the interaction representation with a Verb-HOI Bridge, achieving SOTA performance in one-sixth of the training epochs.





Figure 1: The overview of SOV-STG-VLA.

2  Method

As shown in Fig. 1, our SOV-STG consists of STG label prior initialization, subject-object detection, and verb recognition. The STG-trained label embeddings initialize the label queries Q_ov. The subject and object decoders refine the anchor boxes B_s and B_o, while the verb boxes B_v are generated via the adaptive shifted minimum bounding rectangle (ASMBR). SOV-STG-VLA extends SOV-STG by enriching the verb embeddings E_v with the VLA, which integrates global context from the feature extractor, a pretrained VLM, and the spatial information of the verb boxes. The V-HOI Bridge links the HOI and verb label predictions for enhanced interaction learning.
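To make the disentangled decoding flow concrete, the following PyTorch sketch wires the components together. The module names, signatures, and query dimensions are illustrative assumptions, not the paper's released implementation; the asmbr helper is sketched in the ASMBR example further below.

```python
import torch
import torch.nn as nn

class SOVDecoding(nn.Module):
    """Illustrative disentangled SOV decoding flow (hypothetical wiring)."""

    def __init__(self, so_decoder, verb_decoder, hidden_dim=256, num_queries=64):
        super().__init__()
        self.so_decoder = so_decoder      # subject-object decoder stack
        self.verb_decoder = verb_decoder  # verb decoder with S-O attention
        # Label queries Q_ov, initialized from STG-trained label embeddings.
        self.q_ov = nn.Embedding(num_queries, hidden_dim)

    def forward(self, memory, anchors):
        # 1) Subject-object detection: refine the anchor boxes B_s and B_o.
        emb_s, emb_o, box_s, box_o = self.so_decoder(self.q_ov.weight, memory, anchors)
        # 2) Verb boxes B_v derived from the refined boxes (see the ASMBR sketch).
        box_v = asmbr(box_s, box_o)
        # 3) Verb recognition conditioned on both embeddings and B_v.
        emb_v = self.verb_decoder(emb_s, emb_o, box_v, memory)
        return box_s, box_o, box_v, emb_v
```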
Subject and Object Detection.  We leverage a hierarchical backbone and a deformable transformer encoder [2] as the feature extractor.

Verb Decoder with S-O attention module.  We introduce S-O attention to fuse the subject and object embeddings in a multi-layer manner.
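As one way to realize this fusion, the sketch below lets a verb query cross-attend to the subject and object embeddings in each layer; the exact layer design (residual form, normalization) is an assumption for illustration, not the paper's definition.

```python
import torch
import torch.nn as nn

class SOAttentionLayer(nn.Module):
    """One illustrative S-O fusion layer: the verb query attends to the
    subject and object embeddings, and the results are merged residually."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_o = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_v, emb_s, emb_o):
        # Fuse subject-side and object-side context into the verb embedding.
        fused_s, _ = self.attn_s(q_v, emb_s, emb_s)
        fused_o, _ = self.attn_o(q_v, emb_o, emb_o)
        return self.norm(q_v + fused_s + fused_o)
```

Stacking several such layers gives the multi-layer fusion described above.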
Verb box represented by ASMBR.  To constrain the verb feature extraction with positional information in the verb decoder, as shown in Fig. 1, we introduce a novel verb box, the Adaptive Shifted Minimum Bounding Rectangle (ASMBR), as the representation of the interaction region.
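A minimal sketch of such a verb box is given below: the minimum bounding rectangle (MBR) of the subject and object boxes, with its center adaptively shifted. The specific shift rule here is an assumption for illustration only; see the full paper for the exact ASMBR definition.

```python
import torch

def asmbr(box_s: torch.Tensor, box_o: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Illustrative verb box in (cx, cy, w, h) format: the MBR of the subject
    and object boxes, with its center pulled toward the midpoint of the two
    box centers (the shift rule is a hypothetical stand-in, not the paper's)."""
    def to_xyxy(b):
        cx, cy, w, h = b.unbind(-1)
        return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

    s, o = to_xyxy(box_s), to_xyxy(box_o)
    x1 = torch.minimum(s[..., 0], o[..., 0])
    y1 = torch.minimum(s[..., 1], o[..., 1])
    x2 = torch.maximum(s[..., 2], o[..., 2])
    y2 = torch.maximum(s[..., 3], o[..., 3])
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Adaptive shift: move the MBR center toward the subject-object midpoint.
    mid_x = (box_s[..., 0] + box_o[..., 0]) / 2
    mid_y = (box_s[..., 1] + box_o[..., 1]) / 2
    cx = cx + alpha * (mid_x - cx)
    cy = cy + alpha * (mid_y - cy)
    return torch.stack([cx, cy, w, h], dim=-1)
```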
Vision Advisor.  We leverage the visual encoder and Q-Former of BLIP-2 [1] to extract the image-level features.
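A minimal sketch of this feature extraction, assuming the HuggingFace transformers port of BLIP-2 and the blip2-opt-2.7b checkpoint (both assumptions, not the paper's released pipeline):

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2Model

# Frozen BLIP-2 vision encoder + Q-Former as the image-level feature source.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")
model.eval()

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    qformer_out = model.get_qformer_features(pixel_values=inputs.pixel_values)
# (1, num_query_tokens, hidden) features that would feed the VLA decoder.
image_feats = qformer_out.last_hidden_state
```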
Verb-HOI Bridge.  For our Verb-HOI Bridge (which also serves as the language advisor), we predict the HOI classes in a two-step manner to fill the gap between verb recognition and HOI recognition.
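One simple two-step scheme, sketched below under assumed tensor layouts: verb and object scores are predicted first, then composed into HOI-class scores through a fixed verb-object-to-HOI index map. The bridge in the paper also injects language-advisor knowledge, which is omitted here.

```python
import torch

def predict_hoi(verb_logits, obj_logits, verb_obj_to_hoi):
    """Two-step HOI prediction sketch (hypothetical tensor layout):
    score verbs and objects, then gather the valid HOI triplet classes."""
    verb_p = verb_logits.sigmoid()    # (N, num_verbs), multi-label verbs
    obj_p = obj_logits.softmax(-1)    # (N, num_objects)
    # Pairwise verb-object scores, then index the valid HOI combinations.
    pair = verb_p.unsqueeze(-1) * obj_p.unsqueeze(-2)  # (N, num_verbs, num_objects)
    v_idx, o_idx = verb_obj_to_hoi                     # index tensors of valid pairs
    return pair[..., v_idx, o_idx]                     # (N, num_hoi_classes)
```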
Figure 2: Comparison of the training convergence curves of the state-of-the-art methods on the HICO-DET dataset.

3  Results

In Fig. 2, we illustrate the training convergence of our method and of recent transformer-based methods. The results show that our STG strategy effectively accelerates the training convergence before the learning rate drops and ultimately improves the final performance.

4  Conclusions

We propose SOV-STG-VLA, a novel one-stage HOID framework with SOV decoding for target-specific processing and STG denoising for efficient training. The VLA aligns VLM knowledge with the verb embeddings, while the V-HOI Bridge enhances verb and HOI prediction. Our framework achieves SOTA performance with fast convergence and shows potential for broader vision-language tasks.

References

[1] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.

[2] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2020.