
UEC Int’l Mini-Conference No.53







represent it as a skeleton graph to model the pitcher’s body movements with a Spatial-Temporal Graph Convolutional Network (ST-GCN) [6]. ST-GCN performs graph and temporal convolutions over the pitcher’s skeleton graph across frames, enabling the detection of subtle differences for classifying six pitch types: changeup, curveball, fastball, knuckle-curve, sinker, and slider. We evaluate our method on three classification tasks using the MLB-YouTube dataset [7]: a comprehensive six-class pitch type classification and two targeted binary classification tasks. The binary tasks distinguish fastballs vs. non-fastballs and fast vs. slow pitches, both of which are critical to evaluating pitcher performance in baseball analytics. Recognizing fastballs separately helps teams understand a pitcher’s reliance on velocity, while distinguishing between fast and slow pitches aids in evaluating a pitcher’s ability to disrupt timing and deceive batters [4], [8]–[11]. Our approach outperforms existing methods, such as 3D CNNs, that rely on full-scene video information.

II. METHODOLOGY

Inspired by recent advancements in body movement analysis, we propose to perform pose estimation with OpenPose [5] combined with spatial-temporal modeling using ST-GCN [6] to classify baseball pitches by focusing solely on the pitcher’s body mechanics. Our approach begins by isolating the pitcher’s pose from all detected bodies. The joint coordinates of the isolated pitcher are then represented as a skeleton graph, which serves as input to the ST-GCN model, capturing both the spatial structure of joints within a single frame and the temporal dynamics across frames to classify pitch types from the pitcher’s body movements, as illustrated in Figure 1.

A. Pitcher’s Pose Isolation

The isolation of the pitcher’s pose begins by applying OpenPose to all frames of each pitch video clip, producing one pose per detected person in every frame. To ensure continuity, each detected pose is assigned a pose ID that persists across all frames of the clip. This is achieved by matching poses in the current frame with those from previous frames using a combined metric that considers both the mass center, calculated as the average position of visible keypoints, and the similarity of keypoint configurations, measured by the average Euclidean distance between corresponding keypoints. A new identifier is generated when the combined metric exceeds a predefined threshold, ensuring consistent tracking across frames.

Once consistent pose IDs are assigned, the pitcher is identified by analyzing the average vertical position (y-coordinate) of the mass center and the average confidence score of the keypoints across all frames. Priority is given to the pose with the lowest mass center, reflecting the pitcher’s typical position in the frame given the consistent point of view of official broadcast videos. When multiple poses are similarly positioned, such as a runner on second base or an umpire, the pose with the highest average confidence score is selected.

B. Spatial-Temporal Graph Convolutional Network (ST-GCN)

We convert the pitcher’s keypoints into a structured skeleton graph for each frame, following the methodology proposed by Yan et al. [6]. In this graph, nodes represent joint locations, while edges depict the physical connections between them, effectively capturing the spatial relationships between joints. These skeleton graphs are then used as input to an ST-GCN [6] to model both the spatial structure and the temporal evolution of the pitcher’s movements. The ST-GCN architecture requires the complete graph sequence from each clip to capture temporal dependencies across the entire pitching motion. ST-GCN processes these sequences using spatial and temporal modules. The spatial module applies graph convolutions to each frame, leveraging an adjacency matrix that encodes the connections between joints. This matrix captures inter-joint relationships by defining the skeleton’s connectivity, enabling the modeling of both concentric and eccentric movements, which are essential for capturing the dynamic mechanics of pitching. The temporal module applies 1D convolutions along the temporal axis with a specified stride to capture motion dynamics between consecutive frames. For our implementation, we use the original ST-GCN architecture, modifying only the final classification layer to produce six outputs corresponding to the pitch types we aim to classify.
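The pose-ID matching and pitcher-selection procedure described in Section II-A can be sketched as follows. The threshold value, the equal weighting of the two distance terms, the keypoint layout (x, y, confidence), and the near-tie bucketing in `select_pitcher` are illustrative assumptions, not values taken from our implementation:

```python
import numpy as np

# Each pose: array of shape (K, 3) -> (x, y, confidence); invisible keypoints
# have confidence 0. THRESHOLD and the equal weighting of the two distance
# terms are assumed values for illustration only.
THRESHOLD = 50.0

def mass_center(pose):
    """Average position of the visible keypoints."""
    visible = pose[pose[:, 2] > 0]
    return visible[:, :2].mean(axis=0)

def combined_metric(pose_a, pose_b):
    """Mass-center distance plus the average Euclidean distance
    between corresponding visible keypoints."""
    center_dist = np.linalg.norm(mass_center(pose_a) - mass_center(pose_b))
    both = (pose_a[:, 2] > 0) & (pose_b[:, 2] > 0)
    keypoint_dist = np.linalg.norm(pose_a[both, :2] - pose_b[both, :2], axis=1).mean()
    return center_dist + keypoint_dist

def assign_ids(prev_tracks, poses):
    """Greedily match each detected pose to the closest previous track;
    open a new ID when the combined metric exceeds the threshold."""
    assignments = {}
    next_id = max(prev_tracks, default=-1) + 1
    for pose in poses:
        scores = {pid: combined_metric(prev, pose) for pid, prev in prev_tracks.items()}
        best = min(scores, key=scores.get) if scores else None
        if best is not None and scores[best] <= THRESHOLD:
            assignments[best] = pose
        else:
            assignments[next_id] = pose
            next_id += 1
    return assignments

def select_pitcher(tracks):
    """tracks: {pose_id: list of (K, 3) arrays across frames}.
    Prefer the lowest mass center (largest y in image coordinates);
    break near-ties by the highest average keypoint confidence.
    Bucketing y to 10 px is an assumed way of defining 'similarly positioned'."""
    def key(pid):
        poses = tracks[pid]
        avg_y = np.mean([mass_center(p)[1] for p in poses])
        avg_conf = np.mean([p[:, 2].mean() for p in poses])
        return (round(avg_y, -1), avg_conf)
    return max(tracks, key=key)
```

A production tracker would also resolve conflicts when two detections match the same previous track; the greedy loop above keeps the sketch short.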
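The core spatial operation of Section II-B, a graph convolution over the skeleton's adjacency matrix, can be illustrated in a few lines of NumPy. The 5-joint toy skeleton, random weights, and averaging temporal window below are assumptions for illustration; the actual model follows Yan et al.'s ST-GCN with the full OpenPose skeleton and learned 1D temporal convolutions:

```python
import numpy as np

# Toy 5-joint skeleton (assumed; OpenPose provides many more joints):
# 0: head, 1: neck, 2: right shoulder, 3: right elbow, 4: right wrist.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
num_joints = 5

# Adjacency matrix encoding the skeleton's connectivity, with self-loops
# so each joint also aggregates its own features.
A = np.eye(num_joints)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Row-normalize so each joint averages over itself and its neighbors.
A_norm = A / A.sum(axis=1, keepdims=True)

def spatial_graph_conv(X, W):
    """One spatial graph convolution: aggregate neighbor features via the
    adjacency matrix, then apply a learned linear map and a ReLU.
    X: (frames, joints, C_in), W: (C_in, C_out)."""
    return np.maximum(A_norm @ X @ W, 0.0)

def temporal_conv(X, kernel=3):
    """Temporal-module stand-in: average over a sliding window of frames
    per joint/channel (the real ST-GCN learns these 1D convolutions)."""
    T = X.shape[0]
    return np.stack([X[max(0, t - kernel // 2):t + kernel // 2 + 1].mean(axis=0)
                     for t in range(T)])

# A 30-frame clip of 2-D joint coordinates, mapped to 8 feature channels.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, num_joints, 2))   # (frames, joints, x/y)
W = rng.standard_normal((2, 8))
H = temporal_conv(spatial_graph_conv(X, W))
print(H.shape)  # (30, 5, 8)
```

Stacking alternating spatial and temporal layers in this way lets joint features mix first across the skeleton within a frame and then across neighboring frames, which is what allows the network to pick up the subtle kinematic differences between pitch types.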