represent it as a skeleton graph to model the pitcher's body movements with a Spatial-Temporal Graph Convolutional Network (ST-GCN) [6]. ST-GCN performs graph and temporal convolutions over the pitcher's skeleton graph across frames, enabling the detection of subtle differences for classifying six pitch types: changeup, curveball, fastball, knuckle-curve, sinker, and slider. We evaluate our method on three classification tasks using the MLB-YouTube dataset [7]: a comprehensive six-class pitch type classification, as well as two targeted binary classification tasks. These binary tasks distinguish fastballs from non-fastballs and fast from slow pitches, both of which are critical in pitcher performance evaluation in baseball analytics. Recognizing fastballs separately helps teams understand a pitcher's reliance on velocity, while distinguishing between fast and slow pitches aids in evaluating a pitcher's ability to disrupt timing and deceive batters [4], [8]–[11]. Our approach demonstrates superior performance compared to existing methods, such as 3D CNNs, which rely on full-scene video information.
II. METHODOLOGY
Inspired by recent advancements in body movement analysis, we propose to perform pose estimation with OpenPose [5] combined with spatial-temporal modeling using ST-GCN [6] to classify baseball pitches by focusing solely on the pitcher's body mechanics. Our approach begins by isolating the pitcher's pose from all detected bodies. The joint coordinates of the isolated pitcher are then represented as a skeleton graph, which serves as input to the ST-GCN model; the network captures the spatial structure of joints within a single frame and the temporal dynamics across frames to classify pitch types based on the pitcher's body movements, as illustrated in Figure 1.
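This pipeline can be summarized in a short sketch. It is an illustrative outline rather than our released code: openpose.detect, isolate_pitcher, and stgcn are hypothetical placeholders for the components detailed in Sections II-A and II-B.

```python
import numpy as np

def classify_pitch(frames, openpose, stgcn, pitch_types):
    """Outline of the pipeline for one pitch clip (a list of video frames)."""
    # 1) Pose estimation: per frame, a list of (J, 3) arrays of (x, y, conf).
    detections = [openpose.detect(frame) for frame in frames]
    # 2) Isolate the pitcher from all detected bodies (Section II-A).
    pitcher = isolate_pitcher(detections)                   # (T, J, 3)
    # 3) Arrange the sequence in the tensor layout ST-GCN expects:
    #    (channels, frames, joints, persons), with a single person.
    x = np.transpose(pitcher, (2, 0, 1))[..., np.newaxis]   # (3, T, J, 1)
    # 4) Spatial-temporal graph convolutions and classification (Section II-B).
    logits = stgcn(x)
    return pitch_types[int(np.argmax(logits))]
```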
A. Pitcher's Pose Isolation

The isolation of the pitcher's pose begins by applying OpenPose to all frames of the pitch video clips, where a pose is detected for every person, yielding multiple poses per frame. To ensure continuity, each detected pose is assigned a pose ID across all frames of the clip. This is achieved by matching poses in the current frame with those from previous frames using a combined metric that considers both the mass center, calculated as the average position of visible keypoints, and the similarity of keypoint configurations, measured by the average Euclidean distance between corresponding keypoints. A new identifier is generated when the combined metric exceeds a predefined threshold, ensuring consistent tracking across frames.
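A minimal sketch of this matching step follows. The equal weighting of the two distance terms and the 50-pixel threshold are illustrative assumptions; only a combined metric with a predefined threshold is specified above.

```python
import numpy as np

def mass_center(pose):
    """Mean (x, y) of visible keypoints; pose is a (J, 3) array of (x, y, conf)."""
    visible = pose[pose[:, 2] > 0]
    return visible[:, :2].mean(axis=0)

def combined_distance(pose_a, pose_b, w_center=0.5, w_config=0.5):
    """Combined metric: mass-center distance plus the average Euclidean
    distance between corresponding keypoints visible in both poses."""
    center_d = np.linalg.norm(mass_center(pose_a) - mass_center(pose_b))
    both = (pose_a[:, 2] > 0) & (pose_b[:, 2] > 0)
    config_d = np.linalg.norm(pose_a[both, :2] - pose_b[both, :2], axis=1).mean()
    return w_center * center_d + w_config * config_d

def assign_pose_ids(poses_per_frame, threshold=50.0):
    """Greedy frame-to-frame matching. A new ID is spawned whenever the
    best combined distance to every existing track exceeds the threshold."""
    last_pose = {}                        # pose ID -> most recently seen pose
    next_id, ids_per_frame = 0, []
    for poses in poses_per_frame:         # poses: list of (J, 3) arrays
        frame_ids = []
        for pose in poses:
            scored = sorted((combined_distance(pose, prev), pid)
                            for pid, prev in last_pose.items())
            if scored and scored[0][0] <= threshold:
                pid = scored[0][1]        # continue the closest track
            else:
                pid, next_id = next_id, next_id + 1
            last_pose[pid] = pose
            frame_ids.append(pid)
        ids_per_frame.append(frame_ids)
    return ids_per_frame
```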
Once consistent pose IDs are assigned, the pitcher is identified by analyzing the average vertical position (y-coordinate) of the mass center and the average confidence score of keypoints across all frames. Priority is given to the pose with the lowest mass center, reflecting the pitcher's typical position in the frame due to the consistent point of view of official broadcast videos. In cases where multiple poses are similarly positioned, such as a runner on second base or an umpire, the pose with the highest average confidence score is selected.
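The selection step can be sketched as follows, reusing mass_center from the previous sketch. Image y-coordinates grow downward, so the lowest pose in the frame has the largest mean y; the 20-pixel tolerance for deciding which tracks count as similarly positioned is an illustrative assumption.

```python
def select_pitcher(tracks, y_tolerance=20.0):
    """tracks: pose ID -> list of (J, 3) poses across frames.
    Returns the ID judged to be the pitcher."""
    mean_y = {pid: np.mean([mass_center(p)[1] for p in poses])
              for pid, poses in tracks.items()}
    mean_conf = {pid: np.mean([p[p[:, 2] > 0, 2].mean() for p in poses])
                 for pid, poses in tracks.items()}
    lowest = max(mean_y.values())        # largest y = lowest in the frame
    # Among tracks near the bottom of the frame (pitcher, umpire, runner),
    # keep the one with the highest average keypoint confidence.
    candidates = [pid for pid in tracks if lowest - mean_y[pid] <= y_tolerance]
    return max(candidates, key=mean_conf.get)
```

Together with assign_pose_ids, this realizes the isolate_pitcher step of the pipeline sketch above.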
B. Spatial-Temporal Graph Convolution Network (ST-GCN)

We convert the pitcher's keypoints into a structured skeleton graph for each frame, following the methodology proposed by Yan et al. [6]. In this graph, nodes represent joint locations, while edges depict the physical connections between them, effectively capturing the spatial relationships between joints. These skeleton graphs are then used as input to an ST-GCN [6] to model both the spatial structure and the temporal evolution of the pitcher's movements. The ST-GCN architecture requires the complete graph sequence of frames from each clip to capture temporal dependencies across the entire pitching motion. ST-GCN processes these sequences using spatial and temporal modules. The spatial module applies graph convolutions to each frame, leveraging an adjacency matrix that encodes the connections between joints. This matrix captures inter-joint relationships by defining the skeleton's connectivity, enabling the modeling of both concentric and eccentric movements, which are essential for capturing the dynamic mechanics of pitching. The temporal module applies 1D convolutions along the temporal axis with a specified stride to capture motion dynamics between consecutive frames. For our implementation, we used the original ST-GCN architecture, modifying only the final classification layer to produce six outputs corresponding to the pitch types we aim to classify.
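A minimal sketch of this setup, assuming the reference PyTorch implementation of ST-GCN by Yan et al. (the module path and argument names below come from the original repository and vary between forks); the 120-frame clip length and the batch size are illustrative.

```python
import torch
# Assumes the reference ST-GCN implementation is importable as in the
# original repository; adjust the path for your fork.
from net.st_gcn import Model

NUM_PITCH_TYPES = 6  # changeup, curveball, fastball, knuckle-curve, sinker, slider

# 18-joint OpenPose skeleton layout with (x, y, confidence) as the three
# input channels. Only num_class changes relative to the stock architecture,
# which swaps in a final classification layer with six outputs.
model = Model(in_channels=3,
              num_class=NUM_PITCH_TYPES,
              graph_args={'layout': 'openpose', 'strategy': 'spatial'},
              edge_importance_weighting=True)

# Input tensor: (batch, channels, frames, joints, persons); the isolated
# pitcher is the only person, so the last dimension is 1.
clips = torch.randn(8, 3, 120, 18, 1)
logits = model(clips)                 # -> (8, 6) pitch-type scores
```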