
UEC Int'l Mini-Conference No.54

                                Figure 1: The framework of the base method FontDiffuser.



ture includes a content encoder E_c to extract structural features from a source glyph x_c, a style encoder E_s to represent the font style x_s, and a conditional U-Net-based diffusion model that predicts clean glyphs by gradually denoising random noise. The U-Net carries two additional modules: the Multi-Scale Content Aggregation (MCA) module injects multi-resolution content features that preserve structural detail, and the Reference-Structure Interaction (RSI) module employs deformable convolution to align the spatial features of the reference and source glyphs. FontDiffuser follows a two-stage training policy that gradually learns to reconstruct correct structures and to enforce style uniformity. In the first phase, only the diffusion model is trained, without the Style Contrastive Refinement (SCR) module; this stage is devoted to reconstructing glyphs from feature and structural losses. The total loss is defined as follows:

    L_total = L_MSE + λ_cp L_cp + λ_off L_offset                (1)

    L_MSE = ‖ϵ − ϵ_θ(x_t, t, x_c, x_s)‖²                        (2)

    L_cp = Σ_{l=1}^{L} ‖VGG_l(x_0) − VGG_l(x_target)‖           (3)

    L_offset = mean(‖δ_offset‖)                                 (4)

Three sub-components make up this loss: a Mean Square Error (MSE) loss, a content perceptual loss, and a deformation offset loss. The MSE loss measures the difference between the predicted noise ϵ_θ and the true noise ϵ; it encourages the model to predict noise accurately during reverse diffusion and thus to generate images that recover the original image. The content perceptual loss uses a VGG network to measure the similarity between the generated image x_0 and the target image x_target in terms of deep semantic features. The deformation offset loss uses deformable convolutional networks (DCN) to constrain the offset of the content features, δ_offset, preventing the network from producing excessive offsets or unstable behavior during generation.

In Phase 2, the SCR module is switched on to direct the model to learn style imitation at both the global and local levels. The style features of every style image are obtained with a style extractor. This phase introduces a new style contrastive loss term (L_sc), which enforces similarity among glyphs from the same font style and separation
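The Phase-1 objective in Eqs. (1)–(4) can be sketched numerically. The following is a minimal NumPy illustration: the tensors are toy stand-ins for the model's outputs, and the weights λ_cp and λ_off are hypothetical, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the quantities in Eqs. (1)-(4); shapes are illustrative.
eps_true = rng.standard_normal((4, 3, 32, 32))   # true noise epsilon
eps_pred = rng.standard_normal((4, 3, 32, 32))   # predicted noise epsilon_theta
offsets  = rng.standard_normal((4, 2, 32, 32))   # DCN deformation offsets delta_offset

# Stand-ins for VGG feature maps of x_0 and x_target at L = 3 layers.
feats_gen    = [rng.standard_normal((4, 64, 16, 16)) for _ in range(3)]
feats_target = [rng.standard_normal((4, 64, 16, 16)) for _ in range(3)]

def mse_loss(eps, eps_theta):
    # Eq. (2): mean squared noise-prediction error.
    return float(np.mean((eps - eps_theta) ** 2))

def content_perceptual_loss(f_gen, f_tgt):
    # Eq. (3): sum over VGG layers of the mean feature distance.
    return float(sum(np.mean(np.abs(g - t)) for g, t in zip(f_gen, f_tgt)))

def offset_loss(delta):
    # Eq. (4): mean magnitude of the deformation offsets.
    return float(np.mean(np.abs(delta)))

lambda_cp, lambda_off = 0.01, 0.5   # hypothetical weights, not the paper's values

# Eq. (1): weighted sum of the three sub-losses.
total = (mse_loss(eps_true, eps_pred)
         + lambda_cp * content_perceptual_loss(feats_gen, feats_target)
         + lambda_off * offset_loss(offsets))
```

In a real training loop these terms would be computed on autograd tensors so gradients flow back through the U-Net, but the arithmetic of the weighted sum is the same.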
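The style contrastive term L_sc can be illustrated with a small InfoNCE-style sketch: embeddings of glyphs that share a font label are treated as positives and all other embeddings as negatives. The embedding size, the temperature tau, and this exact formulation are assumptions for illustration, not the actual SCR implementation.

```python
import numpy as np

def style_contrastive_loss(embeds, labels, tau=0.1):
    # InfoNCE-style sketch of L_sc: glyph embeddings with the same font label
    # are pulled together; embeddings of other fonts act as negatives.
    z = embeds / np.linalg.norm(embeds, axis=1, keepdims=True)  # unit-normalize
    sim = z @ z.T / tau                                         # scaled cosine similarity
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        log_denom = np.log(np.exp(sim[i, others]).sum())
        for j in others:
            if labels[j] == labels[i]:          # positive pair: same font style
                loss += log_denom - sim[i, j]   # -log(exp(s_ij) / sum_k exp(s_ik))
                count += 1
    return float(loss / count)

rng = np.random.default_rng(1)
style_embeds = rng.standard_normal((6, 8))   # 6 glyphs, 8-dim style features
font_labels = [0, 0, 1, 1, 2, 2]             # two glyphs per font
loss = style_contrastive_loss(style_embeds, font_labels)
```

When glyphs of the same font map to identical embeddings and different fonts are orthogonal, this loss approaches zero, which is the behavior the SCR phase is meant to encourage.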