Page 18 - 2025S

UEC Int’l Mini-Conference No.54                                                               11







from other styles. The updated loss becomes:

    L^{2}_{total} = L_{MSE} + \lambda^{2}_{cp} L_{cp} + \lambda^{2}_{off} L_{offset} + \lambda^{2}_{sc} L_{sc}    (5)

Together, the two-phase training approach enables FontDiffuser to produce coherent and
stylistically rich font sets, making it an effective base for further adaptation to complex
scripts such as Bengali. However, unlike Chinese and English, Bengali text has a special
structure, such as splicing, which makes it difficult for existing methods to generate clear
text. Fig. 2 shows the result of generating Bengali fonts using FontDiffuser. FontDiffuser
cannot generate the Bengali text correctly: the glyphs are heavily distorted, and the
stylistic features are not applied to the text.

Figure 2: The results generated by FontDiffuser using Bengali font as the input image.

3.2   Cross-Attention Content Fusion

As shown in Fig. 3, our Bengali font generation model builds on the U-Net architecture and
enhances its building blocks by introducing a dual aggregation Cross-Attention Content
Fusion (CACF) module at every level of the encoder and the decoder. This module helps the
model integrate the content of a source character with the style of a reference font. Under
this scheme, the content features of the source glyph serve as queries, and the style
features of the reference glyph serve as keys and values. This lets the model draw on the
most useful style details when producing a new glyph. Our dual aggregation cross-attention
is designed to attend to both the features (channel information) and the location of
objects in the image (spatial information). By processing these two kinds of information,
the model can better capture subtle brush patterns, curves, and the overall look of the
style. The attention mechanism selects the most important style elements and gradually
incorporates them into the content features.

Placing this attention mechanism in the down blocks and up blocks of the diffusion model's
U-Net ensures that the style is used throughout the whole network. This yields better and
cleaner characters, particularly for composite Bangla characters with matras (horizontal
strokes), loops, and compound characters. The model learns to pair style with content more
effectively and can therefore create high-quality font images even from only a few examples.

Overall, this dual aggregation cross-attention content fusion module plays a key role in
making our model better at transferring style and producing Bengali fonts that are both
visually appealing and structurally correct.

3.3   Discriminator for Adversarial Supervision

We add a CNN-based, patch-level discriminator whose purpose is to provide a training signal
that guides the model. The discriminator judges whether a generated glyph is real or fake.
We crop small pieces (patches) of the image, and the discriminator decides whether each
patch comes from a real font or was generated. The discriminator and the generator (our
diffusion model) are trained adversarially: the generator attempts to produce glyphs that
fool the discriminator, and the discriminator attempts to detect such fakes. This setup
helps the generator produce sharper and more realistic glyphs, particularly where details
are fine, such as thin strokes, loops, or ornamental stroke ends. Including a discriminator
improves the clarity and sharpness of the final output compared with training on the
diffusion loss alone. The FontDiffuser training method relies only on the diffusion loss to
refine the generated glyphs over many iterations. As a result, it sometimes generates less
sharp images, even though the structure is correct. However,