Page 23 - 2025S

16                                                                UEC Int’l Mini-Conference No.54







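As a point of reference for the pixel-level columns of Table 3, the following is a minimal sketch of how L1 and RMSE are commonly computed between a generated glyph image and its target. The evaluation code behind the table is not given in this section, so the function name `l1_and_rmse`, the nested-list image representation, and the assumption that pixels are grayscale intensities scaled to [0, 1] are all illustrative choices rather than the authors' implementation.

```python
import math


def l1_and_rmse(generated, target):
    """Return (mean absolute error, root-mean-square error) for two images.

    Assumption: each image is a list of rows of floats in [0, 1], and both
    images have exactly the same shape.
    """
    if len(generated) != len(target) or any(
        len(g_row) != len(t_row) for g_row, t_row in zip(generated, target)
    ):
        raise ValueError("images must have the same shape")

    # Flatten the per-pixel differences into one list.
    diffs = [
        g - t
        for g_row, t_row in zip(generated, target)
        for g, t in zip(g_row, t_row)
    ]
    if not diffs:
        raise ValueError("images must not be empty")

    l1 = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return l1, rmse


# Toy 2x2 "glyphs" (hypothetical data, for illustration only).
generated = [[0.0, 0.5], [1.0, 1.0]]
target = [[0.0, 1.0], [1.0, 0.0]]
l1, rmse = l1_and_rmse(generated, target)
print(f"L1={l1:.4f} RMSE={rmse:.4f}")
```

Lower values mean closer pixel agreement, consistent with the ↓ markers in Table 3. RMSE squares each difference before averaging, so it penalises a few large local errors more heavily than L1 does.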
Table 3: Quantitative evaluation results of the ablation studies, showing the
effectiveness of different modules. CA and D denote cross-attention and
discriminator, respectively. The first row is the baseline.

            FID↓    LPIPS↓  L1↓     RMSE↓   SSIM↑
baseline    0.5554  0.2918  0.1579  0.2430  0.7680
CA          0.2372  0.1647  0.1613  0.2432  0.7714
CA & D      0.0877  0.0909  0.1031  0.1768  0.8345

particularly in maintaining the consistency of line thickness and stroke
endpoints.

The full model, which includes both cross-attention and a discriminator
(CA & D), produces the most visually pleasing and structurally accurate
results. With the discriminator, the model learns to produce outputs that not
only adhere to the reference style but are also close to real font samples in
overall distribution. This is evident in the final column before the target,
where the generated images closely resemble the target fonts in shape, stroke
consistency, and spatial arrangement. The characters appear more balanced,
smooth, and clean than in the previous versions.

Overall, this ablation study highlights the contribution of each component of
our model. The cross-attention mechanism is vital for transferring the
stylistic properties of a reference, whereas the discriminator makes the
output realistic and consistent. These results show that our proposed modules
substantially improve the ability to generate high-quality Bangla fonts (in
Table 3, FID falls from 0.5554 to 0.0877 and SSIM rises from 0.7680 to 0.8345
for the full model), especially for complex characters involving ligatures
and diacritics. This analysis confirms that both elements are necessary to
create outputs that are structurally correct and stylistically true to the
reference font.

5 Discussion

Our results justify the value of each module that we presented.
Cross-attention enhances fine stroke transfer, whereas the discriminator
enhances the sharpness and fine detail of the generated Bengali fonts.
However, our model is trained on consistently rendered fonts, which are
digital, clean, uniform, and adhere to typographic rules. Because the model
is based on rendered fonts, its applicability to handwriting or artistic
scripts is limited. It also relies on a fixed set of ligature templates.
Style control is limited, and the model demands substantial computational
resources. In the future, we will add handwritten glyphs, allow dynamic
ligature generation, and create style vectors that the user can edit. While
we report perceptual and structural metrics such as FID, LPIPS, and SSIM, we
acknowledge that OCR-based evaluation (e.g., CER) would better reflect
real-world usability. Due to time and resource constraints, we could not
include such an evaluation in this version, and we compared only with
FontDiffuser, the most relevant diffusion-based baseline for evaluating our
improvements. We plan to explore CER-based benchmarking in future work for
deeper task-specific assessment.

Fig. 5 shows that BengaliDiff has difficulty generating unseen fonts with
unseen characters (UFUC). Although the model produces the content of the
character correctly, most of the time it fails to reproduce the style of the
reference font. This implies that our model retains the correct letter but
alters the font style, which is not the objective. For example, the strokes,
shapes, and general appearance of the characters are not well aligned with
the reference font. To improve BengaliDiff, it may be possible to train on
more diverse fonts (handwritten or artistic fonts) to improve generalization.
In addition, more flexible style modeling (dynamic ligature support or
user-editable style vectors) could allow the model to capture reference
styles more easily. Lighter-weight architectures and explicit style-contrast
mechanisms are further research directions for improving performance on
unseen fonts.

6 Conclusion

In this paper, we present a novel framework for diffusion-based generation of
Bengali fonts using a dual aggregation cross-attention and a patch-level
CNN-based learning discrimi-