
16 UEC Int’l Mini-Conference No.54

Table 3: Quantitative evaluation results of ablation studies. Effectiveness of different modules. CA and D represent Cross-attention and Discriminator, respectively. The first row represents the Baseline.

            FID↓     LPIPS↓   L1↓      RMSE↓    SSIM↑
baseline    0.5554   0.2918   0.1579   0.2430   0.7680
CA          0.2372   0.1647   0.1613   0.2432   0.7714
CA & D      0.0877   0.0909   0.1031   0.1768   0.8345

particularly in maintaining the consistency of line thickness and stroke endpoints.

The full model, which includes both cross-attention and a discriminator (CA & D), produces the most visually pleasing and structurally accurate results. With the discriminator, the model learns to produce outputs that not only adhere to the reference style but are also comparable to real font samples in overall distribution. This is evident in the final column before the target, where the generated images closely resemble the target fonts in terms of shape, stroke consistency, and spatial arrangement. The characters appear more balanced, smooth, and clean compared to the previous versions.

Overall, this ablation study clearly highlights the contributions of each component in our model. The cross-attention mechanism is vital for transferring the stylistic properties of a reference, whereas the discriminator makes the output realistic and consistent. These results show that our proposed modules significantly improve the ability to generate high-quality Bangla fonts, especially for complex characters involving ligatures and diacritics. This analysis confirms that both elements are necessary to create outputs that are structurally correct and stylistically true to the reference font.

5 Discussion

Our results justify the value of each module that we presented. Cross-attention improves the transfer of fine strokes, whereas the discriminator enhances the sharpness and fine detail of the generated Bengali fonts. However, our model is trained on consistently rendered fonts, which are digital, clean, uniform, and adhere to typographic rules. Because the model is based on rendered fonts, its applicability to handwriting or artistic scripts is limited. It also relies on a fixed set of ligature templates. Control over style is poor, and the model demands a lot of computational resources. In the future, we will add handwritten glyphs, allow dynamic ligature generation, and create style vectors that are editable by the user. While we report perceptual and structural metrics such as FID, LPIPS, and SSIM, we acknowledge that OCR-based evaluation (e.g., CER) would better reflect real-world usability. Due to time and resource constraints, we could not include such an evaluation in this version, and we compared only against FontDiffuser, the most relevant diffusion-based baseline for evaluating our improvements. We plan to explore CER-based benchmarking in future work for deeper task-specific assessment.

Fig. 5 shows that BengaliDiff struggles when generating unseen fonts with unseen characters (UFUC). Although the model produces the content of the character correctly, most of the time it fails to reproduce the style of the referenced font. This implies that our model retains the correct letter but alters the font style, which is not the objective. For example, the strokes, shapes, and general appearance of the characters are not perfectly aligned with the reference font. To enhance BengaliDiff, it may be possible to train on more diverse fonts (handwritten or artistic fonts) to improve generalization. In addition, adding more flexible style modeling (dynamic ligature support or style vectors that can be edited by the user) could allow the model to capture reference styles more easily. The design of lighter-weight architectures or explicit style-contrast mechanisms represents a future research direction to further improve performance on unseen fonts.

6 Conclusion

In this paper, we present a novel framework for diffusion-based Bengali font generation using dual aggregation cross-attention and a patch-level CNN-based learning discrimi-
