UEC Int’l Mini-Conference No.54
from other styles. The updated loss becomes:

L_total = L_MSE + λ_cp L_cp + λ_off L_offset + λ_sc L_sc    (5)

Together, the two-phase training approach enables FontDiffuser to produce coherent and stylistically rich font sets, making it an effective base for further adaptation to complex scripts such as Bengali. However, unlike Chinese and English, Bengali text has a special structure, such as splicing, which makes it difficult for existing methods to generate clear text. Fig. 2 shows the result of generating Bengali fonts with FontDiffuser. The result shows that FontDiffuser is unable to generate Bengali text correctly: there are large distortions in the font, and it also fails to apply the stylistic features to the text.

Figure 2: The results generated by FontDiffuser using a Bengali font as the input image.

3.2 Cross-Attention Content Fusion

As shown in Fig. 3, in our model of Bengali font generation we enhance the building block of the U-Net architecture by introducing a dual aggregation Cross-Attention Content Fusion (CACF) module at every level of the encoder and decoder networks. This module helps the model better integrate the content of a source character with the style of a reference font. Under this scheme, the content features of the source glyph are employed as queries, while the style features of the reference glyph are employed as keys and values. This enables the model to use the most beneficial details of the style when producing a new glyph. Our dual aggregation cross-attention is designed to concentrate on both the features (channel information) and the locations of objects in the image (spatial information). By digesting these two kinds of data, the model can better comprehend subtle brush patterns, curves, and the overall look of the style. The attention system selects the most crucial style elements and gradually incorporates them into the content features. Placing this attention mechanism in the down-blocks and up-blocks of the U-Net of the diffusion model ensures that the style is applied throughout the whole network. This yields better and cleaner characters from the model, particularly for composite Bangla characters with matras (horizontal shapes), loops, and compound characters. The model is trained to pair style with content more effectively, and is thus able, even from a few examples, to create high-quality font images.

Overall, this dual aggregation cross-attention content fusion module plays a key role in making our model better at transferring style and producing Bengali fonts that are both visually appealing and structurally correct.

3.3 Discriminator for Adversarial Supervision

We add a patch-level discriminator based on a CNN; its purpose is to provide a direction in which the model is trained. The discriminator validates whether a generated glyph is real or fake: we crop tiny pieces (patches) of the picture and judge whether they are produced from real fonts or generated. The discriminator and the generator (our diffusion model) are trained in an adversarial fashion: the generator attempts to produce glyphs that will fool the discriminator, and the discriminator attempts to pick out such fakes. This setup helps the generator produce sharper and more realistic glyphs, particularly in places where details are fine, such as thin strokes, loops, or ornamentation ends. Including a discriminator enhances the clarity and sharpness of the final output compared with using only the diffusion loss during training. FontDiffuser's training method takes advantage of the diffusion loss alone to fine-tune the generated glyphs over numerous iterations; sometimes it generates less sharp images, even though the structure works well. However,
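The patch-level adversarial supervision of Sec. 3.3 can be sketched in a few lines. This is a minimal illustration, assuming a PatchGAN-style hinge formulation over a map of per-patch logits; the exact loss form and the function names below are our assumptions, not details taken from the paper:

```python
import numpy as np

def discriminator_hinge_loss(real_logits, fake_logits):
    """Hinge loss for a patch-level discriminator (assumed formulation).

    Each entry of a logit map scores one image patch; the discriminator
    is pushed toward >= +1 on real-font patches and <= -1 on generated ones.
    """
    loss_real = np.mean(np.maximum(0.0, 1.0 - real_logits))
    loss_fake = np.mean(np.maximum(0.0, 1.0 + fake_logits))
    return loss_real + loss_fake

def generator_adversarial_loss(fake_logits):
    """The generator is rewarded when its patches are scored as real."""
    return -np.mean(fake_logits)

# Toy per-patch logit maps standing in for a CNN discriminator's output
# on a glyph image (e.g. a 4x4 grid of overlapping patches).
real = np.full((4, 4), 1.0)   # patches confidently scored "real"
fake = np.full((4, 4), -1.0)  # patches confidently scored "fake"

d_loss = discriminator_hinge_loss(real, fake)  # 0.0: perfectly separated
g_loss = generator_adversarial_loss(fake)      # 1.0: generator penalized
```

In training, the discriminator loss would update the CNN discriminator's weights, while the generator term would be added to the diffusion objective of Eq. (5); the paper does not specify the exact weighting between the two.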