UEC Int'l Mini-Conference No.52
service, given that they currently support such modifications. However, this dataset can also be adapted for training local language models, which can then be integrated into the system. The fine-tuning process ensures that the model is capable of interpreting nuanced textual cues and accurately mapping them to specific odor categories, enhancing the olfactory dimension of multimedia consumption.
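The paragraph above describes fine-tuning a model to map scene text onto odor categories. As a minimal sketch of how such a dataset could be serialized, the snippet below emits one training record per line in the chat-style JSONL layout commonly used for fine-tuning; the system-prompt wording, odor labels, and example sentences are illustrative assumptions, not the paper's actual data.

```python
import json

def make_example(text, odor_label):
    """Build one hypothetical fine-tuning record: scene text in, odor class out."""
    return {
        "messages": [
            {"role": "system",
             "content": "Classify the scene description into one odor category."},
            {"role": "user", "content": text},
            {"role": "assistant", "content": odor_label},
        ]
    }

def write_dataset(pairs, path):
    """Serialize (text, odor) pairs as one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for text, odor in pairs:
            f.write(json.dumps(make_example(text, odor)) + "\n")

# Illustrative pairs only; the real dataset's classes are not reproduced here.
pairs = [
    ("She sliced a fresh lemon over the grilled fish.", "citrus"),
    ("Smoke drifted from the campfire into the tent.", "smoky"),
]
write_dataset(pairs, "odor_finetune.jsonl")
```

Each line of the resulting file is a self-contained conversation, which is the shape most hosted fine-tuning services and local instruction-tuning pipelines expect.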


            4 Model Comparison


To determine the most suitable model for our system, we conducted an extensive evaluation by benchmarking several state-of-the-art language models. We aimed to compare their performance in a contextual understanding classification task, focusing on their accuracy, cost, and response time.

[Figure 4: Accuracy plot by models.]
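The comparison just described can be sketched as a small evaluation harness that scores each model on accuracy, cost, and response time. Everything below is an assumption for illustration: the model names, the per-token prices, and the `ask_model` stub standing in for a provider's API call.

```python
import time

# Assumed per-1K-token prices for two placeholder models (not real pricing).
PRICE_PER_1K_TOKENS = {"model-a": 0.0015, "model-b": 0.0020}

def evaluate(model_name, dataset, ask_model):
    """Run one model over (text, gold_odor) pairs, recording accuracy,
    total API cost, and average response time per request."""
    correct, total_tokens, elapsed = 0, 0, 0.0
    for text, gold in dataset:
        start = time.perf_counter()
        prediction, tokens = ask_model(model_name, text)  # stubbed API call
        elapsed += time.perf_counter() - start
        total_tokens += tokens
        correct += (prediction == gold)
    n = len(dataset)
    return {
        "accuracy": correct / n,
        "cost": total_tokens / 1000 * PRICE_PER_1K_TOKENS[model_name],
        "avg_time_s": elapsed / n,
    }

# Usage with a stub that always answers "citrus" and reports 40 tokens:
def fake_ask(model, text):
    return ("citrus", 40)

stats = evaluate("model-a",
                 [("lemon zest on fish", "citrus"),
                  ("diesel fumes at the dock", "smoky")],
                 fake_ask)
```

Running the same loop once per candidate model yields directly comparable accuracy, cost, and latency figures, which is the structure of the comparison reported in the following subsections.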


4.1 Experiment Description

We selected five models from leading AI companies such as OpenAI, Anthropic, and Google. Specifically, we tested gpt-3.5-turbo and fine-tuned gpt-3.5-turbo from OpenAI; claude-3-haiku from Anthropic; and fine-tuned gemini-1.0-pro and gemini-1.5-flash from Google. Our evaluation dataset consisted of 700 text-odor pairs, with 50 samples for each odor class to ensure a balanced and fair assessment. For each model, we recorded predictions, contextual analysis, token usage, and elapsed time, enabling us to compare their accuracy, cost-efficiency, and speed.

[Figure 5: Cost plot by models.]

4.2 Results

Figure 4 illustrates the accuracy results for each model. Fine-tuned models consistently outperformed their non-fine-tuned counterparts, demonstrating the significant benefits of domain-specific training. The fine-tuned gpt-3.5-turbo and gemini-1.0-pro models achieved the highest accuracy rates, both surpassing 90%. Notably, the fine-tuned gpt-3.5-turbo model exhibited a 34% improvement over its original version, underscoring the effectiveness of fine-tuning in enhancing contextual understanding.

Figure 5 presents the average cost per API request, calculated based on token usage and pricing from each service provider. While the most advanced models offered superior accuracy, they were also more expensive. The fine-tuned gpt-3.5-turbo model emerged as a cost-effective choice, being 24% cheaper than the fine-tuned gemini-1.0-pro model at a comparable accuracy level. This cost advantage becomes more pronounced with higher usage, making gpt-3.5-turbo a more practical option for large-scale applications.

As shown in Figure 6, the fine-tuned gpt-3.5-