4    Discussion

4.1   Q-Learning Algorithm

The Q-learning algorithm used in this study follows a reinforcement learning approach, in which the algorithm learns to optimize traffic signal timings through trial and error. The key steps of the algorithm are as follows (a minimal code sketch is given after the list):

  • Observe Current State: Analyze the current traffic conditions at the intersection.

  • Choose an Action: Select an action (e.g., change the traffic signal) based on an exploration-exploitation strategy.

  • Execute the Action: Implement the chosen action and observe the result.

  • Update Q-Value: Update the Q-value based on the observed reward. If the result is positive (reduced congestion), the Q-value is increased; if the result is negative (increased congestion), the Q-value is adjusted accordingly.

  • Repeat until Convergence: Continue this process until the Q-values stabilize, indicating that the algorithm has learned the optimal signal timings.
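
To make these steps concrete, the following is a minimal sketch of a tabular Q-learning loop for signal control. The state/action encoding, the reward definition, and the environment interface (get_state, apply_action, get_reward) are illustrative assumptions rather than the authors' implementation; the update in q_update corresponds to Eq. (2) below.

# Sketch of the tabular Q-learning loop described above (illustrative only).
# The environment object `env` is a hypothetical simulator interface, not
# part of the original study.
import random
from collections import defaultdict

ALPHA = 0.1    # learning rate (alpha)
GAMMA = 0.9    # discount factor (gamma)
EPSILON = 0.1  # exploration probability

ACTIONS = [0, 1]          # e.g., 0 = keep current phase, 1 = switch phase
Q = defaultdict(float)    # Q[(state, action)] -> estimated value

def choose_action(state):
    # Epsilon-greedy exploration-exploitation strategy.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)                     # explore
    return max(ACTIONS, key=lambda a: Q[(state, a)])      # exploit

def q_update(state, action, reward, next_state):
    # Q-learning update of Eq. (2).
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

def train(env, episodes=100, steps_per_episode=3600):
    for _ in range(episodes):
        state = env.get_state()                  # observe current state
        for _ in range(steps_per_episode):
            action = choose_action(state)        # choose an action
            env.apply_action(action)             # execute the action
            reward = env.get_reward()            # e.g., negative total queue/delay
            next_state = env.get_state()
            q_update(state, action, reward, next_state)   # update Q-value
            state = next_state                   # repeat until convergence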

The Q-learning formula used in this study is:

    Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]        (2)
where:

  • Q(s, a) is the current Q-value for state s and action a.

  • α is the learning rate.

  • r is the reward.

  • γ is the discount factor.

  • max_{a′} Q(s′, a′) is the maximum estimated future reward for the next state s′.
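
As a purely illustrative numerical check (the values below are chosen for exposition and are not taken from the study): with α = 0.1, γ = 0.9, a current value Q(s, a) = 2.0, an observed reward r = 1, and max_{a′} Q(s′, a′) = 3.0, Eq. (2) gives Q(s, a) ← 2.0 + 0.1 × (1 + 0.9 × 3.0 − 2.0) = 2.17; that is, the estimate moves a fraction α of the way toward the bootstrapped target r + γ max_{a′} Q(s′, a′).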
4.2   Robustness Analysis

The robustness of the proposed algorithm was tested by increasing traffic demand to simulate peak hours. The results indicated that the proposed algorithm outperformed FT and TAC even under higher traffic demands.

5    Conclusions

The proposed Q-learning-based adaptive traffic signal control algorithm significantly improves the operational and safety performance of intersections, particularly under high MPRs. The algorithm provides a robust and effective solution for managing traffic flow in urban environments, offering a foundation for future smart city traffic management systems.
  Future work will focus on further refining the algorithm and testing it in more complex urban environments. Integrating additional data sources and enhancing the learning mechanism are potential areas for improvement.

6    Acknowledgments

The authors gratefully acknowledge the funding support from the JASSO Scholarship and the assistance provided by the Advanced Wireless Communication Research Center (AWCC) at the University of Electro-Communications.