Page 31 - 2024F
P. 31

24                                                                UEC Int’l Mini-Conference No.53








                                  Counting Gapped Repeats in Strings


                                                       1
                                        Leon FLAACK and Takuya MIENO         2
                                  1 UEC Exchange Study Program (JUSST Program)
                                 2 Department of Computer and Network Engineering
                              The University of Electro-Communications, Tokyo, Japan





                                                       Abstract


                   Gapped repeats are strings with the structure uvu. This paper discusses the application of suffix
                trees to the problem of counting the number of such gapped repeats within a given input string S
                of length N. An algorithm that counts all such structures in S within O(N) time is presented. The
                algorithm is based on a previous paper by Popa and Popa [CPM 2021] that solves the similar problem
                of counting gapped palindromes. We propose the use of runs instead of maximal palindromes for a
                key element of our algorithm.

            Keywords: String Algorithms, Gapped Repeats, Suffix Trees, Runs, String Combinatorics

            1    Introduction                                 is denoted by S[i..j]. The length of string S is
                                                              written as |S|. The empty string is named ε. We
            A gapped repeat is a string uvu, where the so-    denote |S| by N. We denote by gapped repeat a
            called arms, u, are repeated around the gap,      string w = uvu, where u and v are arbitrary
            v. Thus, a string like abcdeabc would consti-     strings and |u| > 0. Note that a gapped repeat
            tute a gapped repeat with u = abc and v =         w might be split into uvu in multiple ways. In
                                                                                              ′ ′ ′
            de. In this paper, all such substrings within a   this paper we consider uvu and u v u different
                                                                                  ′
                                                                           ′
            given input string S are counted using the con-   if either u ̸= u , v ̸= v or the positions at which
            struction of the corresponding suffix tree within  the two substrings start are different.
            the two main algorithms, Algorithm 1 and Al-
            gorithm 2. Algorithm 1 uses the suffix tree to      A data structure used in our algorithm is the
            (over)count all occurrences of gapped repeats     suffix tree introduced by Weiner in [5], also de-
            within the input string. This counting is based   scribed in Gusfield [2]. A suffix tree of a string
            on a modified depth-first traversal of the tree   S is a tree T where each edge (u, v) (u is the
            and is based on the algorithm presented in [3].   parent of v) is labeled by a nonempty substring
            Algorithm 2 then computes the number of ex-       S[i..j] of S. In this tree, the concatenation of the
            cessively counted occurrences that do not cor-    labels of all edges on a path from the root to a
            respond to valid gapped repeats. By combining     certain node v forms a string which we denote by
            the results of both algorithms, we obtain the     str(v). A fundamental property of the suffix tree
            true number of gapped repeats within S.           structure is that any two edges (u, v 1 ), (u, v 2 )
                                                              leading to different children of the same node
                                                              are labeled by strings which start with different
            2    Preliminaries                                characters. Thus, two different nodes v 1 and v 2
                                                              of T have the property that str(v 1 ) ̸= str(v 2 ).
            Given a finite alphabet Σ, we define a string     Moreover, it holds that lcp(str(v 1 ), str(v 2 )) =
            S to be an element of Σ . The i-th character      str(lca(v 1 , v 2 )), where lca(v 1 , v 2 ) represents the
                                    ∗
            of S is called S[i]; a substring of S starting at  lowest common ancestor node of the nodes v 1
            position i and ending at position j (inclusive)   and v 2 and lcp(str(v 1 ), str(v 2 )) is the longest
   26   27   28   29   30   31   32   33   34   35   36