Page 31 - 2024F
P. 31
24 UEC Int’l Mini-Conference No.53
Counting Gapped Repeats in Strings
1
Leon FLAACK and Takuya MIENO 2
1 UEC Exchange Study Program (JUSST Program)
2 Department of Computer and Network Engineering
The University of Electro-Communications, Tokyo, Japan
Abstract
Gapped repeats are strings with the structure uvu. This paper discusses the application of suffix
trees to the problem of counting the number of such gapped repeats within a given input string S
of length N. An algorithm that counts all such structures in S within O(N) time is presented. The
algorithm is based on a previous paper by Popa and Popa [CPM 2021] that solves the similar problem
of counting gapped palindromes. We propose the use of runs instead of maximal palindromes for a
key element of our algorithm.
Keywords: String Algorithms, Gapped Repeats, Suffix Trees, Runs, String Combinatorics
1 Introduction is denoted by S[i..j]. The length of string S is
written as |S|. The empty string is named ε. We
A gapped repeat is a string uvu, where the so- denote |S| by N. We denote by gapped repeat a
called arms, u, are repeated around the gap, string w = uvu, where u and v are arbitrary
v. Thus, a string like abcdeabc would consti- strings and |u| > 0. Note that a gapped repeat
tute a gapped repeat with u = abc and v = w might be split into uvu in multiple ways. In
′ ′ ′
de. In this paper, all such substrings within a this paper we consider uvu and u v u different
′
′
given input string S are counted using the con- if either u ̸= u , v ̸= v or the positions at which
struction of the corresponding suffix tree within the two substrings start are different.
the two main algorithms, Algorithm 1 and Al-
gorithm 2. Algorithm 1 uses the suffix tree to A data structure used in our algorithm is the
(over)count all occurrences of gapped repeats suffix tree introduced by Weiner in [5], also de-
within the input string. This counting is based scribed in Gusfield [2]. A suffix tree of a string
on a modified depth-first traversal of the tree S is a tree T where each edge (u, v) (u is the
and is based on the algorithm presented in [3]. parent of v) is labeled by a nonempty substring
Algorithm 2 then computes the number of ex- S[i..j] of S. In this tree, the concatenation of the
cessively counted occurrences that do not cor- labels of all edges on a path from the root to a
respond to valid gapped repeats. By combining certain node v forms a string which we denote by
the results of both algorithms, we obtain the str(v). A fundamental property of the suffix tree
true number of gapped repeats within S. structure is that any two edges (u, v 1 ), (u, v 2 )
leading to different children of the same node
are labeled by strings which start with different
2 Preliminaries characters. Thus, two different nodes v 1 and v 2
of T have the property that str(v 1 ) ̸= str(v 2 ).
Given a finite alphabet Σ, we define a string Moreover, it holds that lcp(str(v 1 ), str(v 2 )) =
S to be an element of Σ . The i-th character str(lca(v 1 , v 2 )), where lca(v 1 , v 2 ) represents the
∗
of S is called S[i]; a substring of S starting at lowest common ancestor node of the nodes v 1
position i and ending at position j (inclusive) and v 2 and lcp(str(v 1 ), str(v 2 )) is the longest